Intermediate
Evaluating Time Series Forecasts
Learn the right metrics and cross-validation strategies to reliably evaluate forecast quality without data leakage.
Forecast Metrics
| Metric | Full Name | Pros | Cons |
|---|---|---|---|
| MAE | Mean Absolute Error | Easy to interpret, same units as data | Not scale-independent |
| RMSE | Root Mean Squared Error | Penalizes large errors more heavily | Not scale-independent |
| MAPE | Mean Absolute Percentage Error | Scale-independent, intuitive (%) | Undefined when actual = 0, asymmetric (penalizes over-forecasts more) |
| SMAPE | Symmetric MAPE | Bounded (0–200%), scale-independent | Not truly symmetric despite the name |
| MASE | Mean Absolute Scaled Error | Scale-independent, handles zeros | Requires a naive forecast baseline |
Python — Computing forecast metrics
```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error: average error magnitude, in data units."""
    return np.mean(np.abs(actual - predicted))

def rmse(actual, predicted):
    """Root Mean Squared Error: penalizes large errors more heavily."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mape(actual, predicted):
    """Mean Absolute Percentage Error: skips points where actual == 0."""
    mask = actual != 0
    return np.mean(np.abs((actual[mask] - predicted[mask]) / actual[mask])) * 100

def smape(actual, predicted):
    """Symmetric MAPE: bounded between 0% and 200%."""
    denominator = (np.abs(actual) + np.abs(predicted)) / 2
    mask = denominator != 0
    return np.mean(np.abs(actual[mask] - predicted[mask]) / denominator[mask]) * 100

def mase(actual, predicted, training_series, season=1):
    """MASE: scales errors by the in-sample naive seasonal forecast error."""
    # Use lag-`season` differences. np.diff(x, n=season) would apply
    # first differencing `season` times, which is wrong for season > 1.
    naive_errors = np.abs(training_series[season:] - training_series[:-season])
    scale = np.mean(naive_errors)
    return np.mean(np.abs(actual - predicted)) / scale

# Usage (y_test, y_pred: held-out actuals and model forecasts, assumed defined)
print(f"MAE: {mae(y_test, y_pred):.2f}")
print(f"RMSE: {rmse(y_test, y_pred):.2f}")
print(f"MAPE: {mape(y_test, y_pred):.2f}%")
```
Time Series Cross-Validation
Standard k-fold cross-validation shuffles observations, so the model trains on data that comes after its validation points — temporal leakage that inflates scores. Use expanding-window (walk-forward) or sliding-window validation instead.
Python — Walk-forward validation
```python
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# TimeSeriesSplit produces expanding windows: each fold trains on all
# data before the validation slice, so no future data leaks into training.
# X, y: feature matrix and target, aligned in time (assumed defined).
tscv = TimeSeriesSplit(n_splits=5)
scores = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    model.fit(X_train, y_train)  # model: any estimator with fit/predict
    y_pred = model.predict(X_val)
    score = rmse(y_val, y_pred)
    scores.append(score)
    print(f"Fold {fold+1}: RMSE = {score:.4f}")
print(f"Mean RMSE: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```
Sliding Window Validation
Python — Fixed-size sliding window
```python
def sliding_window_cv(X, y, train_size, test_size, step=1):
    """Sliding window with fixed training size."""
    splits = []
    for start in range(0, len(X) - train_size - test_size + 1, step):
        train_end = start + train_size
        test_end = train_end + test_size
        train_idx = list(range(start, train_end))
        test_idx = list(range(train_end, test_end))
        splits.append((train_idx, test_idx))
    return splits

# Use 365 days for training, forecast 30 days, advancing 30 days per fold
splits = sliding_window_cv(X, y, train_size=365, test_size=30, step=30)
```
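A minimal evaluation loop over these splits might look like the following, reusing the `rmse` helper from above and assuming `model`, `X`, and `y` as before:

```python
# Evaluate each sliding-window fold and aggregate the scores
window_scores = []
for train_idx, test_idx in splits:
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    window_scores.append(rmse(y[test_idx], y_pred))
print(f"Sliding-window RMSE: {np.mean(window_scores):.4f}")
```

Because old observations drop out of the training window, this scheme adapts better than the expanding window when the series drifts over time.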
Baseline Models
Always compare your model against simple baselines:
- Naive forecast: Predict the last observed value (ŷ_{t+1} = y_t).
- Seasonal naive: Predict the value from the same period last season (ŷ_{t+1} = y_{t+1−s}).
- Mean forecast: Predict the historical mean of the training series.
- Drift forecast: Extrapolate the line through the first and last training observations.
If your model can't beat the naive baseline, it isn't adding value. The seasonal naive forecast (the same value as last week or last year) is a surprisingly strong baseline for many business time series. Always report baseline comparisons in your evaluations; a minimal sketch of these baselines follows.
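All four baselines fit in a few lines each. This is a sketch, not a definitive implementation: the function names, the horizon `h`, and the `season=7` value are illustrative assumptions, and `y_train`/`y_test` are assumed as in the earlier examples.

```python
import numpy as np

def naive_forecast(train, h):
    # Repeat the last observed value h steps ahead
    return np.full(h, train[-1])

def seasonal_naive_forecast(train, h, season):
    # Repeat the last full season of observations cyclically
    last_season = train[-season:]
    return np.array([last_season[i % season] for i in range(h)])

def mean_forecast(train, h):
    # Predict the historical mean at every horizon
    return np.full(h, np.mean(train))

def drift_forecast(train, h):
    # Extrapolate the line through the first and last observations
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return train[-1] + slope * np.arange(1, h + 1)

# Compare every baseline against the model on the same test window
h = len(y_test)
for name, fc in [("naive", naive_forecast(y_train, h)),
                 ("seasonal naive", seasonal_naive_forecast(y_train, h, 7)),
                 ("mean", mean_forecast(y_train, h)),
                 ("drift", drift_forecast(y_train, h))]:
    print(f"{name}: RMSE = {rmse(y_test, fc):.4f}")
```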