Intermediate

Evaluating Time Series Forecasts

Learn the right metrics and cross-validation strategies to reliably evaluate forecast quality without data leakage.

Forecast Metrics

Metric | Full name | Pros | Cons
MAE | Mean Absolute Error | Easy to interpret, same units as the data | Not scale-independent
RMSE | Root Mean Squared Error | Penalizes large errors more heavily | Not scale-independent
MAPE | Mean Absolute Percentage Error | Scale-independent, intuitive (%) | Undefined when actual = 0, asymmetric
SMAPE | Symmetric MAPE | Bounded, scale-independent | Not truly symmetric despite the name
MASE | Mean Absolute Scaled Error | Scale-independent, handles zeros | Requires a naive forecast baseline for scaling
Python — Computing forecast metrics
import numpy as np

def mae(actual, predicted):
    return np.mean(np.abs(actual - predicted))

def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mape(actual, predicted):
    mask = actual != 0
    return np.mean(np.abs((actual[mask] - predicted[mask]) / actual[mask])) * 100

def smape(actual, predicted):
    denominator = (np.abs(actual) + np.abs(predicted)) / 2
    mask = denominator != 0
    return np.mean(np.abs(actual[mask] - predicted[mask]) / denominator[mask]) * 100

def mase(actual, predicted, training_series, season=1):
    """MASE: scales the MAE by the in-sample MAE of a (seasonal) naive forecast."""
    # Lag-`season` differences of the training data = errors of the seasonal naive forecast
    naive_errors = np.abs(training_series[season:] - training_series[:-season])
    scale = np.mean(naive_errors)
    return np.mean(np.abs(actual - predicted)) / scale

# Usage: y_train, y_test, y_pred are NumPy arrays from a chronological train/test split
print(f"MAE:   {mae(y_test, y_pred):.2f}")
print(f"RMSE:  {rmse(y_test, y_pred):.2f}")
print(f"MAPE:  {mape(y_test, y_pred):.2f}%")
print(f"SMAPE: {smape(y_test, y_pred):.2f}%")
print(f"MASE:  {mase(y_test, y_pred, y_train):.2f}")  # < 1 means better than the naive forecast

Time Series Cross-Validation

Standard k-fold cross-validation ignores temporal ordering: folds can be trained on observations that occur after the ones they are validated on, leaking future information into the model. Use expanding-window (walk-forward) or sliding-window validation instead, so that every training set strictly precedes its validation set.

Python — Walk-forward validation
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# TimeSeriesSplit implements an expanding window: each fold trains on every
# observation before its validation block, so no future data leaks into training.
# X and y must already be in chronological order.
tscv = TimeSeriesSplit(n_splits=5)
model = Ridge()  # stand-in; any sklearn-style regressor works here

scores = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    score = rmse(y_val, y_pred)
    scores.append(score)
    print(f"Fold {fold+1}: RMSE = {score:.4f}")

print(f"Mean RMSE: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")

Sliding Window Validation

Python — Fixed-size sliding window
def sliding_window_cv(X, y, train_size, test_size, step=1):
    """Sliding window with fixed training size."""
    splits = []
    for start in range(0, len(X) - train_size - test_size + 1, step):
        train_end = start + train_size
        test_end = train_end + test_size
        train_idx = list(range(start, train_end))
        test_idx = list(range(train_end, test_end))
        splits.append((train_idx, test_idx))
    return splits

# Use 365 days for training, forecast 30 days
splits = sliding_window_cv(X, y, train_size=365, test_size=30, step=30)
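
These splits plug into the same walk-forward loop used above. A minimal sketch, reusing the rmse helper and the generic model from the previous example (the variable names are assumptions carried over from those snippets, not from a specific library):

Python — Evaluating the sliding-window splits
# Fit and score the model once per window, exactly as in the expanding-window loop
scores = []
for fold, (train_idx, test_idx) in enumerate(splits):
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    scores.append(rmse(y[test_idx], y_pred))
    print(f"Window {fold+1}: RMSE = {scores[-1]:.4f}")

print(f"Mean RMSE over {len(scores)} windows: {np.mean(scores):.4f}")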

Baseline Models

Always compare your model against simple baselines:

  • Naive forecast: Predict the last observed value (ŷ_{t+1} = y_t).
  • Seasonal naive: Predict the value from the same period one season ago (ŷ_{t+1} = y_{t+1-s}).
  • Mean forecast: Predict the historical mean of the training data.
  • Drift forecast: Extrapolate the straight line through the first and last observations.
If your model can't beat the naive baseline, it's not useful. The seasonal naive forecast (same value as last week/year) is a surprisingly strong baseline for many business time series. Always report baseline comparisons in your evaluations.
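
The four baselines above take only a few lines of NumPy. A minimal sketch (the function names, horizon argument, and season=7 are illustrative assumptions; it reuses the rmse helper and the y_train/y_test/y_pred arrays from earlier):

Python — Simple baseline forecasts
import numpy as np

def naive_forecast(history, horizon):
    """Repeat the last observed value."""
    return np.repeat(history[-1], horizon)

def seasonal_naive_forecast(history, horizon, season):
    """Cycle the last full season of observations."""
    return np.resize(history[-season:], horizon)

def mean_forecast(history, horizon):
    """Repeat the historical mean."""
    return np.repeat(np.mean(history), horizon)

def drift_forecast(history, horizon):
    """Extrapolate the line through the first and last observations."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + slope * np.arange(1, horizon + 1)

# Report the model alongside its baselines on the same test window
h = len(y_test)
print(f"Model RMSE:          {rmse(y_test, y_pred):.2f}")
print(f"Naive RMSE:          {rmse(y_test, naive_forecast(y_train, h)):.2f}")
print(f"Seasonal naive RMSE: {rmse(y_test, seasonal_naive_forecast(y_train, h, season=7)):.2f}")
print(f"Drift RMSE:          {rmse(y_test, drift_forecast(y_train, h)):.2f}")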