Best Practices & Checklists

This final lesson consolidates everything into actionable checklists by model type, compares the leading monitoring tools, and answers the most common questions teams have when building ML monitoring systems.

Monitoring Checklist by Model Type

Classification Models (Fraud, Spam, Moderation)

# Classification Model Monitoring Checklist
CLASSIFICATION_CHECKLIST = {
    "pre_deployment": [
        "Record baseline metrics (accuracy, precision, recall, F1, AUC)",
        "Store reference data distribution for all features",
        "Define prediction distribution baseline (expected positive rate)",
        "Set up ground truth pipeline with expected delay",
        "Identify proxy metrics for the ground truth delay period",
        "Create runbook with rollback procedure",
        "Configure alert rules (see below)",
        "Build dashboard with 4-pillar coverage",
    ],
    "critical_alerts": [
        "Error rate > 1% (P0)",
        "Prediction collapse: single class > 95% of predictions (P1)",
        "Prediction volume drops > 50% vs same time yesterday (P1)",
        "P99 latency > 5x baseline (P1)",
        "All features null / missing (P0)",
    ],
    "warning_alerts": [
        "Accuracy drops > 5% from baseline (P2)",
        "PSI > 0.2 on any top-10 important feature (P2)",
        "Positive prediction rate shifts > 3x from baseline (P2)",
        "Feature null rate > 5% (P2)",
        "Ground truth pipeline delayed > 2x expected (P3)",
    ],
    "ongoing_monitoring": [
        "Daily: review drift dashboard, check accuracy metrics",
        "Weekly: review alert-to-action ratio, tune noisy alerts",
        "Monthly: compare model vs baseline, evaluate retrain need",
        "Quarterly: review full monitoring coverage, add new metrics",
    ]
}
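
The "prediction collapse" alert above can be sketched as a simple check over a recent window of predictions. This is a minimal illustration, not a library API; the 95% threshold is the value from the checklist.

```python
from collections import Counter

def check_prediction_collapse(predictions: list, threshold: float = 0.95) -> dict:
    """Flag when a single class dominates recent predictions.

    A healthy classifier's output mix shifts slowly; > 95% in one class
    usually means a broken upstream feature or a degenerate model.
    """
    if not predictions:
        return {"collapsed": False, "reason": "no predictions in window"}
    top_class, top_count = Counter(predictions).most_common(1)[0]
    share = top_count / len(predictions)
    return {
        "collapsed": share > threshold,
        "top_class": top_class,
        "top_class_share": round(share, 3),
    }

# A window where 98% of predictions are one class trips the P1 alert
window = ["not_fraud"] * 98 + ["fraud"] * 2
print(check_prediction_collapse(window))
```

In production you would run this on a sliding window (e.g. the last hour of predictions) rather than a fixed list.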

Regression Models (Pricing, Demand Forecasting, Risk Scoring)

REGRESSION_CHECKLIST = {
    "pre_deployment": [
        "Record baseline MAE, RMSE, MAPE, R-squared",
        "Store prediction distribution (mean, std, percentiles)",
        "Define residual distribution baseline",
        "Set up calibration monitoring (predicted vs actual)",
        "Identify business-relevant error thresholds",
    ],
    "critical_alerts": [
        "Prediction mean shifts > 3 std from baseline (P1)",
        "Prediction std collapses to near-zero (P1, model outputting constant)",
        "MAE/RMSE increases > 50% from baseline (P1)",
        "All predictions in extreme range (top/bottom 5%) (P1)",
    ],
    "warning_alerts": [
        "Feature drift PSI > 0.2 on important features (P2)",
        "Prediction distribution shift (KS test p < 0.01) (P2)",
        "Calibration error increases (predicted vs actual gap) (P2)",
        "Residual distribution shifts (may indicate concept drift) (P3)",
    ]
}
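
The two critical regression alerts (mean shift > 3 std, std collapse) can be checked with the stdlib alone. A minimal sketch; the `min_std_ratio` of 0.05 is an illustrative choice, not from the checklist.

```python
import statistics

def check_regression_predictions(preds: list, baseline_mean: float,
                                 baseline_std: float,
                                 mean_shift_sigmas: float = 3.0,
                                 min_std_ratio: float = 0.05) -> dict:
    """Detect a large prediction mean shift and variance collapse."""
    cur_mean = statistics.fmean(preds)
    cur_std = statistics.pstdev(preds)
    mean_shift = abs(cur_mean - baseline_mean) / baseline_std
    return {
        "mean_shift_alert": mean_shift > mean_shift_sigmas,
        "std_collapse_alert": cur_std < min_std_ratio * baseline_std,
        "mean_shift_sigmas": round(mean_shift, 2),
    }

# A model stuck outputting a near-constant value: std collapse fires
print(check_regression_predictions(
    [100.0, 100.1, 99.9], baseline_mean=100.0, baseline_std=25.0))
```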

LLM Applications (Chatbots, RAG, Agents)

LLM_CHECKLIST = {
    "pre_deployment": [
        "Set daily/hourly cost budgets with auto-fallback model",
        "Define quality evaluation criteria (relevance, accuracy, format)",
        "Configure guardrails (content filters, PII detection, topic bounds)",
        "Set up user feedback collection (thumbs up/down, ratings)",
        "Create prompt versioning and performance tracking",
        "Define SLOs for latency (TTFT, total response time)",
        "Implement semantic caching for cost reduction",
    ],
    "critical_alerts": [
        "API error rate > 5% (P1)",
        "Daily cost exceeds budget (P1)",
        "TTFT > 5s for > 10% of requests (P1)",
        "Guardrail trigger rate spikes > 3x baseline (P1)",
        "Provider outage detected (P0 if no fallback)",
    ],
    "warning_alerts": [
        "Cost per query increases > 2x (P2)",
        "User satisfaction drops below 70% (P2)",
        "Token efficiency ratio degrades (output/input) (P3)",
        "New prompt template performs worse than previous (P2)",
        "Hallucination rate increases (via grounding checks) (P2)",
    ],
    "ongoing_monitoring": [
        "Daily: review cost report, check quality scores",
        "Weekly: compare prompt template performance, review user feedback",
        "Monthly: evaluate model provider pricing changes, review safety logs",
        "Quarterly: benchmark against newer models, audit guardrail coverage",
    ]
}
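
The "daily cost exceeds budget" alert pairs naturally with the auto-fallback item from pre-deployment. A sketch of that check; the 0.8 fallback threshold is an illustrative assumption, not from the checklist.

```python
def check_llm_budget(spent_today_usd: float, daily_budget_usd: float,
                     fallback_threshold: float = 0.8) -> dict:
    """Budget check behind the cost alert.

    Before the hard limit is hit, route traffic to a cheaper fallback
    model; past the limit, raise the P1 alert.
    """
    usage = spent_today_usd / daily_budget_usd
    return {
        "use_fallback_model": usage >= fallback_threshold,
        "budget_exceeded_p1": usage >= 1.0,
        "budget_used_pct": round(usage * 100, 1),
    }

# At 85% of budget: switch to the fallback model before the P1 fires
print(check_llm_budget(spent_today_usd=85.0, daily_budget_usd=100.0))
```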

Tool Comparison

| Tool | Type | Best For | Pricing | Key Strength |
| --- | --- | --- | --- | --- |
| Evidently AI | Open source + cloud | Data drift, model quality reports | Free (OSS) / paid cloud | Beautiful HTML reports, easy to start, great for drift |
| WhyLabs | SaaS | Continuous monitoring at scale | Free tier + enterprise | whylogs profiling (lightweight), anomaly detection |
| Arize AI | SaaS | Root cause analysis, embeddings monitoring | Free tier + enterprise | Embedding drift, slice-based analysis, LLM tracing |
| Fiddler AI | SaaS | Explainability + monitoring | Enterprise | Model explainability built into monitoring |
| Prometheus + Grafana | Open source | Infrastructure + custom ML metrics | Free | Industry standard, massive ecosystem, highly customizable |
| Custom (Python + DB) | Build your own | Full control, specific requirements | Engineering time | Exactly what you need, no vendor lock-in |

When to Use What

# Decision framework for choosing monitoring tools

def recommend_monitoring_stack(
    team_size: int,
    num_models: int,
    budget: str,       # "zero", "low", "medium", "high"
    model_types: list  # ["classification", "regression", "llm", "embedding"]
) -> dict:
    """Recommend a monitoring stack based on team constraints."""

    recommendations = {
        "infrastructure": "Prometheus + Grafana (always)",
        "reasoning": []
    }

    if budget == "zero":
        recommendations["drift_monitoring"] = "Evidently AI (open source)"
        recommendations["dashboards"] = "Grafana"
        recommendations["alerting"] = "Grafana Alerting or Alertmanager"
        recommendations["reasoning"].append(
            "Zero budget: use open source stack. "
            "Evidently for drift reports, Prometheus+Grafana for everything else."
        )
    elif budget == "low" and num_models <= 5:
        recommendations["drift_monitoring"] = "WhyLabs (free tier)"
        recommendations["dashboards"] = "WhyLabs + Grafana"
        recommendations["alerting"] = "WhyLabs alerts + PagerDuty free"
        recommendations["reasoning"].append(
            "Low budget with few models: WhyLabs free tier covers basics."
        )
    elif "llm" in model_types:
        recommendations["llm_monitoring"] = "Arize AI or LangSmith"
        recommendations["drift_monitoring"] = "Evidently or Arize"
        recommendations["reasoning"].append(
            "LLM workloads benefit from Arize's embedding drift "
            "and LLM tracing capabilities."
        )
    else:
        recommendations["drift_monitoring"] = "Evidently Cloud or Arize"
        recommendations["dashboards"] = "Grafana + tool dashboards"
        recommendations["reasoning"].append(
            "Medium/high budget: use a dedicated ML monitoring platform "
            "for drift + performance, keep Grafana for infrastructure."
        )

    if team_size <= 3:
        recommendations["reasoning"].append(
            "Small team: minimize custom code. "
            "Use managed tools to avoid maintenance burden."
        )
    elif team_size >= 10:
        recommendations["reasoning"].append(
            "Larger team: consider building custom monitoring "
            "for critical paths where vendor tools don't fit."
        )

    return recommendations


# Example
stack = recommend_monitoring_stack(
    team_size=5,
    num_models=8,
    budget="low",
    model_types=["classification", "llm"]
)
for k, v in stack.items():
    print(f"  {k}: {v}")

Frequently Asked Questions

How often should I check for data drift?

It depends on your data velocity and risk tolerance. For real-time systems (fraud, recommendations), check hourly. For batch systems (daily scoring), check daily before each batch run. For low-risk systems, weekly may be sufficient. Start with hourly checks and reduce frequency if you find drift is rare and the computational cost is a concern.
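
The PSI > 0.2 threshold used throughout these checklists can be computed with the stdlib alone. A minimal, dependency-free sketch (the bin count, the 1e-6 epsilon for empty bins, and the below-minimum handling are illustrative choices; tools like Evidently compute this for you):

```python
import math

def psi(reference: list, production: list, bins: int = 10) -> float:
    """Population Stability Index between two numeric samples.

    Bins are taken from the reference range. Common rule of thumb:
    < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant drift.
    """
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] = float("inf")  # catch production values above the reference max

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
            else:
                counts[0] += 1  # below the reference min: lump into first bin
        # epsilon avoids log(0) for empty bins
        return [(c or 1e-6) / len(sample) for c in counts]

    ref_f, prod_f = fractions(reference), fractions(production)
    return sum((r - p) * math.log(r / p) for r, p in zip(ref_f, prod_f))

# Identical samples give PSI ~ 0; a shifted sample gives a large PSI
ref = [i / 100 for i in range(100)]
print(round(psi(ref, ref), 4))
print(round(psi(ref, [x + 0.5 for x in ref]), 2))
```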

What's the minimum monitoring I need before deploying a model?

At absolute minimum: (1) error rate and latency tracking, (2) prediction volume monitoring, (3) prediction distribution monitoring (to catch prediction collapse), and (4) a way to roll back quickly. These four catch the most critical failures. Add drift detection, performance tracking, and business metrics as your monitoring matures.
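
Those four minimums fit in one small in-process tracker. A sketch under assumed defaults (the 1,000-record window is illustrative):

```python
from collections import deque

class MinimalModelMonitor:
    """Tracks the four minimums: errors, latency, volume, prediction mix."""

    def __init__(self, window: int = 1000):
        self.latencies = deque(maxlen=window)
        self.errors = deque(maxlen=window)       # True = request errored
        self.predictions = deque(maxlen=window)

    def record(self, prediction, latency_ms: float, is_error: bool = False):
        self.predictions.append(prediction)
        self.latencies.append(latency_ms)
        self.errors.append(is_error)

    def snapshot(self) -> dict:
        n = len(self.errors)
        if n == 0:
            return {"volume": 0}
        top_share = max(self.predictions.count(p)
                        for p in set(self.predictions)) / n
        return {
            "volume": n,
            "error_rate": sum(self.errors) / n,
            "p99_latency_ms": sorted(self.latencies)[int(0.99 * (n - 1))],
            "top_prediction_share": round(top_share, 3),
        }

monitor = MinimalModelMonitor()
for i in range(100):
    monitor.record("approve" if i < 90 else "deny",
                   latency_ms=10.0 + i, is_error=(i < 2))
print(monitor.snapshot())
```

Each snapshot field maps to one of the four minimums; the fourth minimum (rollback) is a deployment capability, not a metric.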

Should I use a managed monitoring tool or build my own?

Start with managed tools (Evidently, WhyLabs, Arize) unless you have very specific requirements. Building custom monitoring is expensive in engineering time and maintenance. A good rule of thumb: if you have fewer than 5 ML engineers, use managed tools. If you have 10+, consider building custom components for your most critical monitoring needs while using managed tools for the rest.

How do I monitor fairness in production?

Track model performance across demographic slices (when legally permitted and with appropriate privacy controls). Monitor prediction rates, error rates, and outcomes by group. Use disparate impact ratio (should be between 0.8 and 1.25) and equalized odds as key metrics. Tools like Evidently and Fiddler have built-in fairness monitoring. Review fairness metrics monthly and after any model update.
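
The disparate impact check is a one-liner once you have per-group positive rates. A sketch using the FAQ's [0.8, 1.25] bounds; the group names and rates are hypothetical:

```python
def disparate_impact(positive_rates: dict, reference_group: str) -> dict:
    """Each group's positive prediction rate divided by a reference
    group's rate; flag ratios outside [0.8, 1.25]."""
    ref = positive_rates[reference_group]
    return {
        group: {"ratio": round(rate / ref, 3),
                "within_bounds": 0.8 <= rate / ref <= 1.25}
        for group, rate in positive_rates.items()
    }

# Group "b" receives positive predictions at 73% of the reference rate
print(disparate_impact({"a": 0.30, "b": 0.22, "c": 0.33},
                       reference_group="a"))
```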

What's the best way to handle monitoring for models with very delayed ground truth?

Use a three-layer approach: (1) Immediate: monitor data quality, prediction distribution, and proxy metrics. (2) Short-term: use sampled human evaluation on a small percentage of predictions (e.g., 1%). (3) Long-term: compute actual performance metrics when ground truth arrives and compare against the proxy signals. Over time, validate that your proxy metrics correlate with actual performance.

How do I convince my team/manager to invest in ML monitoring?

Frame it in terms of risk and cost: (1) Calculate the cost of a silent ML failure running for 1 day, 1 week, 1 month (lost revenue, bad decisions, customer impact). (2) Reference real incidents (Zillow lost $569M partly due to insufficient model monitoring). (3) Start small with the minimum monitoring checklist and expand. (4) Show that monitoring catches issues earlier, reducing incident response time and costly retraining.
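
Step (1) is back-of-envelope arithmetic you can put in front of a manager. A sketch; the revenue and degradation figures are placeholders you replace with your own system's numbers:

```python
def silent_failure_cost(daily_revenue_at_risk: float,
                        degradation_pct: float) -> dict:
    """Estimated cost of an undetected model failure over time.

    daily_revenue_at_risk: revenue influenced by the model per day.
    degradation_pct: fraction of that revenue lost while broken.
    """
    daily = daily_revenue_at_risk * degradation_pct
    return {"1_day": daily, "1_week": daily * 7, "1_month": daily * 30}

# e.g. $50k/day flows through the model, 10% degradation while broken
print(silent_failure_cost(50_000, 0.10))
```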

How do I monitor an ensemble of models?

Monitor each component model individually AND monitor the ensemble output. Track agreement rate between component models (sudden disagreement may signal an issue with one model). Monitor the weighting/routing logic. For voting ensembles, track the confidence margin. For stacked models, monitor the meta-learner separately from base models.
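
The agreement-rate idea can be sketched as pairwise comparison over a shared batch of inputs (model names and predictions below are hypothetical):

```python
from itertools import combinations

def ensemble_agreement(model_predictions: dict) -> dict:
    """Pairwise agreement rate between component models on the same inputs.

    A sudden drop in agreement for one pair usually isolates the
    misbehaving component model.
    """
    pair_rates = {}
    for (name_a, preds_a), (name_b, preds_b) in combinations(
            model_predictions.items(), 2):
        matches = sum(a == b for a, b in zip(preds_a, preds_b))
        pair_rates[f"{name_a}~{name_b}"] = matches / len(preds_a)
    return pair_rates

print(ensemble_agreement({
    "model_a": [1, 1, 0, 1],
    "model_b": [1, 1, 0, 0],
    "model_c": [1, 0, 0, 1],
}))
```

Track each pair's rate as a time series; alert when one pair's rate drops sharply while the others hold steady.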

What Prometheus metric types should I use for ML?

Counter: prediction count, error count, ground truth joins. Histogram: latency, prediction scores, feature values. Gauge: accuracy, drift scores, active model version, cost, GPU utilization. Summary: use sparingly; histograms are more flexible. Name metrics with the prefix ml_ to distinguish from application metrics.

Final Production Checklist

# Complete pre-deployment monitoring checklist generator
def generate_full_checklist(model_name: str, model_type: str,
                             has_llm: bool = False) -> str:
    """Generate a complete monitoring checklist for any model."""
    lines = [
        f"=== MONITORING CHECKLIST: {model_name} ({model_type}) ===",
        "",
        "## BEFORE DEPLOYMENT",
        "[ ] Baseline metrics recorded and stored",
        "[ ] Reference data distribution saved for drift comparison",
        "[ ] Prediction distribution baseline captured",
        "[ ] Ground truth pipeline tested end-to-end",
        "[ ] Proxy metrics identified and validated",
        "[ ] Rollback procedure documented and tested",
        "[ ] Runbook created with triage steps",
        "",
        "## ALERTS CONFIGURED",
        "[ ] P0: Model serving health (error rate, availability)",
        "[ ] P1: Prediction collapse detection",
        "[ ] P1: Prediction volume anomaly",
        "[ ] P1: Latency spike detection",
        "[ ] P2: Data drift (PSI > 0.2)",
        "[ ] P2: Performance degradation (accuracy drop > 5%)",
        "[ ] P3: Feature null rate increase",
        "[ ] Alert routing tested (Slack, PagerDuty)",
        "[ ] Alert cooldown periods configured",
        "",
        "## DASHBOARDS",
        "[ ] On-call dashboard (real-time health)",
        "[ ] ML team dashboard (performance + drift)",
        "[ ] Executive dashboard (business impact + cost)",
        "",
        "## ONGOING",
        "[ ] Daily monitoring review process defined",
        "[ ] Weekly alert tuning review scheduled",
        "[ ] Monthly model performance review scheduled",
        "[ ] Quarterly monitoring coverage audit scheduled",
    ]

    if has_llm:
        lines.extend([
            "",
            "## LLM-SPECIFIC",
            "[ ] Cost budgets and auto-fallback configured",
            "[ ] Token usage tracking enabled",
            "[ ] Guardrail monitoring configured",
            "[ ] Quality scoring pipeline set up",
            "[ ] User feedback collection enabled",
            "[ ] Prompt versioning and tracking in place",
            "[ ] Semantic caching configured (if applicable)",
        ])

    lines.extend([
        "",
        "## SIGN-OFF",
        "[ ] ML engineer reviewed and approved",
        "[ ] On-call engineer reviewed runbook",
        "[ ] Product owner aware of monitoring coverage",
        f"[ ] Date: ________  Approved by: ________",
    ])

    return "\n".join(lines)


print(generate_full_checklist("fraud-detector-v2", "classification"))
print()
print(generate_full_checklist("support-chatbot", "llm", has_llm=True))

Remember: The goal of monitoring is not to collect metrics — it's to detect problems before your users do. Start with the minimum viable monitoring (error rate, prediction volume, prediction distribution), then expand based on what actually causes incidents in your system. Every alert should have a clear action. If you don't know what to do when an alert fires, either write a runbook or remove the alert.