ML Pipeline Metrics (Intermediate)

Custom metrics provide the quantitative foundation for ML pipeline observability. Unlike logs and traces, which are primarily debugging tools, metrics enable alerting, trend analysis, and SLO tracking. This lesson covers designing and implementing Prometheus metrics for every stage of your ML pipeline.

Essential Pipeline Metrics

Python
from prometheus_client import Counter, Histogram, Gauge

# Pipeline execution metrics
pipeline_runs_total = Counter('ml_pipeline_runs_total', 'Total pipeline runs', ['pipeline', 'status'])
pipeline_duration = Histogram('ml_pipeline_duration_seconds', 'Pipeline duration', ['pipeline', 'stage'])
pipeline_data_records = Counter('ml_pipeline_records_processed_total', 'Records processed', ['pipeline', 'stage'])

# Model quality metrics
model_accuracy = Gauge('ml_model_accuracy', 'Model accuracy score', ['model', 'version'])
# Note: an 'epoch' label creates one time series per epoch; keep epoch counts bounded
model_training_loss = Gauge('ml_training_loss', 'Training loss', ['model', 'epoch'])
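
The metrics above are only useful once pipeline code actually records into them. Below is a minimal sketch of instrumenting one stage; `run_stage` is a hypothetical helper, and the metrics are redeclared so the sketch is self-contained (in real code you would import the shared definitions instead).

```python
import time

from prometheus_client import Counter, Histogram

# Redeclared here so the sketch stands alone; reuse the shared
# definitions from above in a real codebase.
pipeline_runs_total = Counter('ml_pipeline_runs_total', 'Total pipeline runs', ['pipeline', 'status'])
pipeline_duration = Histogram('ml_pipeline_duration_seconds', 'Pipeline duration', ['pipeline', 'stage'])

def run_stage(pipeline: str, stage: str, fn):
    """Run one stage, recording its duration and success/failure outcome."""
    start = time.monotonic()
    try:
        result = fn()
        pipeline_runs_total.labels(pipeline=pipeline, status='success').inc()
        return result
    except Exception:
        pipeline_runs_total.labels(pipeline=pipeline, status='failure').inc()
        raise
    finally:
        # Record duration whether the stage succeeded or failed.
        pipeline_duration.labels(pipeline=pipeline, stage=stage).observe(time.monotonic() - start)
```

Incrementing the counter in both branches (with a `status` label) is what later lets you compute a completion rate as successes over total runs.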

Metric Types for ML

| Metric Type | ML Use Cases | Example |
| --- | --- | --- |
| Counter | Pipeline runs, records processed, errors | ml_pipeline_runs_total |
| Gauge | Model accuracy, loss, active GPUs, queue depth | ml_model_accuracy |
| Histogram | Pipeline stage duration, inference latency | ml_pipeline_duration_seconds |
| Summary | Batch processing times, feature computation | ml_batch_processing_seconds |
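
One practical wrinkle with histograms for pipelines: the client library's default buckets top out around 10 seconds, far too small for stages that run for minutes or hours. A sketch with hour-scale buckets (the metric name and bucket boundaries here are illustrative, not prescribed):

```python
from prometheus_client import Histogram

# Explicit buckets sized for long-running pipeline stages; the
# defaults (up to ~10 s) would lump every stage into the +Inf bucket.
stage_duration = Histogram(
    'ml_stage_duration_seconds',
    'Duration of a pipeline stage',
    ['stage'],
    buckets=(60, 300, 900, 1800, 3600, 7200, float('inf')),
)

# Record a 20-minute feature-engineering stage.
stage_duration.labels(stage='feature_engineering').observe(1200)
```

A reason to prefer Histogram over Summary for latency SLOs: histogram buckets can be aggregated across instances and quantiled server-side with `histogram_quantile()`, while Summary quantiles are computed per-instance and cannot be meaningfully aggregated.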

Pipeline SLO Metrics

Define Service Level Objectives for your ML pipelines:

  • Pipeline completion rate — 99% of pipeline runs should complete successfully
  • Pipeline duration — 95% of pipeline runs should complete within 2 hours
  • Data freshness — Feature store data should be updated within 1 hour of source data arrival
  • Model quality gate — Model accuracy should not drop more than 2% from the baseline
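
In production these SLOs would typically be evaluated as PromQL queries over the metrics above, but the underlying arithmetic is simple. A minimal sketch of the completion-rate and quality-gate checks (function names and thresholds are illustrative):

```python
def completion_rate_ok(successes: int, total: int, target: float = 0.99) -> bool:
    """Completion-rate SLO: at least `target` of runs succeed."""
    return total > 0 and successes / total >= target

def quality_gate_ok(accuracy: float, baseline: float, max_drop: float = 0.02) -> bool:
    """Quality gate: accuracy may not drop more than `max_drop` below baseline."""
    return accuracy >= baseline - max_drop
```

The same ratios fall naturally out of the counter labels shown earlier, e.g. success count over total run count per pipeline.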

Metric Naming Conventions

Follow a consistent naming convention for ML metrics:

  • Prefix with ml_ for all ML-related metrics
  • Include the component: ml_pipeline_, ml_training_, ml_serving_
  • Use standard suffixes: _total (counters), _seconds (durations), _bytes (sizes)
  • Keep label cardinality low to avoid Prometheus performance issues
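
Conventions like these are easiest to keep when enforced in code review or CI. Below is a small, hypothetical validator; the regex and the component list (extended with `model` and `batch` to cover this lesson's examples) are assumptions, not an official Prometheus rule.

```python
import re

# ml_ prefix, a known component, then a lowercase snake_case body
# (standard suffixes like _total or _seconds are part of the body).
ML_METRIC_NAME = re.compile(
    r'^ml_(pipeline|training|serving|model|batch)_[a-z0-9_]+$'
)

def is_valid_ml_metric_name(name: str) -> bool:
    """Check a metric name against the conventions above."""
    return bool(ML_METRIC_NAME.match(name))
```

Running such a check against every registered metric at startup catches naming drift before it reaches dashboards.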

Pro Tip: Expose an ml_pipeline_last_success_timestamp gauge for each pipeline. This makes it trivial to alert on stale pipelines: if the timestamp is more than N hours old, the pipeline has likely failed silently.
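
A sketch of that tip, using the Gauge's built-in `set_to_current_time()` helper (the metric name matches the tip; Prometheus conventions would also accept a `_seconds` suffix):

```python
from prometheus_client import Gauge

# Freshness gauge: alert when time() minus this value exceeds the
# pipeline's expected schedule interval.
last_success = Gauge(
    'ml_pipeline_last_success_timestamp',
    'Unix time of the last successful pipeline run',
    ['pipeline'],
)

def record_success(pipeline: str) -> None:
    """Call at the end of every successful pipeline run."""
    last_success.labels(pipeline=pipeline).set_to_current_time()
```

An alert rule then needs nothing more than an expression of the form `time() - ml_pipeline_last_success_timestamp > N * 3600`.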

Ready to Learn Data Quality Monitoring?

The next lesson covers automated data quality monitoring throughout your ML pipelines.

Next: Data Quality →