ML Pipeline Metrics (Intermediate)

Custom metrics provide the quantitative foundation for ML pipeline observability. Unlike logs and traces, which are primarily debugging tools, metrics enable alerting, trend analysis, and SLO tracking. This lesson covers designing and implementing Prometheus metrics for every stage of your ML pipeline.

Essential Pipeline Metrics

Python
from prometheus_client import Counter, Histogram, Gauge

# Pipeline execution metrics
pipeline_runs_total = Counter('ml_pipeline_runs_total', 'Total pipeline runs', ['pipeline', 'status'])
pipeline_duration = Histogram('ml_pipeline_duration_seconds', 'Pipeline duration', ['pipeline', 'stage'])
pipeline_data_records = Counter('ml_pipeline_records_processed_total', 'Records processed', ['pipeline', 'stage'])

# Model quality metrics
model_accuracy = Gauge('ml_model_accuracy', 'Model accuracy score', ['model', 'version'])
# Note: an 'epoch' label creates one time series per epoch; keep epoch counts bounded
model_training_loss = Gauge('ml_training_loss', 'Training loss', ['model', 'epoch'])
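
The metrics above are only useful once pipeline code actually records into them. Below is a minimal sketch of instrumenting one stage; `run_stage` is a hypothetical helper, and the metrics are redeclared so the sketch is self-contained (in real code you would import the shared definitions instead).

```python
import time

from prometheus_client import Counter, Histogram

# Redeclared here so the sketch stands alone; reuse the shared
# definitions from above in a real codebase.
pipeline_runs_total = Counter('ml_pipeline_runs_total', 'Total pipeline runs', ['pipeline', 'status'])
pipeline_duration = Histogram('ml_pipeline_duration_seconds', 'Pipeline duration', ['pipeline', 'stage'])

def run_stage(pipeline: str, stage: str, fn):
    """Run one stage, recording its duration and success/failure outcome."""
    start = time.monotonic()
    try:
        result = fn()
        pipeline_runs_total.labels(pipeline=pipeline, status='success').inc()
        return result
    except Exception:
        pipeline_runs_total.labels(pipeline=pipeline, status='failure').inc()
        raise
    finally:
        # Record duration whether the stage succeeded or failed.
        pipeline_duration.labels(pipeline=pipeline, stage=stage).observe(time.monotonic() - start)
```

Incrementing the counter in both branches (with a `status` label) is what later lets you compute a completion rate as successes over total runs.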

Metric Types for ML

| Metric Type | ML Use Cases | Example |
| --- | --- | --- |
| Counter | Pipeline runs, records processed, errors | ml_pipeline_runs_total |
| Gauge | Model accuracy, loss, active GPUs, queue depth | ml_model_accuracy |
| Histogram | Pipeline stage duration, inference latency | ml_pipeline_duration_seconds |
| Summary | Batch processing times, feature computation | ml_batch_processing_seconds |
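
One practical wrinkle with histograms for pipelines: the client library's default buckets top out around 10 seconds, far too small for stages that run for minutes or hours. A sketch with hour-scale buckets (the metric name and bucket boundaries here are illustrative, not prescribed):

```python
from prometheus_client import Histogram

# Explicit buckets sized for long-running pipeline stages; the
# defaults (up to ~10 s) would lump every stage into the +Inf bucket.
stage_duration = Histogram(
    'ml_stage_duration_seconds',
    'Duration of a pipeline stage',
    ['stage'],
    buckets=(60, 300, 900, 1800, 3600, 7200, float('inf')),
)

# Record a 20-minute feature-engineering stage.
stage_duration.labels(stage='feature_engineering').observe(1200)
```

A reason to prefer Histogram over Summary for latency SLOs: histogram buckets can be aggregated across instances and quantiled server-side with `histogram_quantile()`, while Summary quantiles are computed per-instance and cannot be meaningfully aggregated.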

Pipeline SLO Metrics

Define Service Level Objectives for your ML pipelines:

  • Pipeline completion rate — 99% of pipeline runs should complete successfully
  • Pipeline duration — 95% of pipeline runs should complete within 2 hours
  • Data freshness — Feature store data should be updated within 1 hour of source data arrival
  • Model quality gate — Model accuracy should not drop more than 2% from the baseline
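
In production these SLOs would typically be evaluated as PromQL queries over the metrics above, but the underlying arithmetic is simple. A minimal sketch of the completion-rate and quality-gate checks (function names and thresholds are illustrative):

```python
def completion_rate_ok(successes: int, total: int, target: float = 0.99) -> bool:
    """Completion-rate SLO: at least `target` of runs succeed."""
    return total > 0 and successes / total >= target

def quality_gate_ok(accuracy: float, baseline: float, max_drop: float = 0.02) -> bool:
    """Quality gate: accuracy may not drop more than `max_drop` below baseline."""
    return accuracy >= baseline - max_drop
```

The same ratios fall naturally out of the counter labels shown earlier, e.g. success count over total run count per pipeline.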

Metric Naming Conventions

Follow a consistent naming convention for ML metrics:

  • Prefix with ml_ for all ML-related metrics
  • Include the component: ml_pipeline_, ml_training_, ml_serving_
  • Use standard suffixes: _total (counters), _seconds (durations), _bytes (sizes)
  • Keep label cardinality low to avoid Prometheus performance issues
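
Conventions like these are easiest to keep when enforced in code review or CI. Below is a small, hypothetical validator; the regex and the component list (extended with `model` and `batch` to cover this lesson's examples) are assumptions, not an official Prometheus rule.

```python
import re

# ml_ prefix, a known component, then a lowercase snake_case body
# (standard suffixes like _total or _seconds are part of the body).
ML_METRIC_NAME = re.compile(
    r'^ml_(pipeline|training|serving|model|batch)_[a-z0-9_]+$'
)

def is_valid_ml_metric_name(name: str) -> bool:
    """Check a metric name against the conventions above."""
    return bool(ML_METRIC_NAME.match(name))
```

Running such a check against every registered metric at startup catches naming drift before it reaches dashboards.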

Pro Tip: Expose an ml_pipeline_last_success_timestamp gauge for each pipeline. This makes it trivial to alert on stale pipelines: if the timestamp is more than N hours old, the pipeline has likely failed silently.
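
A sketch of that tip, using the Gauge's built-in `set_to_current_time()` helper (the metric name matches the tip; Prometheus conventions would also accept a `_seconds` suffix):

```python
from prometheus_client import Gauge

# Freshness gauge: alert when time() minus this value exceeds the
# pipeline's expected schedule interval.
last_success = Gauge(
    'ml_pipeline_last_success_timestamp',
    'Unix time of the last successful pipeline run',
    ['pipeline'],
)

def record_success(pipeline: str) -> None:
    """Call at the end of every successful pipeline run."""
    last_success.labels(pipeline=pipeline).set_to_current_time()
```

An alert rule then needs nothing more than an expression of the form `time() - ml_pipeline_last_success_timestamp > N * 3600`.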

Ready to Learn Data Quality Monitoring?

The next lesson covers automated data quality monitoring throughout your ML pipelines.

Next: Data Quality →