ML Pipeline Metrics
Custom metrics provide the quantitative foundation for ML pipeline observability. Unlike logs and traces that help with debugging, metrics enable alerting, trend analysis, and SLO tracking. This lesson covers designing and implementing Prometheus metrics for every stage of your ML pipeline.
Essential Pipeline Metrics
```python
from prometheus_client import Counter, Histogram, Gauge

# Pipeline execution metrics
pipeline_runs_total = Counter('ml_pipeline_runs_total', 'Total pipeline runs',
                              ['pipeline', 'status'])
pipeline_duration = Histogram('ml_pipeline_duration_seconds', 'Pipeline duration',
                              ['pipeline', 'stage'])
pipeline_data_records = Counter('ml_pipeline_records_processed', 'Records processed',
                                ['pipeline', 'stage'])

# Model quality metrics
model_accuracy = Gauge('ml_model_accuracy', 'Model accuracy score',
                       ['model', 'version'])
model_training_loss = Gauge('ml_training_loss', 'Training loss',
                            ['model', 'epoch'])
```
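Defining the metrics only registers them; the pipeline code still has to update them and expose a scrape endpoint. Here is a minimal sketch continuing from the definitions above, with a placeholder `extract_features` stage and port 8000 chosen arbitrarily:

```python
from prometheus_client import start_http_server

def extract_features(records):
    # Placeholder stage logic: one output per input record.
    return [{'features': r} for r in records]

def run_stage(pipeline, stage, records):
    # time() returns a context manager that observes the elapsed duration.
    with pipeline_duration.labels(pipeline=pipeline, stage=stage).time():
        processed = extract_features(records)
    pipeline_data_records.labels(pipeline=pipeline, stage=stage).inc(len(processed))
    return processed

def run_pipeline(pipeline, records):
    try:
        run_stage(pipeline, 'feature_extraction', records)
        pipeline_runs_total.labels(pipeline=pipeline, status='success').inc()
    except Exception:
        pipeline_runs_total.labels(pipeline=pipeline, status='failure').inc()
        raise

start_http_server(8000)  # expose /metrics on :8000 for Prometheus to scrape
run_pipeline('churn_training', [{'id': 1}, {'id': 2}])
```

Incrementing the failure counter in the `except` block before re-raising keeps the success and failure series consistent, which is what the completion-rate SLO below depends on.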
Metric Types for ML
| Metric Type | ML Use Cases | Example |
|---|---|---|
| Counter | Pipeline runs, records processed, errors | `ml_pipeline_runs_total` |
| Gauge | Model accuracy, loss, active GPUs, queue depth | `ml_model_accuracy` |
| Histogram | Pipeline stage duration, inference latency | `ml_pipeline_duration_seconds` |
| Summary | Batch processing times, feature computation | `ml_batch_processing_seconds` |
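Two type-specific details are worth knowing before you choose: the Python client's default Histogram buckets top out at 10 seconds, and its Summary exposes only a count and a sum (no quantiles). A sketch, with illustrative bucket edges and metric names:

```python
from prometheus_client import Histogram, Summary

# Default buckets suit sub-10-second request latencies; pick edges that
# bracket your actual distribution. These values are illustrative.
inference_latency = Histogram(
    'ml_inference_latency_seconds', 'Model inference latency',
    ['model'],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

# The Python client's Summary tracks only _count and _sum, which still
# yields an average: rate(..._sum[5m]) / rate(..._count[5m]).
batch_processing = Summary(
    'ml_batch_processing_seconds', 'Batch processing time', ['pipeline'],
)
```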
Pipeline SLO Metrics
Define Service Level Objectives for your ML pipelines; a sketch of the supporting instrumentation, with illustrative PromQL, follows the list:
- Pipeline completion rate — 99% of pipeline runs should complete successfully
- Pipeline duration — 95% of pipeline runs should complete within 2 hours
- Data freshness — Feature store data should be updated within 1 hour of source data arrival
- Model quality gate — Model accuracy should not drop more than 2% from the baseline
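Each objective maps onto the metrics defined earlier, plus one freshness gauge. Below is a minimal sketch of that extra instrumentation; the metric name `ml_feature_store_last_update_timestamp` and the feature group are assumptions, and the PromQL in the comments is illustrative rather than a fixed recipe.

```python
from prometheus_client import Gauge

# Freshness SLO: record when feature store data was last refreshed.
# Metric and label names here are assumptions; adapt to your setup.
feature_store_last_update = Gauge(
    'ml_feature_store_last_update_timestamp',
    'Unix time of the last feature store refresh',
    ['feature_group'],
)
feature_store_last_update.labels(feature_group='user_activity').set_to_current_time()

# Illustrative PromQL for the first three objectives:
#
# Completion rate (target: 99% over 24h):
#   sum(rate(ml_pipeline_runs_total{status="success"}[1d]))
#     / sum(rate(ml_pipeline_runs_total[1d]))
#
# Duration (target: 95% of runs under 2h = 7200s):
#   histogram_quantile(0.95,
#     sum by (le) (rate(ml_pipeline_duration_seconds_bucket[1d]))) < 7200
#
# Freshness (target: updated within 1h):
#   time() - ml_feature_store_last_update_timestamp < 3600
#
# The quality gate is easiest to enforce in training code, comparing the
# ml_model_accuracy gauge value against a stored baseline before promotion.
```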
Metric Naming Conventions
Follow a consistent naming convention for ML metrics:
- Prefix with `ml_` for all ML-related metrics
- Include the component: `ml_pipeline_`, `ml_training_`, `ml_serving_`
- Use standard suffixes: `_total` (counters), `_seconds` (durations), `_bytes` (sizes)
- Keep label cardinality low to avoid Prometheus performance issues (see the sketch after this list)
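Cardinality problems usually come from labels whose value sets grow without bound. The sketch below contrasts bounded and unbounded label sets; the "bad" metric name is hypothetical, and the good example repeats the counter defined earlier for contrast.

```python
from prometheus_client import Counter

# Good: 'pipeline' and 'status' each take a handful of values, so the
# number of time series stays small and predictable.
pipeline_runs_total = Counter('ml_pipeline_runs_total', 'Total pipeline runs',
                              ['pipeline', 'status'])

# Bad: a per-run ID mints a new time series on every execution and will
# eventually degrade Prometheus. Keep run IDs in logs and traces instead.
# runs_by_id = Counter('ml_pipeline_runs_by_id_total', 'Runs by ID',
#                      ['pipeline', 'run_id'])  # unbounded cardinality
```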
Pro Tip: Expose an `ml_pipeline_last_success_timestamp` gauge for each pipeline. This makes it trivial to alert on stale pipelines: if the timestamp is more than N hours old, the pipeline has likely failed silently.
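A minimal sketch of that pattern using the Python client's `set_to_current_time()`; the six-hour threshold in the comment is an arbitrary example.

```python
from prometheus_client import Gauge

last_success = Gauge(
    'ml_pipeline_last_success_timestamp',
    'Unix time of the last successful pipeline run',
    ['pipeline'],
)

def on_pipeline_success(pipeline: str) -> None:
    # Call only after the run has fully succeeded.
    last_success.labels(pipeline=pipeline).set_to_current_time()

# Illustrative staleness alert (PromQL), firing after 6 hours (21600s):
#   time() - ml_pipeline_last_success_timestamp > 21600
```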
Ready to Learn Data Quality Monitoring?
The next lesson covers automated data quality monitoring throughout your ML pipelines.