ML Pipeline Tracing
Distributed tracing gives you end-to-end visibility into how data flows through your ML pipeline. By instrumenting each stage with OpenTelemetry spans, you can see exactly how long each step takes, where bottlenecks occur, and where failures originate—even when your pipeline spans multiple services and clusters.
OpenTelemetry for ML Pipelines
Python

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Register a tracer provider so spans are actually recorded
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("ml-pipeline")

def run_pipeline(data_path):
    with tracer.start_as_current_span("pipeline-run") as span:
        span.set_attribute("data.path", data_path)

        with tracer.start_as_current_span("data-ingestion"):
            data = ingest_data(data_path)

        with tracer.start_as_current_span("feature-engineering"):
            features = engineer_features(data)

        with tracer.start_as_current_span("model-training") as train_span:
            model = train_model(features)
            train_span.set_attribute("model.accuracy", model.accuracy)
Key Spans for ML Pipelines
- Data ingestion — Track records read, data source, file sizes, and parsing errors
- Feature engineering — Track feature computation time, feature count, and transformation errors
- Training — Track epochs, loss values, learning rate, and GPU utilization during training
- Evaluation — Track model metrics (accuracy, F1, AUC) and comparison with baseline
- Model registration — Track model artifact size, registry upload time, and version assignment
- Deployment — Track rollout strategy, canary percentage, and health check results
Trace Context Propagation
When ML pipeline stages run as separate services or Kubernetes jobs, trace context must be propagated between them. Use W3C Trace Context headers or pass trace IDs through message queues and job metadata.
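To make the W3C format concrete, the sketch below builds and parses a `traceparent` header by hand (version, 16-byte trace ID, 8-byte span ID, and flags, all hex-encoded). In practice you would let OpenTelemetry's propagators do this, but hand-rolling it shows what actually travels through job metadata or message-queue headers:

```python
# Sketch: a W3C traceparent header has the form
#   {version}-{trace_id:32 hex}-{span_id:16 hex}-{flags}
# The helper names here are illustrative, not a library API.

def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id:032x}-{span_id:016x}-{flags}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": int(trace_id, 16),
        "span_id": int(span_id, 16),
        "sampled": flags == "01",
    }

# A downstream Kubernetes job can read this from its metadata and
# continue the same trace
job_metadata = {"traceparent": build_traceparent(0xABC123, 0x42)}
ctx = parse_traceparent(job_metadata["traceparent"])
```

With the OpenTelemetry SDK, `opentelemetry.propagate.inject(carrier)` and `extract(carrier)` perform the equivalent round trip against any dict-like carrier.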
Visualizing Traces with Jaeger
Jaeger provides a timeline view of your pipeline execution, showing:
- Waterfall view — See all spans in chronological order with parent-child relationships
- Critical path analysis — Identify which stage is the bottleneck in your pipeline
- Comparison view — Compare traces from different pipeline runs to spot regressions
- Service dependency map — Visualize how pipeline components communicate
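To try these views locally, one common setup (an assumption here, not prescribed by this lesson) is the Jaeger all-in-one image, which serves the UI on port 16686 and accepts OTLP traces on port 4317:

```shell
# Run Jaeger all-in-one locally:
#   UI at http://localhost:16686, OTLP gRPC ingest on 4317
docker run --rm \
  -p 16686:16686 \
  -p 4317:4317 \
  -e COLLECTOR_OTLP_ENABLED=true \
  jaegertracing/all-in-one:latest
```

Point your pipeline's OTLP span exporter at `localhost:4317` and traces from each run will appear in the Jaeger UI.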
ML-Specific Tip: Add custom span attributes for ML-relevant metadata like dataset version, model architecture, hyperparameters, and evaluation metrics. This allows you to correlate pipeline performance with model quality across runs.
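One wrinkle when attaching hyperparameters: span attribute values must be primitives (str, bool, int, float) or homogeneous sequences of them, so nested config dicts need flattening first. The helper below is a hypothetical sketch, not part of any library:

```python
# Hypothetical helper: flatten a nested hyperparameter dict into
# dot-separated keys suitable for span.set_attribute.

def flatten_params(params: dict, prefix: str = "hp") -> dict:
    flat = {}
    for key, value in params.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten_params(value, prefix=name))
        elif isinstance(value, (str, bool, int, float)):
            flat[name] = value
        else:
            flat[name] = str(value)  # fall back to string for other types
    return flat

attrs = flatten_params({"optimizer": {"name": "adam", "lr": 3e-4}, "epochs": 10})
# attrs == {"hp.optimizer.name": "adam", "hp.optimizer.lr": 0.0003, "hp.epochs": 10}
```

You can then loop over the result and call `span.set_attribute(name, value)` for each entry, making every run's configuration queryable alongside its trace.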
Ready to Learn Structured Logging?
The next lesson covers implementing structured logging for ML pipelines with context propagation and aggregation.
Next: Logging →