Data Lineage
Data lineage tracks how data flows from its source through transformations to its final consumption point. For AI systems, lineage is essential for debugging, reproducibility, compliance, and impact analysis.
Levels of Lineage
| Level | What It Tracks | Use Case |
|---|---|---|
| Table-level | Which tables feed into which downstream tables | High-level dependency understanding |
| Column-level | Which columns derive from which source columns | Precise impact analysis and debugging |
| Row-level | Which specific records contributed to an output | GDPR compliance, model explainability |
| Model-level | Which data versions trained which model versions | ML reproducibility and audit |
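The levels above can all be expressed as directed lineage edges tagged with a granularity. A minimal sketch, with entirely hypothetical asset names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageEdge:
    """One directed lineage edge at a given granularity."""
    level: str   # "table", "column", "row", or "model"
    source: str  # upstream asset, e.g. "raw.orders" or "raw.orders.amount"
    target: str  # downstream asset it feeds

# Hypothetical edges mixing table-, column-, and model-level granularity.
edges = [
    LineageEdge("table", "raw.orders", "analytics.daily_revenue"),
    LineageEdge("column", "raw.orders.amount", "analytics.daily_revenue.total"),
    LineageEdge("model", "analytics.daily_revenue@v3", "churn_model@1.2.0"),
]

# Column-level edges answer "which source columns feed this output column?"
sources_of_total = [e.source for e in edges
                    if e.level == "column"
                    and e.target == "analytics.daily_revenue.total"]
```

Storing the level on each edge lets one graph serve both coarse dependency views and precise column-level queries.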
Lineage Collection Methods
- SQL parsing: Parse SQL queries to extract source-target relationships and column-level transformations
- Pipeline metadata: Extract lineage from orchestrators (Airflow DAGs, dbt models, Spark jobs)
- Query log analysis: Mine query logs from data warehouses to infer lineage from actual usage
- API instrumentation: Emit lineage events from data processing code using OpenLineage or similar standards
- Manual annotation: For legacy systems where automated collection is not feasible
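To make the SQL-parsing approach concrete, here is a deliberately naive sketch that pulls the target and source tables out of a single `INSERT ... SELECT` statement. Production systems use a real SQL parser (e.g. sqlglot) that handles CTEs, subqueries, and aliases; this toy version only demonstrates the idea:

```python
import re

def table_lineage(sql: str):
    """Naively extract (target, sources) from one INSERT...SELECT statement.

    Toy sketch only: a regex cannot handle CTEs, subqueries, or quoting;
    use a real SQL parser for anything beyond a demo.
    """
    target_m = re.search(r"insert\s+into\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    return (target_m.group(1) if target_m else None, sources)

sql = """
INSERT INTO analytics.daily_revenue
SELECT o.order_date, SUM(o.amount)
FROM raw.orders o
JOIN raw.customers c ON o.customer_id = c.id
GROUP BY o.order_date
"""
target, sources = table_lineage(sql)
# target == "analytics.daily_revenue"
# sources == ["raw.orders", "raw.customers"]
```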
Lineage for ML Systems
ML systems require extended lineage that connects data to models:
- Feature lineage: Source data → transformation → feature store → feature vector → model input
- Training lineage: Which data snapshot, feature versions, and hyperparameters produced which model version
- Inference lineage: For a specific prediction, which feature values and model version were used
- Feedback lineage: How prediction outcomes feed back into retraining data
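A training-lineage record like the one described above can be captured as a small immutable structure whose hash serves as an audit fingerprint. The schema and field names below are illustrative assumptions, not a standard:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainingLineage:
    """Ties one model version to exactly what produced it (illustrative schema)."""
    model_version: str
    data_snapshot: str       # e.g. content hash or warehouse snapshot ID
    feature_versions: tuple  # pinned feature definitions, e.g. ("ltv@v4",)
    hyperparameters: tuple   # sorted (name, value) pairs for determinism

    def fingerprint(self) -> str:
        """Stable hash of the full training context, for audit and dedup."""
        payload = json.dumps(asdict(self), sort_keys=True, default=str)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

run = TrainingLineage(
    model_version="churn@1.2.0",
    data_snapshot="snap-2024-06-01",
    feature_versions=("customer_ltv@v4", "days_since_order@v2"),
    hyperparameters=(("learning_rate", 0.05), ("max_depth", 6)),
)
```

Because the record is frozen and hashed deterministically, two training runs with identical inputs produce the same fingerprint, which makes reproducibility checks a simple equality test.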
Impact Analysis
Lineage enables impact analysis — answering "what breaks if this changes?":
- Upstream impact: If a source table schema changes, which downstream features and models are affected?
- Downstream impact: If this model's predictions change, which business processes are affected?
- Cross-system impact: If this API is deprecated, which data pipelines and models depend on it?
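All three questions reduce to graph traversal over the lineage graph. A minimal sketch of downstream impact analysis, using an assumed in-memory adjacency map:

```python
from collections import deque

# Hypothetical dependency graph: asset -> assets that consume it directly.
downstream = {
    "raw.orders": ["features.customer_ltv", "analytics.daily_revenue"],
    "features.customer_ltv": ["churn_model"],
    "analytics.daily_revenue": ["finance_dashboard"],
    "churn_model": ["retention_campaign"],
}

def impacted(asset: str) -> set:
    """All transitive downstream dependents of `asset` (breadth-first)."""
    seen, queue = set(), deque([asset])
    while queue:
        for dep in downstream.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# A schema change to raw.orders touches features, a model, and two
# business consumers:
# impacted("raw.orders") == {"features.customer_ltv",
#     "analytics.daily_revenue", "churn_model",
#     "finance_dashboard", "retention_campaign"}
```

Upstream impact is the same traversal over the reversed graph.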
OpenLineage standard: Consider adopting the OpenLineage standard for lineage event emission. It provides a vendor-neutral specification that works across tools (Airflow, Spark, dbt, Flink) and integrates with multiple catalog platforms.
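An OpenLineage run event is plain JSON. The sketch below builds a minimal event covering only the core top-level fields (eventType, run, job, inputs, outputs); the producer URI, namespaces, and job names are assumptions, and the real spec defines additional required metadata and optional facets:

```python
import json
import uuid
from datetime import datetime, timezone

def run_event(event_type: str, job_name: str, inputs, outputs) -> dict:
    """Build a minimal OpenLineage-style run event (core fields only;
    consult the OpenLineage spec for facets and the full schema)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/my-pipeline",  # assumed producer URI
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "demo", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
    }

event = run_event("COMPLETE", "daily_revenue_job",
                  inputs=["raw.orders"],
                  outputs=["analytics.daily_revenue"])
payload = json.dumps(event)  # POST this to a lineage backend (e.g. Marquez)
```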