
Data Lineage

Data lineage tracks how data flows from its source through transformations to its final consumption point. For AI systems, lineage is essential for debugging, reproducibility, compliance, and impact analysis.

Levels of Lineage

Level        | What It Tracks                                    | Use Case
Table-level  | Which tables feed into which downstream tables    | High-level dependency understanding
Column-level | Which columns derive from which source columns    | Precise impact analysis and debugging
Row-level    | Which specific records contributed to an output   | GDPR compliance, model explainability
Model-level  | Which data versions trained which model versions  | ML reproducibility and audit
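The first two granularities can be sketched as simple adjacency maps. This is an illustrative data-structure sketch, not a real catalog's schema; all table and column names below are hypothetical.

```python
# Hypothetical sketch: lineage at two granularities.
# Table-level: edges between table names.
# Column-level: edges between (table, column) pairs.

table_lineage = {
    "analytics.daily_revenue": ["raw.orders", "raw.refunds"],
}

column_lineage = {
    ("analytics.daily_revenue", "net_revenue"): [
        ("raw.orders", "amount"),
        ("raw.refunds", "amount"),
    ],
}

def upstream_columns(table, column):
    """Return the source columns a derived column depends on."""
    return column_lineage.get((table, column), [])
```

Column-level lineage subsumes table-level lineage (projecting the column edges down to their tables recovers the table graph), which is why many tools collect at column granularity and aggregate upward.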

Lineage Collection Methods

  • SQL parsing: Parse SQL queries to extract source-target relationships and column-level transformations
  • Pipeline metadata: Extract lineage from orchestrators (Airflow DAGs, dbt models, Spark jobs)
  • Query log analysis: Mine query logs from data warehouses to infer lineage from actual usage
  • API instrumentation: Emit lineage events from data processing code using OpenLineage or similar standards
  • Manual annotation: Document lineage by hand for legacy systems where automated collection is not feasible
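The SQL-parsing approach can be illustrated with a deliberately naive sketch. Production systems use a full SQL parser (sqlglot, for example, offers column-level lineage); this regex version only handles simple `INSERT INTO ... SELECT ... FROM/JOIN` statements and is meant to show the idea, not to be relied on.

```python
import re

def extract_table_lineage(sql):
    """Naively extract (target, sources) from one SQL statement.
    Only handles plain INSERT INTO ... SELECT with FROM/JOIN clauses;
    subqueries, CTEs, and quoted identifiers are not supported."""
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return (target.group(1) if target else None, sources)

sql = """
INSERT INTO analytics.daily_revenue
SELECT o.order_date, SUM(o.amount)
FROM raw.orders o
JOIN raw.order_status s ON o.id = s.order_id
GROUP BY o.order_date
"""
```

Running `extract_table_lineage(sql)` yields the target table plus its two source tables, which is exactly the table-level edge set a lineage collector would record for this statement.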

Lineage for ML Systems

ML systems require extended lineage that connects data to models:

  • Feature lineage: Source data → transformation → feature store → feature vector → model input
  • Training lineage: Which data snapshot, feature versions, and hyperparameters produced which model version
  • Inference lineage: For a specific prediction, which feature values and model version were used
  • Feedback lineage: How prediction outcomes feed back into retraining data
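A training-lineage record can be as simple as a structured object tying a model version to its inputs, plus a deterministic fingerprint for audit. The field names below are assumptions for illustration, not a standard schema.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class TrainingLineage:
    """Illustrative record linking a model version to the data snapshot,
    feature versions, and hyperparameters that produced it."""
    model_version: str
    data_snapshot_id: str
    feature_versions: dict
    hyperparameters: dict

    def fingerprint(self):
        # Deterministic hash over all fields: two training runs with
        # identical inputs produce the same fingerprint, making
        # reproducibility checks a simple string comparison.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]
```

Stored alongside the model artifact, such a record lets an auditor answer "which exact data trained this model?" without reconstructing the pipeline state.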

Impact Analysis

Lineage enables impact analysis — answering "what breaks if this changes?":

  • Upstream impact: If a source table schema changes, which downstream features and models are affected?
  • Downstream impact: If this model's predictions change, which business processes are affected?
  • Cross-system impact: If this API is deprecated, which data pipelines and models depend on it?
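All three questions reduce to graph traversal over the lineage graph. A minimal sketch, assuming lineage is available as a downstream-edge map (all node names here are hypothetical):

```python
from collections import deque

# Hypothetical downstream-edge map: node -> consumers of that node.
edges = {
    "raw.orders": ["features.order_agg"],
    "features.order_agg": ["model.churn_v3", "dashboard.sales"],
    "model.churn_v3": ["process.retention_emails"],
}

def downstream_impact(node):
    """Breadth-first walk returning every asset affected if `node` changes."""
    seen, queue = set(), deque([node])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Upstream impact is the same traversal over the reversed edge map, so one graph representation answers both directions.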
💡 OpenLineage standard: Consider adopting the OpenLineage standard for lineage event emission. It provides a vendor-neutral specification that works across tools (Airflow, Spark, dbt, Flink) and integrates with multiple catalog platforms.
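The core of OpenLineage is the RunEvent. The sketch below builds one as a plain dict using field names from the OpenLineage spec (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`); the namespace values and producer URL are placeholders, and in practice the official client library would handle construction and transport.

```python
import uuid
from datetime import datetime, timezone

def make_run_event(job_name, inputs, outputs, state="COMPLETE"):
    """Build a minimal OpenLineage-style RunEvent as a plain dict.
    Namespaces and the producer URL below are illustrative placeholders."""
    return {
        "eventType": state,  # START, COMPLETE, FAIL, or ABORT
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "my-pipelines", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
        "producer": "https://example.com/my-lineage-emitter",
    }

event = make_run_event(
    "daily_revenue_job",
    inputs=["raw.orders"],
    outputs=["analytics.daily_revenue"],
)
```

Emitting a START event when a job begins and a COMPLETE (or FAIL) event when it ends, each carrying the job's input and output datasets, is enough for a catalog to assemble the full lineage graph.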