Intermediate

Data Versioning

Track datasets with DVC, configure remote storage, and define data pipelines.

DVC Setup

# Initialize DVC and add remote storage
dvc init
dvc remote add -d myremote s3://my-bucket/dvc-store
# Or use local storage for development (-f overwrites the remote added above):
dvc remote add -d -f myremote /tmp/dvc-store

# Track a dataset
dvc add data/training_data.csv
git add data/training_data.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"

# Push data to remote
dvc push
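`dvc add` writes a small pointer file next to the data; the actual bytes go to the DVC cache and, after `dvc push`, to the remote. Only the pointer is committed to git. The pointer file is plain YAML and looks roughly like this (hash and size are illustrative):

```yaml
# data/training_data.csv.dvc (illustrative values)
outs:
- md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6
  size: 1048576
  path: training_data.csv
```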

DVC Pipeline

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps: [src/prepare.py, data/raw/]
    outs: [data/processed/]

  train:
    cmd: python src/train.py
    deps: [src/train.py, data/processed/]
    params: [model]   # track the model section of params.yaml
    outs: [models/]
    metrics: [metrics.json]

  evaluate:
    cmd: python src/evaluate.py
    deps: [src/evaluate.py, models/, data/processed/]
    metrics: [eval_metrics.json]
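`src/evaluate.py` is referenced in the pipeline above but not shown. A minimal sketch of what it might do, with hardcoded stand-in predictions (a real script would load the model from `models/` and predict on `data/processed/test.csv`):

```python
# src/evaluate.py (sketch): write metrics for DVC to track in eval_metrics.json
import json

from sklearn.metrics import accuracy_score, f1_score

# Stand-in labels and predictions for illustration only.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

# DVC reads this file because the evaluate stage lists it under metrics:
with open("eval_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

print(metrics)
```

Because the stage declares `eval_metrics.json` under `metrics:`, `dvc metrics show` and `dvc metrics diff` can display and compare these values across commits.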
# params.yaml
model:
  type: random_forest
  n_estimators: 100
  max_depth: 10
  test_size: 0.2
  random_state: 42
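Stages that declare `params:` typically read these values at runtime. A sketch of how `src/train.py` might load them, assuming PyYAML is installed; the YAML is inlined here for illustration, whereas the real script would `open("params.yaml")`:

```python
import yaml

# Inline copy of params.yaml for illustration; in src/train.py you would
# read the actual file instead of this string.
PARAMS_YAML = """\
model:
  type: random_forest
  n_estimators: 100
  max_depth: 10
  test_size: 0.2
  random_state: 42
"""

params = yaml.safe_load(PARAMS_YAML)
model_cfg = params["model"]
print(model_cfg["type"], model_cfg["n_estimators"])
```

Reading hyperparameters from `params.yaml` rather than hardcoding them lets DVC detect parameter changes and re-run only the affected stages.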
# src/prepare.py
import os

import pandas as pd
from sklearn.model_selection import train_test_split

def prepare_data():
    os.makedirs("data/processed", exist_ok=True)

    # Basic cleaning: drop missing values and duplicate rows
    df = pd.read_csv("data/raw/dataset.csv")
    df = df.dropna()
    df = df.drop_duplicates()

    # Reproducible train/test split
    train, test = train_test_split(df, test_size=0.2, random_state=42)
    train.to_csv("data/processed/train.csv", index=False)
    test.to_csv("data/processed/test.csv", index=False)
    print(f"Train: {len(train)}, Test: {len(test)}")

if __name__ == "__main__":
    prepare_data()

Run Pipeline

dvc repro          # Run full pipeline
dvc metrics show   # Show metrics
dvc diff           # Compare with previous run
📦 DVC advantage: datasets are version-controlled alongside code. After git checkout v1.0, run dvc checkout to restore the dataset version that commit references.
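This works because DVC's cache is content-addressed: the .dvc pointer stores an MD5 hash of the file, so each commit pins exact bytes and identical data is stored only once. A minimal sketch of that kind of hashing (an illustration of the idea, not DVC's actual implementation):

```python
import hashlib

def file_md5(path: str) -> str:
    """Stream a file through MD5, the hash DVC uses to key its cache."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large datasets don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

If a tracked file's hash matches what is already in the cache, `dvc add` and `dvc push` have nothing new to store.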