Intermediate
Data Versioning
Track datasets with DVC, configure remote storage, and define data pipelines.
DVC Setup
# Initialize DVC and add remote storage
dvc init
dvc remote add -d myremote s3://my-bucket/dvc-store
# Or use local storage for development:
dvc remote add -d myremote /tmp/dvc-store
# Track a dataset
dvc add data/training_data.csv
git add data/training_data.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"
# Push data to remote
dvc push
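Under the hood, dvc add stores the file in a content-addressed cache: the file's MD5 hash becomes its path, and the small .dvc pointer file committed to git records that hash. A simplified sketch of the idea (hypothetical helper functions, not DVC's actual API; real DVC also handles directories and newer cache layouts):

```python
import hashlib
import os
import shutil

def cache_path(md5, cache_dir=".dvc/cache"):
    # DVC-style layout: first two hex chars become a subdirectory.
    return os.path.join(cache_dir, md5[:2], md5[2:])

def add_to_cache(path, cache_dir=".dvc/cache"):
    # Hash the file contents, then copy it to its content-addressed slot.
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    dest = cache_path(md5, cache_dir)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(path, dest)
    return md5
```

Because the cache key is the content hash, identical files are stored once, and dvc push only uploads objects the remote does not already have.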
DVC Pipeline
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps: [src/prepare.py, data/raw/]
    outs: [data/processed/]
  train:
    cmd: python src/train.py
    deps: [src/train.py, data/processed/]
    params: [model.type, model.n_estimators, model.max_depth]
    outs: [models/]
    metrics: [metrics.json]
  evaluate:
    cmd: python src/evaluate.py
    deps: [src/evaluate.py, models/, data/processed/]
    metrics: [eval_metrics.json]
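dvc repro reruns a stage only when a dependency has changed relative to the hashes recorded in dvc.lock. A minimal sketch of that check (illustrative only; real DVC also hashes directories, params, and outputs):

```python
import hashlib

def file_md5(path):
    # Hash a file in chunks so large datasets do not need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def stage_changed(deps, recorded):
    # recorded: {path: md5} from the previous run (as stored in dvc.lock).
    return any(file_md5(p) != recorded.get(p) for p in deps)
```

If no dependency hash differs, the stage is skipped and its cached outputs are reused.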
# params.yaml
model:
  type: random_forest
  n_estimators: 100
  max_depth: 10
test_size: 0.2
random_state: 42
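src/train.py would read these values at runtime so that dvc repro can detect parameter changes. A possible sketch, assuming PyYAML is installed (the function name is illustrative, not from the lesson):

```python
import yaml

def load_params(path="params.yaml"):
    # Parse params.yaml into a plain dict for use by the training script.
    with open(path) as f:
        return yaml.safe_load(f)

# Usage sketch:
#   params = load_params()
#   RandomForestClassifier(
#       n_estimators=params["model"]["n_estimators"],
#       max_depth=params["model"]["max_depth"],
#       random_state=params["random_state"])
```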
# src/prepare.py
import pandas as pd
import os
def prepare_data():
os.makedirs("data/processed", exist_ok=True)
df = pd.read_csv("data/raw/dataset.csv")
df = df.dropna()
df = df.drop_duplicates()
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42)
train.to_csv("data/processed/train.csv", index=False)
test.to_csv("data/processed/test.csv", index=False)
print(f"Train: {len(train)}, Test: {len(test)}")
if __name__ == "__main__":
prepare_data()
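src/evaluate.py is not shown in the lesson; a minimal stdlib-only sketch of its metrics-writing side (function names and inputs are illustrative; a real script would load models/ and data/processed/test.csv):

```python
import json

def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction matches the label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def write_metrics(y_true, y_pred, path="eval_metrics.json"):
    # DVC picks this file up via the stage's metrics: entry,
    # so dvc metrics show and dvc diff can report it.
    metrics = {"accuracy": accuracy(y_true, y_pred)}
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    return metrics
```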
Run Pipeline
dvc repro # Run full pipeline
dvc metrics show # Show metrics
dvc diff # Compare with previous run
DVC advantage: Datasets are version-controlled alongside code.
After git checkout v1.0, running dvc checkout restores the dataset version that matches that commit.
Lilly Tech Systems