Intermediate

Data Loading for Machine Learning

Load transformed data into the right destinations — feature stores, data warehouses, and optimized file formats — for model training and real-time serving.

Loading Destinations for ML

Where you load your data depends on how it will be consumed:

| Destination | Use Case | Access Pattern |
| --- | --- | --- |
| Feature Store | Online + offline features | Key-value lookups, batch reads |
| Data Warehouse | Batch training datasets | SQL queries over large tables |
| Object Storage | Training data files | Sequential reads (Parquet, TFRecord) |
| Vector Database | Embedding similarity | Nearest-neighbor search |
| Cache (Redis) | Real-time inference features | Sub-millisecond key lookups |

Loading to Feature Stores

Feature stores solve the training-serving skew problem by providing a single source of truth for features:

from datetime import datetime

from feast import Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

# Define a feature view in Feast
# (driver_stats_source is a data source defined elsewhere in the feature repo)
driver_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[Entity(name="driver_id", join_keys=["driver_id"])],
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,
)

# Materialize features into the online store for serving
store = FeatureStore(repo_path="feature_repo/")
store.materialize_incremental(end_date=datetime.now())
💡 Online vs offline stores: Feature stores maintain two copies of each feature: an offline store (data warehouse) for batch training-data retrieval and an online store (Redis, DynamoDB) for low-latency serving during inference.
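The dual-store idea can be sketched without Feast, using plain Python containers as stand-ins: a dict mimics the key-value online store, a list of rows mimics the append-only offline store. All names here are illustrative, not part of any library:

```python
from datetime import datetime, timezone

online_store = {}   # latest value per key, for low-latency serving
offline_store = []  # full history, for batch training

def materialize(feature_rows):
    """Write each row to both stores so training and serving see the same values."""
    for row in feature_rows:
        offline_store.append(row)             # keep every historical row
        online_store[row["driver_id"]] = row  # keep only the freshest row per key

materialize([
    {"driver_id": 1001, "conv_rate": 0.52,
     "event_ts": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"driver_id": 1001, "conv_rate": 0.61,
     "event_ts": datetime(2024, 1, 2, tzinfo=timezone.utc)},
])

# Serving path: one key lookup returns the freshest features
print(online_store[1001]["conv_rate"])  # → 0.61
# Training path: the offline store retains every historical row
print(len(offline_store))               # → 2
```

Because both stores are written from the same materialization step, the model sees identical feature values at training time and at inference time, which is exactly how a real feature store avoids training-serving skew.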

Optimized File Formats

Choosing the right file format significantly impacts training speed and storage costs:

| Format | Compression | Column Access | Best For |
| --- | --- | --- | --- |
| Parquet | Excellent | Yes (columnar) | Tabular ML data, Spark/Pandas |
| TFRecord | Good | No (row-based) | TensorFlow training pipelines |
| Arrow/Feather | Good | Yes (columnar) | Fast in-memory interchange |
| Delta Lake | Excellent | Yes (columnar) | Versioned, ACID-compliant datasets |
| CSV | None | No | Small datasets, debugging only |

import pyarrow as pa
import pyarrow.parquet as pq

# df is an existing pandas DataFrame of transformed features
table = pa.Table.from_pandas(df)

# Write partitioned, snappy-compressed Parquet files
pq.write_to_dataset(
    table,
    root_path="s3://ml-data/training/",
    partition_cols=["date", "category"],
    compression="snappy",
)

Loading Patterns

Full Refresh

Replace the entire destination table. Simple but expensive. Use for small reference tables or when incremental logic is too complex.

Append Only

Add new records without modifying existing ones. Best for event/log data and time-series features.

Upsert (Merge)

Insert new records and update existing ones based on a key. The most common pattern for dimension tables.

Partition Overwrite

Overwrite specific partitions (e.g., by date) without touching others. Balances simplicity and efficiency.
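The upsert pattern above can be illustrated with SQLite's INSERT ... ON CONFLICT, which plays the same role as MERGE in warehouse SQL; the table and column names here are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_driver (driver_id INTEGER PRIMARY KEY, conv_rate REAL)")

def upsert(rows):
    # Insert new keys, update existing ones -- the upsert/merge pattern
    conn.executemany(
        """INSERT INTO dim_driver (driver_id, conv_rate) VALUES (?, ?)
           ON CONFLICT(driver_id) DO UPDATE SET conv_rate = excluded.conv_rate""",
        rows,
    )

upsert([(1, 0.50), (2, 0.70)])   # both rows inserted
upsert([(2, 0.75), (3, 0.40)])   # 2 updated in place, 3 inserted

print(conn.execute(
    "SELECT driver_id, conv_rate FROM dim_driver ORDER BY driver_id").fetchall())
# → [(1, 0.5), (2, 0.75), (3, 0.4)]
```

The key design choice is the conflict key: it must uniquely identify a row in the destination, otherwise an upsert silently degenerates into append-with-duplicates.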

Data Versioning

Version your training datasets so you can reproduce any model's results:

# Using DVC for data versioning
# dvc init
# dvc add data/training_v2.parquet
# git add data/training_v2.parquet.dvc
# git commit -m "Add training dataset v2"
# dvc push

# Using Delta Lake for versioned tables
# (df is a Spark DataFrame, spark an active SparkSession)

# Each overwrite creates a new table version in the Delta transaction log
df.write.format("delta").mode("overwrite").save("/data/features")

# Read a specific version back ("time travel")
df_v2 = spark.read.format("delta").option("versionAsOf", 2).load("/data/features")
Always version your training data alongside your model artifacts. When a model degrades in production, you need to know exactly which data was used to train it. Link dataset versions to experiment tracking entries in MLflow or W&B.
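One lightweight way to make that link concrete is to record a content hash of the dataset file in the run's metadata. The sketch below uses the standard library only, with a dict standing in for a tracker entry; in MLflow you would log the same value as a run parameter:

```python
import hashlib
import tempfile
from pathlib import Path

def dataset_fingerprint(path):
    """Content hash of a dataset file -- a stable, reproducible version ID."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Toy file standing in for a real training Parquet file
data_file = Path(tempfile.mkdtemp()) / "training.parquet"
data_file.write_bytes(b"fake parquet bytes")

# Store the fingerprint alongside the run, as a tracker parameter would be
run_record = {
    "run_id": "exp-042",  # hypothetical experiment ID
    "dataset_sha256": dataset_fingerprint(data_file),
}
print(len(run_record["dataset_sha256"]))  # → 64 (hex chars of SHA-256)
```

When a production model degrades, comparing the logged hash against current data tells you immediately whether the dataset itself has drifted since training.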