Intermediate

Data Loading for Machine Learning

Load transformed data into the right destinations — feature stores, data warehouses, and optimized file formats — for model training and real-time serving.

Loading Destinations for ML

Where you load your data depends on how it will be consumed:

| Destination | Use Case | Access Pattern |
| --- | --- | --- |
| Feature Store | Online + offline features | Key-value lookups, batch reads |
| Data Warehouse | Batch training datasets | SQL queries over large tables |
| Object Storage | Training data files | Sequential reads (Parquet, TFRecord) |
| Vector Database | Embedding similarity | Nearest-neighbor search |
| Cache (Redis) | Real-time inference features | Sub-millisecond key lookups |

Loading to Feature Stores

Feature stores solve the training-serving skew problem by providing a single source of truth for features:

from datetime import datetime

from feast import Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

# Define a feature view in Feast
# (driver_stats_source is a data source defined elsewhere in the feature repo)
driver_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[Entity(name="driver_id", join_keys=["driver_id"])],
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,
)

# Materialize features into the online store for serving
store = FeatureStore(repo_path="feature_repo/")
store.materialize_incremental(end_date=datetime.now())
💡 Online vs offline stores: Feature stores maintain two copies of each feature: an offline store (data warehouse) for batch training-data retrieval and an online store (Redis, DynamoDB) for low-latency serving during inference.
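The dual-store idea can be sketched without Feast, using plain Python containers as stand-ins: a dict mimics the key-value online store, a list of rows mimics the append-only offline store. All names here are illustrative, not part of any library:

```python
from datetime import datetime, timezone

online_store = {}   # latest value per key, for low-latency serving
offline_store = []  # full history, for batch training

def materialize(feature_rows):
    """Write each row to both stores so training and serving see the same values."""
    for row in feature_rows:
        offline_store.append(row)             # keep every historical row
        online_store[row["driver_id"]] = row  # keep only the freshest row per key

materialize([
    {"driver_id": 1001, "conv_rate": 0.52,
     "event_ts": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"driver_id": 1001, "conv_rate": 0.61,
     "event_ts": datetime(2024, 1, 2, tzinfo=timezone.utc)},
])

# Serving path: one key lookup returns the freshest features
print(online_store[1001]["conv_rate"])  # → 0.61
# Training path: the offline store retains every historical row
print(len(offline_store))               # → 2
```

Because both stores are written from the same materialization step, the model sees identical feature values at training time and at inference time, which is exactly how a real feature store avoids training-serving skew.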

Optimized File Formats

Choosing the right file format significantly impacts training speed and storage costs:

| Format | Compression | Column Access | Best For |
| --- | --- | --- | --- |
| Parquet | Excellent | Yes (columnar) | Tabular ML data, Spark/Pandas |
| TFRecord | Good | No (row-based) | TensorFlow training pipelines |
| Arrow/Feather | Good | Yes (columnar) | Fast in-memory interchange |
| Delta Lake | Excellent | Yes (columnar) | Versioned, ACID-compliant datasets |
| CSV | None | No | Small datasets, debugging only |

import pyarrow as pa
import pyarrow.parquet as pq

# df is an existing pandas DataFrame of transformed features
table = pa.Table.from_pandas(df)

# Write partitioned, snappy-compressed Parquet files
pq.write_to_dataset(
    table,
    root_path="s3://ml-data/training/",
    partition_cols=["date", "category"],
    compression="snappy",
)

Loading Patterns

Full Refresh

Replace the entire destination table. Simple but expensive. Use for small reference tables or when incremental logic is too complex.

Append Only

Add new records without modifying existing ones. Best for event/log data and time-series features.

Upsert (Merge)

Insert new records and update existing ones based on a key. The most common pattern for dimension tables.

Partition Overwrite

Overwrite specific partitions (e.g., by date) without touching others. Balances simplicity and efficiency.
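The upsert pattern above can be illustrated with SQLite's INSERT ... ON CONFLICT, which plays the same role as MERGE in warehouse SQL; the table and column names here are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_driver (driver_id INTEGER PRIMARY KEY, conv_rate REAL)")

def upsert(rows):
    # Insert new keys, update existing ones -- the upsert/merge pattern
    conn.executemany(
        """INSERT INTO dim_driver (driver_id, conv_rate) VALUES (?, ?)
           ON CONFLICT(driver_id) DO UPDATE SET conv_rate = excluded.conv_rate""",
        rows,
    )

upsert([(1, 0.50), (2, 0.70)])   # both rows inserted
upsert([(2, 0.75), (3, 0.40)])   # 2 updated in place, 3 inserted

print(conn.execute(
    "SELECT driver_id, conv_rate FROM dim_driver ORDER BY driver_id").fetchall())
# → [(1, 0.5), (2, 0.75), (3, 0.4)]
```

The key design choice is the conflict key: it must uniquely identify a row in the destination, otherwise an upsert silently degenerates into append-with-duplicates.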

Data Versioning

Version your training datasets so you can reproduce any model's results:

# Using DVC for data versioning
# dvc init
# dvc add data/training_v2.parquet
# git add data/training_v2.parquet.dvc
# git commit -m "Add training dataset v2"
# dvc push

# Using Delta Lake for versioned tables
# (df is a Spark DataFrame, spark an active SparkSession)

# Each overwrite creates a new table version in the Delta transaction log
df.write.format("delta").mode("overwrite").save("/data/features")

# Read a specific version back ("time travel")
df_v2 = spark.read.format("delta").option("versionAsOf", 2).load("/data/features")
Always version your training data alongside your model artifacts. When a model degrades in production, you need to know exactly which data was used to train it. Link dataset versions to experiment tracking entries in MLflow or W&B.
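One lightweight way to make that link concrete is to record a content hash of the dataset file in the run's metadata. The sketch below uses the standard library only, with a dict standing in for a tracker entry; in MLflow you would log the same value as a run parameter:

```python
import hashlib
import tempfile
from pathlib import Path

def dataset_fingerprint(path):
    """Content hash of a dataset file -- a stable, reproducible version ID."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Toy file standing in for a real training Parquet file
data_file = Path(tempfile.mkdtemp()) / "training.parquet"
data_file.write_bytes(b"fake parquet bytes")

# Store the fingerprint alongside the run, as a tracker parameter would be
run_record = {
    "run_id": "exp-042",  # hypothetical experiment ID
    "dataset_sha256": dataset_fingerprint(data_file),
}
print(len(run_record["dataset_sha256"]))  # → 64 (hex chars of SHA-256)
```

When a production model degrades, comparing the logged hash against current data tells you immediately whether the dataset itself has drifted since training.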