# Data Loading for Machine Learning

Load transformed data into the right destinations — feature stores, data warehouses, and optimized file formats — for model training and real-time serving.
## Loading Destinations for ML

Where you load your data depends on how it will be consumed:
| Destination | Use Case | Access Pattern |
|---|---|---|
| Feature Store | Online + offline features | Key-value lookups, batch reads |
| Data Warehouse | Batch training datasets | SQL queries over large tables |
| Object Storage | Training data files | Sequential reads (Parquet, TFRecord) |
| Vector Database | Embedding similarity | Nearest neighbor search |
| Cache (Redis) | Real-time inference features | Sub-millisecond key lookups |
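The cache row in the table deserves a concrete shape. A minimal sketch of the pattern, with a plain dict standing in for Redis so it stays self-contained (with redis-py the equivalent calls would be `r.set(key, value)` and `r.get(key)`; the key scheme and function names here are illustrative, not a specific library's API):

```python
import json

# Illustrative in-process stand-in for a Redis cache.
cache = {}

def load_features(entity_type, entity_id, features):
    """Store a feature row under a composite key, e.g. 'features:driver:1001'."""
    key = f"features:{entity_type}:{entity_id}"
    cache[key] = json.dumps(features)  # serialize once at load time

def get_features(entity_type, entity_id):
    """Single key lookup at inference time; returns None on a cache miss."""
    key = f"features:{entity_type}:{entity_id}"
    raw = cache.get(key)
    return json.loads(raw) if raw is not None else None

load_features("driver", 1001, {"conv_rate": 0.82, "avg_daily_trips": 14})
print(get_features("driver", 1001)["avg_daily_trips"])  # 14
```

The composite key (`features:<entity_type>:<entity_id>`) keeps lookups O(1) per entity, which is what makes the sub-millisecond access pattern in the table achievable.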
## Loading to Feature Stores

Feature stores solve the training-serving skew problem (features computed one way for training and another way at inference) by providing a single source of truth for feature definitions and values, served both offline for batch training and online for low-latency inference:
```python
from datetime import datetime

from feast import Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

# Define a feature view in Feast.
# driver_stats_source is a batch source (e.g. a FileSource) defined
# elsewhere in the feature repo.
driver_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[Entity(name="driver_id", join_keys=["driver_id"])],
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,
)

# Materialize features into the online store for serving
store = FeatureStore(repo_path="feature_repo/")
store.materialize_incremental(end_date=datetime.now())
```
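On the offline side, retrieval must be point-in-time correct: each training example sees the latest feature value recorded at or before its own timestamp, never a future one, or labels leak into features. A stdlib-only sketch of that lookup (the data and function names are illustrative, not a Feast API):

```python
from bisect import bisect_right

# Hourly feature values for one driver: (timestamp, conv_rate),
# sorted by timestamp. Illustrative data.
history = [(1, 0.50), (3, 0.60), (7, 0.80)]
timestamps = [ts for ts, _ in history]

def point_in_time_value(event_ts):
    """Latest feature value at or before event_ts (None if none exists)."""
    i = bisect_right(timestamps, event_ts)
    return history[i - 1][1] if i > 0 else None

# A training example at t=5 must see the value from t=3, not the
# future value from t=7 -- otherwise the label would leak.
print(point_in_time_value(5))  # 0.6
print(point_in_time_value(0))  # None
```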
## Optimized File Formats
Choosing the right file format significantly impacts training speed and storage costs:
| Format | Compression | Column Access | Best For |
|---|---|---|---|
| Parquet | Excellent | Yes (columnar) | Tabular ML data, Spark/Pandas |
| TFRecord | Good | No (row-based) | TensorFlow training pipelines |
| Arrow/Feather | Good | Yes (columnar) | Fast in-memory interchange |
| Delta Lake | Excellent | Yes (columnar) | Versioned, ACID-compliant datasets |
| CSV | None | No | Small datasets, debugging only |
```python
import pyarrow as pa
import pyarrow.parquet as pq

# df is an existing pandas DataFrame of training examples
table = pa.Table.from_pandas(df)

# Write Hive-style partitioned Parquet files (date=.../category=...)
pq.write_to_dataset(
    table,
    root_path="s3://ml-data/training/",
    partition_cols=["date", "category"],
    compression="snappy",
)
```
## Loading Patterns

### Full Refresh

Replace the entire destination table on every load. Simple but expensive; use it for small reference tables or when incremental logic is too complex to maintain.
### Append Only

Add new records without modifying existing ones. Best for immutable event/log data and time-series features.
### Upsert (Merge)

Insert new records and update existing ones based on a key. The most common pattern for dimension tables.
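A runnable sketch of the upsert pattern using SQLite's `INSERT ... ON CONFLICT` (the table and column names are illustrative; warehouse engines express the same idea with a `MERGE` statement):

```python
import sqlite3

# In-memory table keyed on driver_id; the key drives conflict detection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_driver (driver_id INTEGER PRIMARY KEY, conv_rate REAL)"
)

def upsert(rows):
    """Insert new rows; on a key collision, update the existing row."""
    conn.executemany(
        """
        INSERT INTO dim_driver (driver_id, conv_rate) VALUES (?, ?)
        ON CONFLICT(driver_id) DO UPDATE SET conv_rate = excluded.conv_rate
        """,
        rows,
    )

upsert([(1, 0.50), (2, 0.70)])   # insert two new rows
upsert([(1, 0.55), (3, 0.90)])   # update driver 1, insert driver 3
print(conn.execute("SELECT * FROM dim_driver ORDER BY driver_id").fetchall())
# [(1, 0.55), (2, 0.7), (3, 0.9)]
```

The `excluded` pseudo-table refers to the row that failed to insert, which is what lets one statement cover both the insert and the update path.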
### Partition Overwrite

Overwrite specific partitions (e.g., by date) without touching the others. Balances the simplicity of full refresh with the efficiency of incremental loads.
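The same idea can be shown directly on a Hive-style directory layout (`date=YYYY-MM-DD/`), using only the standard library; engines such as Spark implement this natively (e.g. `mode("overwrite")` with `spark.sql.sources.partitionOverwriteMode=dynamic`). Paths and file names here are illustrative:

```python
import csv
import shutil
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "training"

def overwrite_partition(date, rows):
    """Replace one date partition's files, leaving all other partitions intact."""
    part = root / f"date={date}"
    if part.exists():
        shutil.rmtree(part)  # drop only this partition
    part.mkdir(parents=True)
    with open(part / "part-0000.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)

overwrite_partition("2024-01-01", [["a", 1]])
overwrite_partition("2024-01-02", [["b", 2]])
overwrite_partition("2024-01-01", [["a", 9]])  # rewrite one day only
print(sorted(p.name for p in root.iterdir()))
# ['date=2024-01-01', 'date=2024-01-02']
```

Rewriting one day's partition leaves every other day untouched, which is exactly the simplicity/efficiency trade-off described above.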
## Data Versioning
Version your training datasets so you can reproduce any model's results:
```shell
# Using DVC for data versioning
dvc init
dvc add data/training_v2.parquet
git add data/training_v2.parquet.dvc
git commit -m "Add training dataset v2"
dvc push
```
```python
# Using Delta Lake for versioned tables. Assumes `spark` is an active
# SparkSession configured with the Delta Lake extensions and `df` is
# a Spark DataFrame.

# Write with versioning (each overwrite creates a new table version)
df.write.format("delta").mode("overwrite").save("/data/features")

# Time-travel: read the table as of a specific version
df_v2 = spark.read.format("delta").option("versionAsOf", 2).load("/data/features")
```