# Data Architecture Patterns
Data architecture is the foundation of every AI system. The pattern you choose determines how fresh your features are, how fast you can retrain, and how much your pipeline costs to operate. This lesson covers the patterns that production teams actually use.
## Lambda vs. Kappa Architecture for ML
These two patterns solve the same problem — combining batch and real-time data — but with fundamentally different approaches.
### Lambda Architecture
Runs two parallel pipelines: a batch layer for comprehensive historical processing and a speed layer for real-time updates. Results are merged at serving time.
Lambda Architecture for ML feature pipeline:

```
            [Raw Events (Kafka)]
               /            \
              v              v
   [Batch Layer]          [Speed Layer]
   Spark/Dataflow         Flink/Kafka Streams
   Runs daily             Processes in real time
   Full history           Last few hours
   Complex features       Simple aggregations
         |                      |
         v                      v
[Batch Feature Store]   [Real-Time Feature Store]
   (BigQuery/S3)           (Redis/DynamoDB)
               \            /
                v          v
       [Serving Layer - merges both]
  batch_features + real_time_features → model input
```

Example features:
- Batch: `user_avg_purchase_90d`, `category_affinity_score`, `lifetime_value`
- Real-time: `items_viewed_this_session`, `cart_value`, `time_since_last_click`
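The serving-layer merge is usually just a dictionary union over two lookups; a minimal sketch with the stores stubbed as dicts (all entity IDs and feature names here are illustrative):

```python
# Toy serving-layer merge for a Lambda pipeline: fetch batch and
# real-time features separately, then combine them into one model
# input. Real-time values win on name collisions (they are fresher).
BATCH_STORE = {  # stand-in for BigQuery/S3 materialized features
    "user_42": {"user_avg_purchase_90d": 57.3, "lifetime_value": 1240.0},
}
REALTIME_STORE = {  # stand-in for Redis/DynamoDB
    "user_42": {"items_viewed_this_session": 7, "cart_value": 89.99},
}

def get_model_input(entity_id: str) -> dict:
    batch = BATCH_STORE.get(entity_id, {})
    realtime = REALTIME_STORE.get(entity_id, {})
    return {**batch, **realtime}  # real-time overrides batch on conflict

features = get_model_input("user_42")
```

In production the two `get` calls become a BigQuery read and a Redis `HMGET`, but the merge logic stays this simple.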
### Kappa Architecture
Uses a single streaming pipeline for everything. Historical reprocessing is done by replaying the event log from the beginning.
Kappa Architecture for ML feature pipeline:

```
[Raw Events (Kafka with long retention)]
                |
                v
        [Stream Processor]
        Flink / Kafka Streams
        Handles both real-time
        and historical replay
                |
                v
         [Feature Store]
       Single unified store
       (Redis + cold storage)
                |
                v
         [Model Serving]
```

- Retraining: replay the Kafka topic from offset 0
- Feature backfill: replay with the new feature logic
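Kappa's defining property — one update function serving both live processing and historical replay — can be shown without Kafka. A toy sketch (the event schema and function names are invented for illustration):

```python
# Each event: (user_id, amount). The same update function handles live
# events and a full replay from "offset 0", which is exactly how Kappa
# does retraining and feature backfill.
EVENT_LOG = [("u1", 10.0), ("u2", 5.0), ("u1", 20.0), ("u1", 2.5)]

def update(state: dict, event: tuple) -> dict:
    """Fold one event into per-user (total_spend, event_count) features."""
    user, amount = event
    total, count = state.get(user, (0.0, 0))
    state[user] = (total + amount, count + 1)
    return state

def replay(log, from_offset: int = 0) -> dict:
    """Rebuild feature state by replaying the log — Kappa-style backfill."""
    state = {}
    for event in log[from_offset:]:
        update(state, event)
    return state

features = replay(EVENT_LOG)                # full history (retraining)
recent = replay(EVENT_LOG, from_offset=2)   # partial replay
```

Changing the feature logic means editing `update` and replaying; there is no second batch codebase to keep in sync.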
### Decision Matrix: Lambda vs. Kappa
| Factor | Lambda | Kappa | Choose When |
|---|---|---|---|
| Complexity | High (two codebases) | Lower (one codebase) | Lambda if batch features need complex SQL joins |
| Feature freshness | Minutes to hours | Seconds to minutes | Kappa if all features must be near-real-time |
| Historical features | Easy (batch SQL) | Requires replay | Lambda if you need 90-day aggregations |
| Cost | Higher (two pipelines) | Lower (one pipeline) | Kappa if budget is tight |
| Debugging | Harder (dual paths) | Easier (single path) | Kappa for smaller teams |
| Production use | Most large companies | Simpler ML systems | Lambda is the industry default for complex ML |
## Feature Store Patterns
A feature store is a centralized system for managing, storing, and serving ML features. It solves the critical problem of training-serving skew — when features computed during training differ from features computed during serving.
### Online vs. Offline Feature Store
```python
# Feature Store Architecture
class FeatureStore:
    """
    Dual-store architecture used by Feast, Tecton, and most custom
    implementations. RedisClient and BigQueryClient are placeholder
    wrappers for illustration, not real library classes.
    """
    def __init__(self):
        # Online store: low-latency lookups for serving.
        # Backed by Redis, DynamoDB, or Bigtable.
        # Stores the latest feature values per entity.
        self.online_store = RedisClient(
            host="redis-cluster.internal",
            read_timeout_ms=5,  # must be fast
        )
        # Offline store: full history for training.
        # Backed by BigQuery, S3/Parquet, or Delta Lake.
        # Stores all historical feature values with timestamps.
        self.offline_store = BigQueryClient(dataset="ml_features")

    def get_online_features(self, entity_id: str, feature_names: list) -> dict:
        """Called during inference. Must return in < 10 ms p99."""
        return self.online_store.hmget(f"features:{entity_id}", feature_names)

    def get_training_data(self, entity_ids: list, features: list,
                          start_date: str, end_date: str) -> "DataFrame":
        """Called during training. Point-in-time correct join."""
        return self.offline_store.query(f"""
            SELECT e.entity_id, e.timestamp, {', '.join(features)}
            FROM entities e
            LEFT JOIN feature_table f
              ON e.entity_id = f.entity_id
              AND f.timestamp <= e.timestamp  -- point-in-time correctness!
              AND f.timestamp >= TIMESTAMP_SUB(e.timestamp, INTERVAL 1 DAY)
            WHERE e.timestamp BETWEEN '{start_date}' AND '{end_date}'
        """)

    def materialize(self, feature_name: str):
        """Sync offline store -> online store (runs on a schedule)."""
        latest_values = self.offline_store.get_latest(feature_name)
        self.online_store.bulk_set(latest_values)
```
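The point-in-time join above is what prevents label leakage: each training row may only see feature values known at that row's timestamp. pandas expresses the same idea with `merge_asof`, which picks the latest feature value at or before each label timestamp (toy data, assuming pandas is available):

```python
import pandas as pd

# Label rows: when each training example occurred.
labels = pd.DataFrame({
    "entity_id": ["u1", "u1"],
    "timestamp": pd.to_datetime(["2024-01-10", "2024-01-20"]),
})
# Feature rows: when each feature value became known.
features = pd.DataFrame({
    "entity_id": ["u1", "u1", "u1"],
    "timestamp": pd.to_datetime(["2024-01-05", "2024-01-15", "2024-01-25"]),
    "avg_purchase": [10.0, 20.0, 30.0],
})

# For each label row, take the most recent feature value with
# feature.timestamp <= label.timestamp — no peeking into the future.
train = pd.merge_asof(
    labels.sort_values("timestamp"),
    features.sort_values("timestamp"),
    on="timestamp", by="entity_id", direction="backward",
)
```

The 2024-01-20 label gets the 2024-01-15 value (20.0), never the 2024-01-25 value that did not yet exist — a naive equi-join on `entity_id` would have leaked it.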
### Feature Store Options
| Solution | Type | Online Store | Offline Store | Best For |
|---|---|---|---|---|
| Feast | Open source | Redis, DynamoDB | BigQuery, S3, Redshift | Teams wanting control, no vendor lock-in |
| Tecton | Managed | DynamoDB | S3/Spark | Teams needing streaming features with low ops burden |
| Vertex AI FS | Managed (GCP) | Bigtable | BigQuery | GCP-native teams |
| SageMaker FS | Managed (AWS) | In-memory | S3 | AWS-native teams |
| Custom (Redis + BQ) | DIY | Redis Cluster | BigQuery/S3 | Teams with specific requirements or existing infra |
## Data Lake vs. Data Warehouse for ML
| Aspect | Data Lake (S3/GCS + Spark) | Data Warehouse (BigQuery/Snowflake) | Lakehouse (Delta Lake/Iceberg) |
|---|---|---|---|
| Schema | Schema-on-read (flexible) | Schema-on-write (strict) | Schema enforcement + evolution |
| Data types | Any (images, logs, parquet) | Structured (tables) | Structured + semi-structured |
| Cost | $0.023/GB/month (S3) | $5–$25/TB scanned | $0.023/GB + compute |
| ML training | Direct read by Spark/PyTorch | Export to files first | Direct read + ACID transactions |
| Data quality | Manual validation | Built-in constraints | Schema enforcement + expectations |
| Time travel | Manual versioning | Limited (7–90 days) | Full version history |
| Best for ML | Raw data storage, unstructured data | Feature engineering SQL, analytics | Production ML pipelines (recommended) |
## Data Versioning Strategies
You cannot reproduce a model without reproducing its training data. Data versioning is essential for debugging, compliance, and rollback.
### DVC (Data Version Control)
**How it works:** Git-style commands for data files. Metadata lives in Git; the actual data lives in S3/GCS. `dvc add data/training.parquet` creates a `.dvc` pointer file that Git tracks.
**Best for:** small-to-medium teams with file-based workflows.
**Limitation:** does not handle streaming or real-time data well.
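The pointer-file mechanism is simple enough to sketch: hash the data file, write a small metadata file for Git to track, and leave the bytes themselves for object storage. A toy approximation (not DVC's actual file format):

```python
import hashlib
import json
from pathlib import Path

def make_pointer(data_path: Path) -> Path:
    """Write a .dvc-style pointer file next to the data file."""
    md5 = hashlib.md5(data_path.read_bytes()).hexdigest()
    pointer = Path(str(data_path) + ".dvc")
    pointer.write_text(json.dumps({
        "md5": md5,                       # content address of the data
        "size": data_path.stat().st_size,
        "path": data_path.name,           # Git tracks the pointer, not the data
    }, indent=2))
    return pointer
```

Commit the `.dvc` file to Git and push the data blob (keyed by its hash) to S3/GCS; checking out an old Git commit then tells you exactly which data bytes to pull back.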
### Delta Lake
**How it works:** ACID transactions on top of Parquet files in S3/GCS; every write creates a new table version. Write with `df.write.format("delta").mode("append").save(path)`; time travel with `spark.read.format("delta").option("versionAsOf", 5).load(path)`.
**Best for:** large-scale ML pipelines on Spark/Databricks.
### Apache Iceberg
**How it works:** a table format with snapshot isolation, schema evolution, and time travel. Works with Spark, Trino, Flink, and most other query engines.
**Best for:** multi-engine environments where different teams use different tools.
### LakeFS
**How it works:** Git-like branching for data lakes. Create a branch of your S3 data, experiment with transformations, and merge when validated.
**Best for:** teams that want a Git-style workflow for data experimentation.
## Data Architecture Decision Matrix
Use this matrix to make data architecture decisions for your AI system. Match your requirements to the recommended pattern.
```
# Data Architecture Decision Matrix

QUESTION 1: Do you need real-time features for inference?
  YES → you need an online feature store (Redis/DynamoDB)
  NO  → batch precomputation is sufficient (simpler, cheaper)

QUESTION 2: How complex are your features?
  Simple aggregations (count, sum, avg) → Kappa architecture (streaming only)
  Complex joins + window functions      → Lambda architecture (batch + streaming)
  Just raw data, no aggregation         → skip feature engineering, embed at serving time

QUESTION 3: How much data do you have?
  < 10 GB       → just use PostgreSQL + CSV files. Seriously.
  10-500 GB     → data lake (S3/GCS) + DVC for versioning
  500 GB-10 TB  → lakehouse (Delta Lake/Iceberg) + feature store
  > 10 TB       → full Lambda architecture + managed feature store (Tecton/Vertex)

QUESTION 4: How many teams share the data?
  1 team     → keep it simple: local feature pipelines, DVC
  2-5 teams  → a feature store becomes critical for consistency
  5+ teams   → managed feature store + data catalog + governance

QUESTION 5: What are your compliance requirements?
  None   → flexible schema, minimal governance
  SOC 2  → audit logs, access control, encryption at rest
  GDPR   → data lineage, deletion capabilities, consent tracking
  HIPAA  → all of the above + BAA with cloud provider, strict access control
```
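The volume and freshness questions collapse into a small lookup you can keep in a runbook; a sketch with the thresholds copied from the matrix (the function names are invented for illustration):

```python
def recommend_storage(data_gb: float) -> str:
    """Map data volume to the storage tier from QUESTION 3."""
    if data_gb < 10:
        return "PostgreSQL + CSV files"
    if data_gb < 500:
        return "Data lake (S3/GCS) + DVC"
    if data_gb <= 10_000:  # up to 10 TB
        return "Lakehouse (Delta Lake/Iceberg) + feature store"
    return "Lambda architecture + managed feature store"

def recommend_architecture(realtime: bool, complex_features: bool) -> str:
    """QUESTION 1 and QUESTION 2 combined."""
    if not realtime:
        return "Batch precomputation"
    return ("Lambda (batch + streaming)" if complex_features
            else "Kappa (streaming only)")
```

Team count and compliance (QUESTIONS 4 and 5) layer governance on top of whichever tier these return; they rarely change the storage choice itself.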