# Data Architecture Patterns
Data architecture is the foundation of every AI system. The pattern you choose determines how fresh your features are, how fast you can retrain, and how much your pipeline costs to operate. This lesson covers the patterns that production teams actually use.
## Lambda vs. Kappa Architecture for ML
These two patterns solve the same problem — combining batch and real-time data — but with fundamentally different approaches.
### Lambda Architecture
Runs two parallel pipelines: a batch layer for comprehensive historical processing and a speed layer for real-time updates. Results are merged at serving time.
Lambda Architecture for ML feature pipeline:

```
            [Raw Events (Kafka)]
               /            \
              v              v
   [Batch Layer]          [Speed Layer]
   Spark/Dataflow         Flink/Kafka Streams
   Runs daily             Processes in real time
   Full history           Last few hours
   Complex features       Simple aggregations
         |                      |
         v                      v
[Batch Feature Store]   [Real-Time Feature Store]
   (BigQuery/S3)           (Redis/DynamoDB)
               \            /
                v          v
       [Serving Layer - merges both]
  batch_features + real_time_features → model input
```

Example features:
- Batch: `user_avg_purchase_90d`, `category_affinity_score`, `lifetime_value`
- Real-time: `items_viewed_this_session`, `cart_value`, `time_since_last_click`
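The serving-layer merge is usually just a dictionary union over two lookups; a minimal sketch with the stores stubbed as dicts (all entity IDs and feature names here are illustrative):

```python
# Toy serving-layer merge for a Lambda pipeline: fetch batch and
# real-time features separately, then combine them into one model
# input. Real-time values win on name collisions (they are fresher).
BATCH_STORE = {  # stand-in for BigQuery/S3 materialized features
    "user_42": {"user_avg_purchase_90d": 57.3, "lifetime_value": 1240.0},
}
REALTIME_STORE = {  # stand-in for Redis/DynamoDB
    "user_42": {"items_viewed_this_session": 7, "cart_value": 89.99},
}

def get_model_input(entity_id: str) -> dict:
    batch = BATCH_STORE.get(entity_id, {})
    realtime = REALTIME_STORE.get(entity_id, {})
    return {**batch, **realtime}  # real-time overrides batch on conflict

features = get_model_input("user_42")
```

In production the two `get` calls become a BigQuery read and a Redis `HMGET`, but the merge logic stays this simple.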
### Kappa Architecture
Uses a single streaming pipeline for everything. Historical reprocessing is done by replaying the event log from the beginning.
Kappa Architecture for ML feature pipeline:

```
[Raw Events (Kafka with long retention)]
                |
                v
        [Stream Processor]
        Flink / Kafka Streams
        Handles both real-time
        and historical replay
                |
                v
         [Feature Store]
       Single unified store
       (Redis + cold storage)
                |
                v
         [Model Serving]
```

- Retraining: replay the Kafka topic from offset 0
- Feature backfill: replay with the new feature logic
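Kappa's defining property — one update function serving both live processing and historical replay — can be shown without Kafka. A toy sketch (the event schema and function names are invented for illustration):

```python
# Each event: (user_id, amount). The same update function handles live
# events and a full replay from "offset 0", which is exactly how Kappa
# does retraining and feature backfill.
EVENT_LOG = [("u1", 10.0), ("u2", 5.0), ("u1", 20.0), ("u1", 2.5)]

def update(state: dict, event: tuple) -> dict:
    """Fold one event into per-user (total_spend, event_count) features."""
    user, amount = event
    total, count = state.get(user, (0.0, 0))
    state[user] = (total + amount, count + 1)
    return state

def replay(log, from_offset: int = 0) -> dict:
    """Rebuild feature state by replaying the log — Kappa-style backfill."""
    state = {}
    for event in log[from_offset:]:
        update(state, event)
    return state

features = replay(EVENT_LOG)                # full history (retraining)
recent = replay(EVENT_LOG, from_offset=2)   # partial replay
```

Changing the feature logic means editing `update` and replaying; there is no second batch codebase to keep in sync.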
### Decision Matrix: Lambda vs. Kappa
| Factor | Lambda | Kappa | Choose When |
|---|---|---|---|
| Complexity | High (two codebases) | Lower (one codebase) | Lambda if batch features need complex SQL joins |
| Feature freshness | Minutes to hours | Seconds to minutes | Kappa if all features must be near-real-time |
| Historical features | Easy (batch SQL) | Requires replay | Lambda if you need 90-day aggregations |
| Cost | Higher (two pipelines) | Lower (one pipeline) | Kappa if budget is tight |
| Debugging | Harder (dual paths) | Easier (single path) | Kappa for smaller teams |
| Production use | Most large companies | Simpler ML systems | Lambda is the industry default for complex ML |
## Feature Store Patterns
A feature store is a centralized system for managing, storing, and serving ML features. It solves the critical problem of training-serving skew — when features computed during training differ from features computed during serving.
### Online vs. Offline Feature Store
```python
# Feature Store Architecture
class FeatureStore:
    """
    Dual-store architecture used by Feast, Tecton, and most custom
    implementations. RedisClient and BigQueryClient are placeholder
    wrappers for illustration, not real library classes.
    """
    def __init__(self):
        # Online store: low-latency lookups for serving.
        # Backed by Redis, DynamoDB, or Bigtable.
        # Stores the latest feature values per entity.
        self.online_store = RedisClient(
            host="redis-cluster.internal",
            read_timeout_ms=5,  # must be fast
        )
        # Offline store: full history for training.
        # Backed by BigQuery, S3/Parquet, or Delta Lake.
        # Stores all historical feature values with timestamps.
        self.offline_store = BigQueryClient(dataset="ml_features")

    def get_online_features(self, entity_id: str, feature_names: list) -> dict:
        """Called during inference. Must return in < 10 ms p99."""
        return self.online_store.hmget(f"features:{entity_id}", feature_names)

    def get_training_data(self, entity_ids: list, features: list,
                          start_date: str, end_date: str) -> "DataFrame":
        """Called during training. Point-in-time correct join."""
        return self.offline_store.query(f"""
            SELECT e.entity_id, e.timestamp, {', '.join(features)}
            FROM entities e
            LEFT JOIN feature_table f
              ON e.entity_id = f.entity_id
              AND f.timestamp <= e.timestamp  -- point-in-time correctness!
              AND f.timestamp >= TIMESTAMP_SUB(e.timestamp, INTERVAL 1 DAY)
            WHERE e.timestamp BETWEEN '{start_date}' AND '{end_date}'
        """)

    def materialize(self, feature_name: str):
        """Sync offline store -> online store (runs on a schedule)."""
        latest_values = self.offline_store.get_latest(feature_name)
        self.online_store.bulk_set(latest_values)
```
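The point-in-time join above is what prevents label leakage: each training row may only see feature values known at that row's timestamp. pandas expresses the same idea with `merge_asof`, which picks the latest feature value at or before each label timestamp (toy data, assuming pandas is available):

```python
import pandas as pd

# Label rows: when each training example occurred.
labels = pd.DataFrame({
    "entity_id": ["u1", "u1"],
    "timestamp": pd.to_datetime(["2024-01-10", "2024-01-20"]),
})
# Feature rows: when each feature value became known.
features = pd.DataFrame({
    "entity_id": ["u1", "u1", "u1"],
    "timestamp": pd.to_datetime(["2024-01-05", "2024-01-15", "2024-01-25"]),
    "avg_purchase": [10.0, 20.0, 30.0],
})

# For each label row, take the most recent feature value with
# feature.timestamp <= label.timestamp — no peeking into the future.
train = pd.merge_asof(
    labels.sort_values("timestamp"),
    features.sort_values("timestamp"),
    on="timestamp", by="entity_id", direction="backward",
)
```

The 2024-01-20 label gets the 2024-01-15 value (20.0), never the 2024-01-25 value that did not yet exist — a naive equi-join on `entity_id` would have leaked it.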
### Feature Store Options
| Solution | Type | Online Store | Offline Store | Best For |
|---|---|---|---|---|
| Feast | Open source | Redis, DynamoDB | BigQuery, S3, Redshift | Teams wanting control, no vendor lock-in |
| Tecton | Managed | DynamoDB | S3/Spark | Teams needing streaming features with low ops burden |
| Vertex AI FS | Managed (GCP) | Bigtable | BigQuery | GCP-native teams |
| SageMaker FS | Managed (AWS) | In-memory | S3 | AWS-native teams |
| Custom (Redis + BQ) | DIY | Redis Cluster | BigQuery/S3 | Teams with specific requirements or existing infra |
## Data Lake vs. Data Warehouse for ML
| Aspect | Data Lake (S3/GCS + Spark) | Data Warehouse (BigQuery/Snowflake) | Lakehouse (Delta Lake/Iceberg) |
|---|---|---|---|
| Schema | Schema-on-read (flexible) | Schema-on-write (strict) | Schema enforcement + evolution |
| Data types | Any (images, logs, parquet) | Structured (tables) | Structured + semi-structured |
| Cost | $0.023/GB/month (S3) | $5–$25/TB scanned | $0.023/GB + compute |
| ML training | Direct read by Spark/PyTorch | Export to files first | Direct read + ACID transactions |
| Data quality | Manual validation | Built-in constraints | Schema enforcement + expectations |
| Time travel | Manual versioning | Limited (7–90 days) | Full version history |
| Best for ML | Raw data storage, unstructured data | Feature engineering SQL, analytics | Production ML pipelines (recommended) |
## Data Versioning Strategies
You cannot reproduce a model without reproducing its training data. Data versioning is essential for debugging, compliance, and rollback.
### DVC (Data Version Control)
**How it works:** Git-style commands for data files. Metadata lives in Git; the actual data lives in S3/GCS. `dvc add data/training.parquet` creates a `.dvc` pointer file that Git tracks.
**Best for:** small-to-medium teams with file-based workflows.
**Limitation:** does not handle streaming or real-time data well.
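The pointer-file mechanism is simple enough to sketch: hash the data file, write a small metadata file for Git to track, and leave the bytes themselves for object storage. A toy approximation (not DVC's actual file format):

```python
import hashlib
import json
from pathlib import Path

def make_pointer(data_path: Path) -> Path:
    """Write a .dvc-style pointer file next to the data file."""
    md5 = hashlib.md5(data_path.read_bytes()).hexdigest()
    pointer = Path(str(data_path) + ".dvc")
    pointer.write_text(json.dumps({
        "md5": md5,                       # content address of the data
        "size": data_path.stat().st_size,
        "path": data_path.name,           # Git tracks the pointer, not the data
    }, indent=2))
    return pointer
```

Commit the `.dvc` file to Git and push the data blob (keyed by its hash) to S3/GCS; checking out an old Git commit then tells you exactly which data bytes to pull back.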
### Delta Lake
**How it works:** ACID transactions on top of Parquet files in S3/GCS; every write creates a new table version. Write with `df.write.format("delta").mode("append").save(path)`; time travel with `spark.read.format("delta").option("versionAsOf", 5).load(path)`.
**Best for:** large-scale ML pipelines on Spark/Databricks.
### Apache Iceberg
**How it works:** a table format with snapshot isolation, schema evolution, and time travel. Works with Spark, Trino, Flink, and most other query engines.
**Best for:** multi-engine environments where different teams use different tools.
### LakeFS
**How it works:** Git-like branching for data lakes. Create a branch of your S3 data, experiment with transformations, and merge when validated.
**Best for:** teams that want a Git-style workflow for data experimentation.
## Data Architecture Decision Matrix
Use this matrix to make data architecture decisions for your AI system. Match your requirements to the recommended pattern.
```
# Data Architecture Decision Matrix

QUESTION 1: Do you need real-time features for inference?
  YES → you need an online feature store (Redis/DynamoDB)
  NO  → batch precomputation is sufficient (simpler, cheaper)

QUESTION 2: How complex are your features?
  Simple aggregations (count, sum, avg) → Kappa architecture (streaming only)
  Complex joins + window functions      → Lambda architecture (batch + streaming)
  Just raw data, no aggregation         → skip feature engineering, embed at serving time

QUESTION 3: How much data do you have?
  < 10 GB       → just use PostgreSQL + CSV files. Seriously.
  10-500 GB     → data lake (S3/GCS) + DVC for versioning
  500 GB-10 TB  → lakehouse (Delta Lake/Iceberg) + feature store
  > 10 TB       → full Lambda architecture + managed feature store (Tecton/Vertex)

QUESTION 4: How many teams share the data?
  1 team     → keep it simple: local feature pipelines, DVC
  2-5 teams  → a feature store becomes critical for consistency
  5+ teams   → managed feature store + data catalog + governance

QUESTION 5: What are your compliance requirements?
  None   → flexible schema, minimal governance
  SOC 2  → audit logs, access control, encryption at rest
  GDPR   → data lineage, deletion capabilities, consent tracking
  HIPAA  → all of the above + BAA with cloud provider, strict access control
```
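The volume and freshness questions collapse into a small lookup you can keep in a runbook; a sketch with the thresholds copied from the matrix (the function names are invented for illustration):

```python
def recommend_storage(data_gb: float) -> str:
    """Map data volume to the storage tier from QUESTION 3."""
    if data_gb < 10:
        return "PostgreSQL + CSV files"
    if data_gb < 500:
        return "Data lake (S3/GCS) + DVC"
    if data_gb <= 10_000:  # up to 10 TB
        return "Lakehouse (Delta Lake/Iceberg) + feature store"
    return "Lambda architecture + managed feature store"

def recommend_architecture(realtime: bool, complex_features: bool) -> str:
    """QUESTION 1 and QUESTION 2 combined."""
    if not realtime:
        return "Batch precomputation"
    return ("Lambda (batch + streaming)" if complex_features
            else "Kappa (streaming only)")
```

Team count and compliance (QUESTIONS 4 and 5) layer governance on top of whichever tier these return; they rarely change the storage choice itself.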