How AI Systems Differ from Traditional Systems
Traditional software is deterministic: same input, same output, every time. AI systems break this contract in fundamental ways that affect every architecture decision you make. This lesson covers what changes and why it matters for system design.
The Four Fundamental Differences
When you move from traditional software to AI-powered systems, four properties change simultaneously. Each one creates architecture challenges that traditional patterns do not address.
1. Non-Deterministic Outputs
A REST API that fetches a user record always returns the same data for the same user ID. An LLM given the same prompt may generate different text every time. A classification model may change its prediction after retraining on new data. You cannot write unit tests the same way. You need statistical testing, evaluation pipelines, and output validation layers.
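A minimal sketch of what statistical testing looks like in practice: instead of asserting an exact output string, you assert an aggregate metric over a labelled evaluation set. The `classify` function below is a hypothetical stand-in for a real model call (with a pinned random seed so the sketch is reproducible), not a real API.

```python
import random

random.seed(0)  # pinned for reproducibility in this sketch

def classify(text: str) -> str:
    """Stand-in for a real model; real outputs vary run to run."""
    return "positive" if "good" in text or random.random() > 0.9 else "negative"

# Small labelled evaluation set
eval_set = [
    ("good product, fast shipping", "positive"),
    ("good value for the price", "positive"),
    ("arrived broken", "negative"),
    ("never buying again", "negative"),
]

def evaluate(model, dataset) -> float:
    correct = sum(1 for text, label in dataset if model(text) == label)
    return correct / len(dataset)

# Statistical assertion: the suite passes if aggregate accuracy clears
# a threshold, not if every output matches an exact expected string.
accuracy = evaluate(classify, eval_set)
assert accuracy >= 0.75, f"accuracy {accuracy:.2f} below threshold"
```

The threshold (0.75 here) is a product decision: tight enough to catch regressions, loose enough to tolerate normal output variance.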
2. Data-Dependent Behavior
In traditional systems, behavior is defined by code. In AI systems, behavior is defined by data. If your training data is biased, your model is biased. If your data distribution shifts, your model degrades silently. This means your data pipeline is not just infrastructure — it is part of your application logic.
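A toy illustration of that point: the "training" function below is deliberately trivial (it just computes a mean threshold), but it shows how the same code path produces different behavior depending entirely on the data it was fed.

```python
def train_mean_threshold(examples: list[float]) -> float:
    """'Training' here is just computing a threshold from data:
    the model's behavior is entirely a function of what it was fed."""
    return sum(examples) / len(examples)

def predict(threshold: float, x: float) -> str:
    return "high" if x > threshold else "low"

# Same code, two datasets, two different behaviors.
model_a = train_mean_threshold([1.0, 2.0, 3.0])     # threshold 2.0
model_b = train_mean_threshold([10.0, 20.0, 30.0])  # threshold 20.0

print(predict(model_a, 5.0))  # high
print(predict(model_b, 5.0))  # low
```

Swap the training set and the deployed behavior flips, with no code change and no deploy — which is why data validation belongs in the same review pipeline as code.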
3. Computationally Expensive
A traditional API call costs fractions of a cent and takes milliseconds. A single LLM inference call can cost $0.01–$0.10 and take 1–30 seconds. Training a model can cost $100–$10M. These cost differences of 100–10,000x demand fundamentally different architecture approaches to caching, batching, and resource allocation.
4. Continuous Evolution
Traditional software changes when developers deploy new code. AI systems change when data changes, models are retrained, or the world shifts. You need infrastructure for model versioning, A/B testing between model versions, gradual rollouts, and automated monitoring that detects when a model starts performing worse.
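A gradual rollout can be as simple as deterministic per-user bucketing between a stable and a challenger model version. This is an illustrative sketch (the version strings and 5% split are assumptions for the example, not a prescribed setup):

```python
import hashlib

MODEL_VERSIONS = {"stable": "rec-v2.3.1", "challenger": "rec-v2.4.0"}
CHALLENGER_PCT = 5  # gradual rollout: send ~5% of users to the new model

def route_model(user_id: str) -> str:
    """Deterministic per-user bucketing: a given user always hits the
    same model version, which keeps A/B metrics clean."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    key = "challenger" if bucket < CHALLENGER_PCT else "stable"
    return MODEL_VERSIONS[key]

assert route_model("user-123") == route_model("user-123")  # stable assignment
```

Hashing the user ID (rather than randomizing per request) matters: a user who flips between model versions mid-session pollutes both arms of the experiment.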
Key Components of a Production AI System
Every production AI system, regardless of whether it uses computer vision, NLP, or recommendation algorithms, contains these core components. Understanding them is the foundation for all architecture decisions in this course.
1. Data Pipeline
The data pipeline ingests, transforms, validates, and stores data for both training and inference. In production, this is typically the largest and most complex component.
# Typical data pipeline stages for a recommendation system
pipeline_stages = {
    "ingestion": {
        "sources": ["clickstream_kafka", "user_profiles_postgres", "product_catalog_api"],
        "frequency": "real-time for clicks, hourly for profiles, daily for catalog",
        "volume": "~50M click events/day, ~2M profile updates/day"
    },
    "transformation": {
        "feature_engineering": ["click_rate_7d", "category_affinity", "session_duration_avg"],
        "joins": "click events JOIN user profiles JOIN product metadata",
        "output": "feature vectors for training and serving"
    },
    "validation": {
        "schema_checks": "all required fields present, types correct",
        "distribution_checks": "feature means within 2 stddev of historical",
        "freshness_checks": "no data older than 2 hours in real-time path"
    },
    "storage": {
        "training_data": "S3/GCS parquet files, partitioned by date",
        "feature_store": "Redis for online features, BigQuery for offline features",
        "metadata": "ML metadata store (experiment tracking)"
    }
}
2. Training Infrastructure
The training pipeline takes prepared data and produces model artifacts. In production, this runs on a schedule or is triggered by data quality signals.
- Experiment tracking: Record hyperparameters, metrics, and artifacts for every training run (MLflow, Weights & Biases, Neptune)
- Compute orchestration: Allocate and release GPU resources on demand (Kubernetes, SageMaker, Vertex AI)
- Model registry: Version and store trained model artifacts with metadata (MLflow Model Registry, Vertex AI Model Registry)
- Validation gates: Automated checks that prevent bad models from reaching production (accuracy thresholds, bias checks, latency benchmarks)
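A validation gate can be sketched as a pure function over candidate and baseline metrics; the thresholds below are illustrative assumptions, not recommended values:

```python
def validation_gate(candidate: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Automated checks run before a model is promoted to the registry.
    Thresholds here are illustrative; real ones are product-specific."""
    failures = []
    if candidate["accuracy"] < baseline["accuracy"] - 0.01:
        failures.append("accuracy regressed more than 1 point vs. baseline")
    if candidate["p99_latency_ms"] > 100:
        failures.append("p99 latency exceeds the 100ms serving budget")
    if candidate["bias_gap"] > 0.05:
        failures.append("subgroup performance gap above 5%")
    return len(failures) == 0, failures

baseline = {"accuracy": 0.91, "p99_latency_ms": 80, "bias_gap": 0.02}
candidate = {"accuracy": 0.93, "p99_latency_ms": 140, "bias_gap": 0.02}

passed, reasons = validation_gate(candidate, baseline)
# passed is False: the candidate is more accurate but too slow to promote
```

Note the key property: a model can improve on the headline metric and still fail the gate, because serving constraints and fairness checks are evaluated independently.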
3. Model Serving
The serving layer loads trained models and handles inference requests. This is where latency, throughput, and cost constraints collide.
# Simplified model serving architecture
# (load_model, FeatureStoreClient, the error types, and the private helper
# methods are assumed to exist elsewhere; this sketch shows the control flow)
class ModelServer:
    def __init__(self):
        self.model = load_model("s3://models/rec-v2.3.1/model.pt")
        self.feature_store = FeatureStoreClient(endpoint="redis:6379")
        self.fallback_model = load_model("s3://models/rec-v1-simple/model.pt")

    async def predict(self, user_id: str, context: dict) -> list:
        try:
            # Fetch real-time features (p99 latency: 5ms)
            features = await self.feature_store.get_features(
                entity_id=user_id,
                feature_names=["click_rate_7d", "category_affinity", "session_recency"]
            )
            # Run inference (p99 latency: 25ms on GPU)
            predictions = self.model.predict(features, context)
            return self._postprocess(predictions)
        except ModelTimeoutError:
            # Fall back to a simpler, faster model
            return await self._fallback_predict(user_id, context)
        except FeatureStoreError:
            # Serve with default features if the feature store is down
            return await self._predict_with_defaults(user_id, context)
4. Monitoring and Observability
AI systems need monitoring beyond traditional application metrics. You must track model-specific signals that indicate degradation.
| Metric Category | What to Monitor | Alert Threshold Example |
|---|---|---|
| Model performance | Accuracy, precision, recall, NDCG on live traffic | NDCG drops >5% vs. baseline over 1 hour |
| Data quality | Feature distributions, missing values, schema violations | >1% null rate on required features |
| Serving health | Latency (p50/p95/p99), error rate, throughput (QPS) | p99 latency >100ms, error rate >0.1% |
| Data drift | Input feature distributions vs. training distribution | KL divergence >0.1 on any feature |
| Business metrics | Click-through rate, conversion, revenue per session | CTR drops >10% vs. previous week |
| Cost | GPU utilization, inference cost per request, total spend | Cost per request >$0.05 or utilization <40% |
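The drift row in the table above can be implemented with a few lines of plain Python. This sketch compares histograms of a feature's live values against its training values using KL divergence; the toy data and the 5-bin histogram are assumptions for the example.

```python
import math
from collections import Counter

def histogram(values, bins):
    """Bucket values in [0, 1) into equal-width bins, normalized to sum to 1."""
    counts = Counter(min(int(v * bins), bins - 1) for v in values)
    return [counts.get(i, 0) / len(values) for i in range(bins)]

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) between two discrete histograms over the same bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy data: the live feature has clearly shifted upward vs. training
training_values = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6]
live_values     = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9]

drift = kl_divergence(histogram(live_values, 5), histogram(training_values, 5))
drift_alert = drift > 0.1  # alert threshold from the table above
```

Production monitoring runs this comparison per feature on a schedule, against a frozen snapshot of the training distribution, and routes alerts through the same paging infrastructure as serving-health alarms.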
Common Architecture Patterns Overview
Production AI systems typically follow one of these architecture patterns. We will explore each in detail in later lessons.
Batch Prediction
Run inference on all items periodically (e.g., nightly) and store results. Fast serving (just a lookup), but predictions are stale. Good for: product recommendations, email personalization, risk scoring.
Real-Time Prediction
Run inference on each request as it arrives. Fresh predictions but requires low-latency serving infrastructure and GPU resources. Good for: search ranking, fraud detection, chatbots.
Hybrid (Near-Real-Time)
Precompute base predictions in batch, then adjust with real-time signals at serving time. Balances freshness with cost. Good for: news feeds, ad ranking, dynamic pricing.
Embedding-Based
Precompute embeddings for all items/users, then use approximate nearest neighbor search at serving time. Extremely fast and scalable. Good for: similar item recommendations, semantic search, content deduplication.
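Conceptually, the embedding-based pattern is just "precompute vectors, then find nearest neighbors at serving time." The sketch below uses brute-force exact cosine similarity over random embeddings as a stand-in; a production system would replace the matrix multiply with an approximate index such as FAISS or ScaNN.

```python
import numpy as np

rng = np.random.default_rng(42)

# Precomputed, unit-normalized item embeddings
# (in production: millions of items behind an ANN index)
item_embeddings = rng.normal(size=(1000, 64)).astype(np.float32)
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

def similar_items(query_idx: int, k: int = 5) -> list[int]:
    """Exact nearest-neighbor lookup by cosine similarity.
    ANN indexes trade a little recall for sub-millisecond search."""
    scores = item_embeddings @ item_embeddings[query_idx]
    top = np.argsort(-scores)[: k + 1]  # +1: the item matches itself
    return [int(i) for i in top if i != query_idx][:k]

neighbors = similar_items(7)
```

The serving-time work is one dot product per candidate plus a sort, which is why this pattern scales so well: all the expensive model inference happened offline when the embeddings were computed.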
Real Example: Production Recommendation System Architecture
Here is how a real e-commerce recommendation system is architected at scale. This example handles 50,000 requests per second with p99 latency under 100ms.
Architecture: E-Commerce Recommendation System (50K QPS)
User Request
     |
     v
[API Gateway / Load Balancer]
     |
     v
[Recommendation Service] ----------------> [Feature Store (Redis)]
     |                                            ^
     |                                            |
     |--- Real-time features ------------> [Streaming Pipeline (Kafka + Flink)]
     |                                            ^
     |                                            |
     |                                     [Clickstream Events]
     |
     |--- Candidate Generation ----------> [ANN Index (FAISS/ScaNN)]
     |                                            ^
     |                                            |
     |                                     [Embedding Model (batch updated daily)]
     |
     |--- Ranking Model -----------------> [GPU Inference Cluster (Triton)]
     |                                            ^
     |                                            |
     |                                     [Training Pipeline (daily retrain)]
     |                                            ^
     |                                            |
     |                                     [Feature Store (BigQuery - offline)]
     |
     |--- Business Rules ----------------> [Rule Engine]
     |        (filter blocked items,
     |         apply diversity rules,
     |         enforce inventory)
     |
     v
[Response: ranked list of 20 items]
Key Design Decisions:
- Two-stage ranking: ANN retrieves 500 candidates, GPU model ranks top 20
- Feature store split: Redis for real-time (session data), BigQuery for batch (user history)
- Daily model retrain with hourly feature updates balances freshness vs. cost
- Fallback: if GPU cluster is down, serve from precomputed batch predictions in Redis
- A/B testing: 5% of traffic routed to challenger model via feature flags
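The two-stage ranking decision above can be sketched in a few lines. Both stages here are hypothetical stand-ins (`generate_candidates` for the ANN index, `model_score` for a GPU forward pass); the point is the shape of the computation: cheap retrieval narrows the space, then the expensive model runs only on the survivors.

```python
def generate_candidates(user_id: str, n: int = 500) -> list[int]:
    """Stage 1 stand-in: an ANN index narrows millions of items to ~500."""
    return list(range(n))

def model_score(user_id: str, item: int) -> float:
    """Stand-in for a GPU ranking-model forward pass."""
    return (hash((user_id, item)) % 1000) / 1000.0

def recommend(user_id: str, k: int = 20) -> list[int]:
    candidates = generate_candidates(user_id)
    # Stage 2: run the expensive model only on the 500 candidates,
    # then keep the top k
    ranked = sorted(candidates, key=lambda i: model_score(user_id, i), reverse=True)
    return ranked[:k]

recs = recommend("user-42")
```

At 50K QPS this structure is what makes GPU ranking affordable: the heavy model sees 500 items per request instead of the full catalog, a reduction of several orders of magnitude.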
Traditional vs. AI System Design: Side-by-Side
| Aspect | Traditional System | AI System |
|---|---|---|
| Testing | Unit tests, integration tests, deterministic assertions | Statistical tests, evaluation datasets, A/B tests on live traffic |
| Deployment | Blue-green, rolling update | Canary with model-specific metrics, shadow mode, gradual rollout |
| Debugging | Stack traces, logs, breakpoints | Feature importance, prediction explanations, data slice analysis |
| Versioning | Git for code | Git for code + DVC/Delta Lake for data + model registry for models |
| Rollback | Deploy previous code version | Revert to previous model version (code may be unchanged) |
| Scaling | Add more CPU instances | Add more GPU instances ($2–$30/hr each), optimize batch sizes |
| Cost driver | Compute (CPU), storage, network | GPU time (10–100x more expensive), data processing, API calls |
Key Takeaways
- AI systems are non-deterministic, data-dependent, expensive, and continuously evolving — each property demands specific architecture patterns
- Every production AI system needs four components: data pipeline, training infrastructure, model serving, and monitoring
- The data pipeline is typically the largest and most complex component — treat it as application logic, not just infrastructure
- Start every project by identifying which architecture pattern fits (batch, real-time, hybrid, or embedding-based)
- Monitor model-specific metrics (drift, accuracy, feature distributions) alongside standard service metrics
Lilly Tech Systems