How AI Systems Differ from Traditional Systems
Traditional software is deterministic: same input, same output, every time. AI systems break this contract in fundamental ways that affect every architecture decision you make. This lesson covers what changes and why it matters for system design.
The Four Fundamental Differences
When you move from traditional software to AI-powered systems, four properties change simultaneously. Each one creates architecture challenges that traditional patterns do not address.
1. Non-Deterministic Outputs
A REST API that fetches a user record always returns the same data for the same user ID. An LLM given the same prompt may generate different text every time. A classification model may change its prediction after retraining on new data. You cannot write unit tests the same way. You need statistical testing, evaluation pipelines, and output validation layers.
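A minimal sketch of what statistical testing looks like in practice: instead of asserting an exact output string, you assert an aggregate metric over a labelled evaluation set. The `classify` function below is a hypothetical stand-in for a real model call (with a pinned random seed so the sketch is reproducible), not a real API.

```python
import random

random.seed(0)  # pinned for reproducibility in this sketch

def classify(text: str) -> str:
    """Stand-in for a real model; real outputs vary run to run."""
    return "positive" if "good" in text or random.random() > 0.9 else "negative"

# Small labelled evaluation set
eval_set = [
    ("good product, fast shipping", "positive"),
    ("good value for the price", "positive"),
    ("arrived broken", "negative"),
    ("never buying again", "negative"),
]

def evaluate(model, dataset) -> float:
    correct = sum(1 for text, label in dataset if model(text) == label)
    return correct / len(dataset)

# Statistical assertion: the suite passes if aggregate accuracy clears
# a threshold, not if every output matches an exact expected string.
accuracy = evaluate(classify, eval_set)
assert accuracy >= 0.75, f"accuracy {accuracy:.2f} below threshold"
```

The threshold (0.75 here) is a product decision: tight enough to catch regressions, loose enough to tolerate normal output variance.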
2. Data-Dependent Behavior
In traditional systems, behavior is defined by code. In AI systems, behavior is defined by data. If your training data is biased, your model is biased. If your data distribution shifts, your model degrades silently. This means your data pipeline is not just infrastructure — it is part of your application logic.
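A toy illustration of that point: the "training" function below is deliberately trivial (it just computes a mean threshold), but it shows how the same code path produces different behavior depending entirely on the data it was fed.

```python
def train_mean_threshold(examples: list[float]) -> float:
    """'Training' here is just computing a threshold from data:
    the model's behavior is entirely a function of what it was fed."""
    return sum(examples) / len(examples)

def predict(threshold: float, x: float) -> str:
    return "high" if x > threshold else "low"

# Same code, two datasets, two different behaviors.
model_a = train_mean_threshold([1.0, 2.0, 3.0])     # threshold 2.0
model_b = train_mean_threshold([10.0, 20.0, 30.0])  # threshold 20.0

print(predict(model_a, 5.0))  # high
print(predict(model_b, 5.0))  # low
```

Swap the training set and the deployed behavior flips, with no code change and no deploy — which is why data validation belongs in the same review pipeline as code.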
3. Computationally Expensive
A traditional API call costs fractions of a cent and takes milliseconds. A single LLM inference call can cost $0.01–$0.10 and take 1–30 seconds. Training a model can cost $100–$10M. These cost differences of 100–10,000x demand fundamentally different architecture approaches to caching, batching, and resource allocation.
4. Continuous Evolution
Traditional software changes when developers deploy new code. AI systems change when data changes, models are retrained, or the world shifts. You need infrastructure for model versioning, A/B testing between model versions, gradual rollouts, and automated monitoring that detects when a model starts performing worse.
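A gradual rollout can be as simple as deterministic per-user bucketing between a stable and a challenger model version. This is an illustrative sketch (the version strings and 5% split are assumptions for the example, not a prescribed setup):

```python
import hashlib

MODEL_VERSIONS = {"stable": "rec-v2.3.1", "challenger": "rec-v2.4.0"}
CHALLENGER_PCT = 5  # gradual rollout: send ~5% of users to the new model

def route_model(user_id: str) -> str:
    """Deterministic per-user bucketing: a given user always hits the
    same model version, which keeps A/B metrics clean."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    key = "challenger" if bucket < CHALLENGER_PCT else "stable"
    return MODEL_VERSIONS[key]

assert route_model("user-123") == route_model("user-123")  # stable assignment
```

Hashing the user ID (rather than randomizing per request) matters: a user who flips between model versions mid-session pollutes both arms of the experiment.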
Key Components of a Production AI System
Every production AI system, regardless of whether it uses computer vision, NLP, or recommendation algorithms, contains these core components. Understanding them is the foundation for all architecture decisions in this course.
1. Data Pipeline
The data pipeline ingests, transforms, validates, and stores data for both training and inference. In production, this is typically the largest and most complex component.
# Typical data pipeline stages for a recommendation system
pipeline_stages = {
    "ingestion": {
        "sources": ["clickstream_kafka", "user_profiles_postgres", "product_catalog_api"],
        "frequency": "real-time for clicks, hourly for profiles, daily for catalog",
        "volume": "~50M click events/day, ~2M profile updates/day"
    },
    "transformation": {
        "feature_engineering": ["click_rate_7d", "category_affinity", "session_duration_avg"],
        "joins": "click events JOIN user profiles JOIN product metadata",
        "output": "feature vectors for training and serving"
    },
    "validation": {
        "schema_checks": "all required fields present, types correct",
        "distribution_checks": "feature means within 2 stddev of historical",
        "freshness_checks": "no data older than 2 hours in real-time path"
    },
    "storage": {
        "training_data": "S3/GCS parquet files, partitioned by date",
        "feature_store": "Redis for online features, BigQuery for offline features",
        "metadata": "ML metadata store (experiment tracking)"
    }
}
2. Training Infrastructure
The training pipeline takes prepared data and produces model artifacts. In production, this runs on a schedule or is triggered by data quality signals.
- Experiment tracking: Record hyperparameters, metrics, and artifacts for every training run (MLflow, Weights & Biases, Neptune)
- Compute orchestration: Allocate and release GPU resources on demand (Kubernetes, SageMaker, Vertex AI)
- Model registry: Version and store trained model artifacts with metadata (MLflow Model Registry, Vertex AI Model Registry)
- Validation gates: Automated checks that prevent bad models from reaching production (accuracy thresholds, bias checks, latency benchmarks)
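A validation gate can be sketched as a pure function over candidate and baseline metrics; the thresholds below are illustrative assumptions, not recommended values:

```python
def validation_gate(candidate: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Automated checks run before a model is promoted to the registry.
    Thresholds here are illustrative; real ones are product-specific."""
    failures = []
    if candidate["accuracy"] < baseline["accuracy"] - 0.01:
        failures.append("accuracy regressed more than 1 point vs. baseline")
    if candidate["p99_latency_ms"] > 100:
        failures.append("p99 latency exceeds the 100ms serving budget")
    if candidate["bias_gap"] > 0.05:
        failures.append("subgroup performance gap above 5%")
    return len(failures) == 0, failures

baseline = {"accuracy": 0.91, "p99_latency_ms": 80, "bias_gap": 0.02}
candidate = {"accuracy": 0.93, "p99_latency_ms": 140, "bias_gap": 0.02}

passed, reasons = validation_gate(candidate, baseline)
# passed is False: the candidate is more accurate but too slow to promote
```

Note the key property: a model can improve on the headline metric and still fail the gate, because serving constraints and fairness checks are evaluated independently.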
3. Model Serving
The serving layer loads trained models and handles inference requests. This is where latency, throughput, and cost constraints collide.
# Simplified model serving architecture
# (load_model, FeatureStoreClient, the error types, and the private helper
# methods are assumed to exist elsewhere; this sketch shows the control flow)
class ModelServer:
    def __init__(self):
        self.model = load_model("s3://models/rec-v2.3.1/model.pt")
        self.feature_store = FeatureStoreClient(endpoint="redis:6379")
        self.fallback_model = load_model("s3://models/rec-v1-simple/model.pt")

    async def predict(self, user_id: str, context: dict) -> list:
        try:
            # Fetch real-time features (p99 latency: 5ms)
            features = await self.feature_store.get_features(
                entity_id=user_id,
                feature_names=["click_rate_7d", "category_affinity", "session_recency"]
            )
            # Run inference (p99 latency: 25ms on GPU)
            predictions = self.model.predict(features, context)
            return self._postprocess(predictions)
        except ModelTimeoutError:
            # Fall back to a simpler, faster model
            return await self._fallback_predict(user_id, context)
        except FeatureStoreError:
            # Serve with default features if the feature store is down
            return await self._predict_with_defaults(user_id, context)
4. Monitoring and Observability
AI systems need monitoring beyond traditional application metrics. You must track model-specific signals that indicate degradation.
| Metric Category | What to Monitor | Alert Threshold Example |
|---|---|---|
| Model performance | Accuracy, precision, recall, NDCG on live traffic | NDCG drops >5% vs. baseline over 1 hour |
| Data quality | Feature distributions, missing values, schema violations | >1% null rate on required features |
| Serving health | Latency (p50/p95/p99), error rate, throughput (QPS) | p99 latency >100ms, error rate >0.1% |
| Data drift | Input feature distributions vs. training distribution | KL divergence >0.1 on any feature |
| Business metrics | Click-through rate, conversion, revenue per session | CTR drops >10% vs. previous week |
| Cost | GPU utilization, inference cost per request, total spend | Cost per request >$0.05 or utilization <40% |
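The drift row in the table above can be implemented with a few lines of plain Python. This sketch compares histograms of a feature's live values against its training values using KL divergence; the toy data and the 5-bin histogram are assumptions for the example.

```python
import math
from collections import Counter

def histogram(values, bins):
    """Bucket values in [0, 1) into equal-width bins, normalized to sum to 1."""
    counts = Counter(min(int(v * bins), bins - 1) for v in values)
    return [counts.get(i, 0) / len(values) for i in range(bins)]

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) between two discrete histograms over the same bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy data: the live feature has clearly shifted upward vs. training
training_values = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6]
live_values     = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9]

drift = kl_divergence(histogram(live_values, 5), histogram(training_values, 5))
drift_alert = drift > 0.1  # alert threshold from the table above
```

Production monitoring runs this comparison per feature on a schedule, against a frozen snapshot of the training distribution, and routes alerts through the same paging infrastructure as serving-health alarms.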
Common Architecture Patterns Overview
Production AI systems typically follow one of these architecture patterns. We will explore each in detail in later lessons.
Batch Prediction
Run inference on all items periodically (e.g., nightly) and store results. Fast serving (just a lookup), but predictions are stale. Good for: product recommendations, email personalization, risk scoring.
Real-Time Prediction
Run inference on each request as it arrives. Fresh predictions but requires low-latency serving infrastructure and GPU resources. Good for: search ranking, fraud detection, chatbots.
Hybrid (Near-Real-Time)
Precompute base predictions in batch, then adjust with real-time signals at serving time. Balances freshness with cost. Good for: news feeds, ad ranking, dynamic pricing.
Embedding-Based
Precompute embeddings for all items/users, then use approximate nearest neighbor search at serving time. Extremely fast and scalable. Good for: similar item recommendations, semantic search, content deduplication.
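Conceptually, the embedding-based pattern is just "precompute vectors, then find nearest neighbors at serving time." The sketch below uses brute-force exact cosine similarity over random embeddings as a stand-in; a production system would replace the matrix multiply with an approximate index such as FAISS or ScaNN.

```python
import numpy as np

rng = np.random.default_rng(42)

# Precomputed, unit-normalized item embeddings
# (in production: millions of items behind an ANN index)
item_embeddings = rng.normal(size=(1000, 64)).astype(np.float32)
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

def similar_items(query_idx: int, k: int = 5) -> list[int]:
    """Exact nearest-neighbor lookup by cosine similarity.
    ANN indexes trade a little recall for sub-millisecond search."""
    scores = item_embeddings @ item_embeddings[query_idx]
    top = np.argsort(-scores)[: k + 1]  # +1: the item matches itself
    return [int(i) for i in top if i != query_idx][:k]

neighbors = similar_items(7)
```

The serving-time work is one dot product per candidate plus a sort, which is why this pattern scales so well: all the expensive model inference happened offline when the embeddings were computed.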
Real Example: Production Recommendation System Architecture
Here is how a real e-commerce recommendation system is architected at scale. This example handles 50,000 requests per second with p99 latency under 100ms.
Architecture: E-Commerce Recommendation System (50K QPS)
User Request
     |
     v
[API Gateway / Load Balancer]
     |
     v
[Recommendation Service] ----------------> [Feature Store (Redis)]
     |                                            ^
     |                                            |
     |--- Real-time features ------------> [Streaming Pipeline (Kafka + Flink)]
     |                                            ^
     |                                            |
     |                                     [Clickstream Events]
     |
     |--- Candidate Generation ----------> [ANN Index (FAISS/ScaNN)]
     |                                            ^
     |                                            |
     |                                     [Embedding Model (batch updated daily)]
     |
     |--- Ranking Model -----------------> [GPU Inference Cluster (Triton)]
     |                                            ^
     |                                            |
     |                                     [Training Pipeline (daily retrain)]
     |                                            ^
     |                                            |
     |                                     [Feature Store (BigQuery - offline)]
     |
     |--- Business Rules ----------------> [Rule Engine]
     |        (filter blocked items,
     |         apply diversity rules,
     |         enforce inventory)
     |
     v
[Response: ranked list of 20 items]
Key Design Decisions:
- Two-stage ranking: ANN retrieves 500 candidates, GPU model ranks top 20
- Feature store split: Redis for real-time (session data), BigQuery for batch (user history)
- Daily model retrain with hourly feature updates balances freshness vs. cost
- Fallback: if GPU cluster is down, serve from precomputed batch predictions in Redis
- A/B testing: 5% of traffic routed to challenger model via feature flags
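The two-stage ranking decision above can be sketched in a few lines. Both stages here are hypothetical stand-ins (`generate_candidates` for the ANN index, `model_score` for a GPU forward pass); the point is the shape of the computation: cheap retrieval narrows the space, then the expensive model runs only on the survivors.

```python
def generate_candidates(user_id: str, n: int = 500) -> list[int]:
    """Stage 1 stand-in: an ANN index narrows millions of items to ~500."""
    return list(range(n))

def model_score(user_id: str, item: int) -> float:
    """Stand-in for a GPU ranking-model forward pass."""
    return (hash((user_id, item)) % 1000) / 1000.0

def recommend(user_id: str, k: int = 20) -> list[int]:
    candidates = generate_candidates(user_id)
    # Stage 2: run the expensive model only on the 500 candidates,
    # then keep the top k
    ranked = sorted(candidates, key=lambda i: model_score(user_id, i), reverse=True)
    return ranked[:k]

recs = recommend("user-42")
```

At 50K QPS this structure is what makes GPU ranking affordable: the heavy model sees 500 items per request instead of the full catalog, a reduction of several orders of magnitude.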
Traditional vs. AI System Design: Side-by-Side
| Aspect | Traditional System | AI System |
|---|---|---|
| Testing | Unit tests, integration tests, deterministic assertions | Statistical tests, evaluation datasets, A/B tests on live traffic |
| Deployment | Blue-green, rolling update | Canary with model-specific metrics, shadow mode, gradual rollout |
| Debugging | Stack traces, logs, breakpoints | Feature importance, prediction explanations, data slice analysis |
| Versioning | Git for code | Git for code + DVC/Delta Lake for data + model registry for models |
| Rollback | Deploy previous code version | Revert to previous model version (code may be unchanged) |
| Scaling | Add more CPU instances | Add more GPU instances ($2–$30/hr each), optimize batch sizes |
| Cost driver | Compute (CPU), storage, network | GPU time (10–100x more expensive), data processing, API calls |
Key Takeaways
- AI systems are non-deterministic, data-dependent, expensive, and continuously evolving — each property demands specific architecture patterns
- Every production AI system needs four components: data pipeline, training infrastructure, model serving, and monitoring
- The data pipeline is typically the largest and most complex component — treat it as application logic, not just infrastructure
- Start every project by identifying which architecture pattern fits (batch, real-time, hybrid, or embedding-based)
- Monitor model-specific metrics (drift, accuracy, feature distributions) alongside standard service metrics
Lilly Tech Systems