Framework for ML System Design
A structured 4-step approach that works for any ML system design question. Learn how to manage your 45 minutes, what interviewers evaluate at each stage, and the scoring rubric that determines your hire/no-hire decision.
Why ML System Design Is Different
Unlike traditional system design interviews that focus on scaling web services, ML system design interviews evaluate your ability to reason about data, models, metrics, and the unique challenges of deploying machine learning in production. The interviewer wants to see that you can:
- Translate a vague business problem into a concrete ML formulation
- Design data pipelines, feature stores, and training infrastructure
- Choose appropriate model architectures and justify your decisions
- Define meaningful offline and online metrics
- Address serving latency, model freshness, and feedback loops
- Discuss trade-offs rather than presenting a single “right answer”
The 4-Step Framework
This framework works for every ML system design question. The key is allocating your time correctly and going deep in the areas where the interviewer signals interest.
Step 1: Clarify Requirements & Define the Problem (5–8 minutes)
Never jump into architecture. Start by asking clarifying questions to nail down what the system needs to do.
- Who are the users? What is the scale (DAU, QPS, data volume)?
- What is the business objective? How does ML improve it?
- What are the latency requirements? Real-time vs. batch?
- What data is available? Labels? User interactions?
- Are there fairness, privacy, or regulatory constraints?
Then formulate the ML problem clearly:
```python
# Problem formulation template
# Business goal: Increase user engagement on news feed
# ML formulation: Predict P(user clicks post | user, post, context)
# Input: User features, post features, context features
# Output: Click probability score [0, 1]
# Training data: Historical click/impression logs
# Optimization: Binary cross-entropy loss
```
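The optimization line in the template maps to a standard binary cross-entropy loss. A minimal pure-Python sketch of what that objective computes (in practice your ML framework provides this):

```python
import math

def bce_loss(y_true, p_pred, eps=1e-7):
    """Binary cross-entropy averaged over examples.

    y_true: 0/1 click labels; p_pred: predicted click probabilities.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

A confident correct prediction (label 1, p=0.9) contributes a small loss; a confident wrong one (label 1, p=0.1) contributes a large loss, which is exactly the behavior you want when optimizing click probability.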
Step 2: High-Level Architecture (8–10 minutes)
Draw the system architecture showing how data flows from raw inputs to model predictions served to users. Cover these components:
| Component | Purpose | Key Questions |
|---|---|---|
| Data Pipeline | Collect, clean, and store training data | Batch vs. streaming? Schema evolution? Data quality checks? |
| Feature Store | Compute and serve features consistently | Online vs. offline features? Feature freshness? Point-in-time correctness? |
| Training Pipeline | Train and validate models | Training frequency? Distributed training? Hyperparameter tuning? |
| Model Registry | Version, validate, and promote models | A/B test integration? Rollback strategy? Approval workflow? |
| Serving Infrastructure | Serve predictions at scale | Latency budget? Batching? Caching? Fallback models? |
| Monitoring | Track model health and business metrics | Data drift? Model decay? Alert thresholds? |
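Point-in-time correctness, flagged in the Feature Store row, is worth being able to sketch on demand: a training example must only see feature values that existed at the time of the impression, or you leak future information. A minimal illustration, assuming a hypothetical `feature_log` of timestamped values per feature:

```python
from bisect import bisect_right

def point_in_time_lookup(feature_log, event_ts):
    """Return the latest feature value recorded at or before event_ts.

    feature_log: list of (timestamp, value), sorted by timestamp.
    Returns None if no value existed yet at event_ts.
    """
    times = [t for t, _ in feature_log]
    i = bisect_right(times, event_ts)  # first index strictly after event_ts
    return feature_log[i - 1][1] if i > 0 else None
```

Joining training labels against the *current* feature value instead of this point-in-time value is one of the most common sources of offline/online metric gaps.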
Step 3: Deep Dive into Key Components (15–20 minutes)
This is where you spend the most time. The interviewer will usually signal which area to go deep on. Common deep dives include:
Feature Engineering
User features (demographics, history, engagement patterns), item features (content, metadata, popularity), context features (time, device, location), and cross features.
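Cross features are typically hashed into a fixed bucket space so the feature dimensionality stays bounded no matter how many category combinations appear. A sketch of one common scheme (the bucket count and MD5-based hashing here are illustrative choices, not a prescription):

```python
import hashlib

def cross_feature(a, b, num_buckets=1000):
    """Deterministically hash the cross of two categorical values
    (e.g. user_country x item_category) into a fixed bucket id."""
    key = f"{a}_x_{b}".encode("utf-8")
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return digest % num_buckets
```

Using a stable hash (rather than Python's built-in `hash`, which varies across processes) matters here: training and serving must map the same cross to the same bucket.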
Model Architecture
Why this model over alternatives? Two-tower vs. cross-network? How to handle sparse categorical features? Cold start problem?
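For the two-tower option, the key property to articulate is that users and items are embedded by separate networks and scored with a dot product, so item embeddings can be pre-computed and indexed for fast retrieval. A toy sketch of the scoring side (embeddings hard-coded; a real system would use an approximate nearest-neighbor index over millions of items):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(user_emb, item_embs, k=2):
    """Score every item by dot product with the user embedding
    and return the k highest-scoring item ids."""
    scored = sorted(item_embs.items(),
                    key=lambda kv: dot(user_emb, kv[1]),
                    reverse=True)
    return [item_id for item_id, _ in scored[:k]]
```

This separability is exactly what a cross-network gives up: cross features between user and item improve accuracy but force you to score every candidate with the full model.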
Metrics & Evaluation
Offline metrics (AUC, NDCG, precision@k), online metrics (CTR, engagement time, revenue), and how to run A/B tests correctly.
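Offline ranking metrics like precision@k are simple enough to define on a whiteboard from logged data. A minimal sketch:

```python
def precision_at_k(ranked_items, relevant, k):
    """Fraction of the top-k ranked items that are in the relevant set."""
    top = ranked_items[:k]
    return sum(1 for item in top if item in relevant) / k
```

Being able to state precisely what a metric measures (here: quality of the head of the ranking, ignoring order within the top k) makes the later A/B-testing discussion much sharper.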
Serving & Scaling
Latency optimization, model distillation, feature caching, candidate generation + ranking pipeline, and graceful degradation.
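The candidate generation + ranking split, plus graceful degradation, can be sketched as a two-stage pipeline that falls back to a precomputed popularity list if the ranker fails. All function and variable names below are hypothetical placeholders for real services:

```python
def predict(request, candidate_gen, ranker, fallback_popular, k=3):
    """Two-stage serving pipeline.

    candidate_gen: cheap retrieval over the full corpus (e.g. ANN lookup).
    ranker: expensive model scoring only the retrieved candidates,
            returning (item, score) pairs.
    fallback_popular: precomputed popular items used if the ranker fails.
    """
    candidates = candidate_gen(request)
    try:
        scored = ranker(request, candidates)
    except Exception:
        # Graceful degradation: serve popular items rather than an error.
        return fallback_popular[:k]
    return [item for item, _ in sorted(scored, key=lambda x: -x[1])[:k]]
```

The design point to call out: the ranker's latency budget depends only on the candidate count, not the corpus size, and the fallback keeps the product usable during model outages.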
Step 4: Trade-Offs & Extensions (5–7 minutes)
Wrap up by discussing trade-offs you made and potential extensions. This demonstrates senior-level thinking.
- Accuracy vs. latency: “We could use a transformer-based model for better accuracy, but the serving latency would increase from 10ms to 200ms. A distilled model offers a good middle ground.”
- Freshness vs. cost: “Real-time feature computation gives us the freshest signals but costs 10x more than hourly batch updates. We can use a hybrid approach.”
- Complexity vs. maintainability: “A multi-task learning setup could improve performance by 3%, but it makes debugging and iteration significantly harder.”
- Fairness: “We need to audit the model for bias across demographics and implement fairness constraints in the loss function.”
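The distilled-model middle ground from the accuracy-vs-latency trade-off usually means training a small student to match the large teacher's soft outputs rather than the hard labels. A minimal sketch for the binary click case, assuming the distillation target is simply the teacher's predicted probability:

```python
import math

def distill_loss(p_teacher, p_student, eps=1e-7):
    """Cross-entropy of the student's click probability against the
    teacher's soft target (binary case). Minimized when the student
    matches the teacher's output."""
    ps = min(max(p_student, eps), 1 - eps)  # clip to avoid log(0)
    return -(p_teacher * math.log(ps) + (1 - p_teacher) * math.log(1 - ps))
```

The soft targets carry more information per example than 0/1 labels (e.g. "this impression was a near-click"), which is why a much smaller student can recover most of the teacher's accuracy at a fraction of the serving latency.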
Time Management: The 45-Minute Blueprint
| Phase | Time | What to Cover | Common Mistakes |
|---|---|---|---|
| Clarify | 5–8 min | Requirements, constraints, ML formulation | Skipping this and jumping to architecture |
| Architecture | 8–10 min | Data flow, components, high-level diagram | Spending too long on infrastructure details |
| Deep Dive | 15–20 min | Features, model, metrics, serving | Staying too shallow across all areas |
| Trade-Offs | 5–7 min | Alternatives, extensions, limitations | Running out of time before reaching this |
The Scoring Rubric
Most FAANG companies use a rubric similar to this. Understanding it helps you allocate effort to the highest-impact areas.
| Criterion | Weight | Strong Signal | Weak Signal |
|---|---|---|---|
| Problem Formulation | 20% | Clear ML objective, correct loss function, identifies edge cases | Vague problem statement, wrong optimization target |
| System Architecture | 20% | Complete data flow, addresses scale, mentions monitoring | Missing components, no consideration of scale |
| ML Depth | 25% | Thoughtful feature engineering, model justification, training strategy | Black-box model choice, no feature discussion |
| Metrics & Evaluation | 15% | Both offline and online metrics, A/B test design | Only mentions accuracy, no online metrics |
| Trade-Off Analysis | 10% | Multiple alternatives discussed with pros/cons | Presents only one approach as the answer |
| Communication | 10% | Structured, responds to hints, manages time well | Rambles, ignores interviewer signals |
What Interviewers Look For at Each Level
Junior / New Grad (L3–L4)
- Can formulate the problem as an ML task
- Proposes reasonable features and model choice
- Mentions basic offline metrics
- Aware of training/serving split
Mid-Level (L4–L5)
- Everything above, plus:
- Designs complete data and serving pipelines
- Discusses online metrics and A/B testing
- Addresses cold start, data freshness, and model retraining
- Proposes multiple approaches with trade-offs
Senior / Staff (L5–L6+)
- Everything above, plus:
- Considers organizational and cross-team implications
- Designs for iteration speed: experiment framework, feature platform
- Addresses fairness, privacy, and regulatory requirements
- Proposes a phased rollout plan (v1 simple, v2 advanced)
Practice Checklist
Before your interview, make sure you can answer “yes” to each of these:
- ☐ Can I formulate any business problem as an ML optimization objective?
- ☐ Can I draw a complete system architecture in under 10 minutes?
- ☐ Can I list 15+ features for any ML system off the top of my head?
- ☐ Can I explain why I chose this model over 3 alternatives?
- ☐ Can I define both offline and online metrics and explain how to A/B test?
- ☐ Can I discuss at least 3 trade-offs for any design decision?
- ☐ Can I explain how the system handles failure modes and edge cases?
Lilly Tech Systems