Advanced

Design Checklist & Best Practices

This final lesson consolidates everything from the course into actionable checklists, templates, and answers to the questions teams ask most. Print the checklist. Use the review template. Avoid the common mistakes. Ship better AI systems.

Complete AI System Design Checklist

Use this checklist at each phase of your AI project. Each item links back to the relevant lesson.

Phase 1: Requirements (Before Writing Code)

[ ] Problem statement defined with clear success metrics
[ ] Confirmed AI is needed (ruled out rule-based or statistical approaches)
[ ] Input/output contract documented
[ ] Quality metrics and targets defined (accuracy, precision, recall, etc.)
[ ] Latency budgets set: p50=___ms, p95=___ms, p99=___ms
[ ] Throughput estimated: avg QPS=___, peak QPS=___, design QPS=___
[ ] Data volume estimated: training=___GB, inference logs=___GB/month
[ ] Monthly infrastructure budget approved: $___
[ ] Cost per request target defined: $___
[ ] Compliance requirements identified (GDPR, HIPAA, SOC2)
[ ] Explainability requirements documented
[ ] Model update frequency decided
[ ] Stakeholder sign-off on requirements document
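The latency, throughput, and cost blanks above are usually filled with back-of-envelope arithmetic. Here is a minimal sketch of that sizing math; every number (QPS, GPU price, per-replica throughput) is a hypothetical placeholder you would replace with your own measurements:

```python
# Back-of-envelope sizing for the Phase 1 blanks above.
# All numbers are hypothetical placeholders -- substitute your own.

avg_qps = 200                   # average request rate
peak_qps = 800                  # observed or projected peak
design_qps = peak_qps * 2       # headroom: design for 2x peak

gpu_cost_per_hour = 2.50        # assumed price of one GPU instance
requests_per_gpu_per_sec = 50   # measured throughput of one replica

replicas_needed = -(-design_qps // requests_per_gpu_per_sec)  # ceiling division
monthly_gpu_cost = replicas_needed * gpu_cost_per_hour * 24 * 30

monthly_requests = avg_qps * 60 * 60 * 24 * 30
cost_per_request = monthly_gpu_cost / monthly_requests

print(f"replicas: {replicas_needed}")
print(f"monthly GPU cost: ${monthly_gpu_cost:,.0f}")
print(f"cost per request: ${cost_per_request:.6f}")
```

Note the asymmetry: capacity is sized for design QPS (peak with headroom), but cost per request is averaged over actual traffic, which is why low-utilization systems look expensive per request.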

Phase 2: Data Architecture

[ ] Data sources identified and access confirmed
[ ] Data pipeline architecture chosen (batch / streaming / lambda)
[ ] Feature store decision made (build / buy / skip)
[ ] Training-serving skew prevention strategy in place
[ ] Data versioning strategy chosen (DVC / Delta Lake / Iceberg)
[ ] Data quality checks implemented (schema, distribution, freshness)
[ ] Data retention and deletion policies defined
[ ] PII handling documented and implemented
[ ] Offline and online feature stores connected
[ ] Point-in-time correctness verified for training data joins
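The data quality checks item above (schema, distribution, freshness) can start much smaller than a full validation framework. A minimal sketch, with an assumed schema and staleness budget; production systems would typically use a dedicated tool such as Great Expectations and check distributions too:

```python
from datetime import datetime, timedelta, timezone

# Minimal data-quality gate: schema and freshness checks.
# Schema, column names, and staleness budget are illustrative assumptions.

EXPECTED_SCHEMA = {"user_id": int, "item_id": int, "price": float}
MAX_STALENESS = timedelta(hours=24)

def check_schema(row: dict) -> list[str]:
    """Return a list of schema violations for one record."""
    errors = []
    for col, typ in EXPECTED_SCHEMA.items():
        if col not in row:
            errors.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            errors.append(f"{col}: expected {typ.__name__}, got {type(row[col]).__name__}")
    return errors

def check_freshness(last_updated: datetime) -> bool:
    """True if the data was refreshed within the staleness budget."""
    return datetime.now(timezone.utc) - last_updated <= MAX_STALENESS

good = {"user_id": 1, "item_id": 42, "price": 9.99}
bad = {"user_id": "1", "price": 9.99}
print(check_schema(good))  # no violations
print(check_schema(bad))   # wrong type + missing column
```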

Phase 3: Model Development

[ ] Baseline model established (simple model or heuristic)
[ ] Experiment tracking configured (MLflow, W&B, Neptune)
[ ] Training pipeline reproducible (code + data + config = same model)
[ ] Model evaluation on held-out test set AND production-like data
[ ] Bias and fairness checks completed
[ ] Model size fits within serving constraints (memory, latency)
[ ] Model registered in model registry with metadata
[ ] Offline metrics meet quality targets before proceeding
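The reproducibility item above ("code + data + config = same model") can be enforced by fingerprinting all three inputs and tagging each training run with the combined hash, so identical inputs always map to the same run identity. A sketch of the idea; the function and identifiers are illustrative, not any specific tracking tool's API:

```python
import hashlib
import json

# Fingerprint a training run from its three inputs: code revision,
# data hash, and hyperparameter config. Same inputs -> same fingerprint.

def fingerprint(code_rev: str, data_hash: str, config: dict) -> str:
    # Canonicalize the config so key order does not change the hash.
    canonical = json.dumps(config, sort_keys=True)
    payload = f"{code_rev}|{data_hash}|{canonical}".encode()
    return hashlib.sha256(payload).hexdigest()[:12]

run_id_a = fingerprint("git:abc123", "data:v7", {"lr": 3e-4, "epochs": 10})
run_id_b = fingerprint("git:abc123", "data:v7", {"epochs": 10, "lr": 3e-4})
print(run_id_a == run_id_b)  # True: key order does not matter
```

Registering this fingerprint alongside the model (checklist item above) makes "which exact run produced this model?" a lookup instead of an investigation.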

Phase 4: Serving Infrastructure

[ ] Inference server chosen (Triton / TorchServe / vLLM / BentoML)
[ ] Dynamic batching configured and tested
[ ] GPU resource requests and limits set in Kubernetes
[ ] Liveness, readiness, and startup probes configured
[ ] Model load time measured (startup probe timeout set accordingly)
[ ] Auto-scaling configured with appropriate metrics and cooldowns
[ ] Fallback chain implemented (simpler model → rules → cache → static)
[ ] Circuit breaker implemented for model endpoint
[ ] Caching layer implemented (result cache, embedding cache)
[ ] Load testing passed at 2x design QPS
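The fallback chain item above (simpler model → rules → cache → static) reduces to one loop: try each tier in order, and treat an exception or empty result as "degrade to the next tier." A minimal sketch; the tier implementations and their simulated failures are hypothetical:

```python
# Fallback chain sketch: primary model -> simpler model -> rules ->
# cache -> static default. Tier bodies below simulate failures.

def primary_model(x): raise TimeoutError("GPU pool saturated")
def simple_model(x): raise TimeoutError("also overloaded")
def rules(x): return {"score": 0.5, "tier": "rules"}
def cached(x): return None            # cache miss
def static_default(x): return {"score": 0.0, "tier": "static"}

TIERS = [primary_model, simple_model, rules, cached, static_default]

def predict_with_fallback(x):
    """Walk the tiers; a tier 'fails' by raising or returning None."""
    for tier in TIERS:
        try:
            result = tier(x)
            if result is not None:
                return result
        except Exception:
            continue  # degrade to the next tier
    # Unreachable as long as the final tier is a static constant.
    raise RuntimeError("all tiers failed")

print(predict_with_fallback({"user_id": 1}))  # served by the rules tier
```

In production each tier would also feed a circuit breaker (skip a tier that has failed repeatedly) and emit a metric, which is what the fallback-rate dashboards in Phase 6 consume.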

Phase 5: Deployment

[ ] Shadow mode deployment completed (new model alongside old)
[ ] Shadow mode metrics compared and validated
[ ] Canary deployment at 5% traffic with automated rollback
[ ] Gradual rollout plan defined (5% → 25% → 50% → 100%)
[ ] Rollback procedure documented and tested
[ ] Feature flags configured for traffic routing
[ ] Deployment runbook written and reviewed
[ ] On-call rotation established for AI service
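The automated-rollback item above needs one explicit decision function: compare canary metrics against the baseline and return a verdict. A sketch with illustrative thresholds (0.5 percentage points of extra errors, 20% p99 regression); tune these to your own SLA:

```python
# Automated canary gate: roll back if the canary's error rate or p99
# latency regresses past a threshold relative to the baseline.
# Thresholds are illustrative assumptions.

def should_rollback(baseline: dict, canary: dict,
                    max_error_delta: float = 0.005,
                    max_p99_ratio: float = 1.2) -> bool:
    if canary["error_rate"] - baseline["error_rate"] > max_error_delta:
        return True
    if canary["p99_ms"] > baseline["p99_ms"] * max_p99_ratio:
        return True
    return False

baseline = {"error_rate": 0.002, "p99_ms": 180}
healthy = {"error_rate": 0.003, "p99_ms": 190}
degraded = {"error_rate": 0.015, "p99_ms": 310}

print(should_rollback(baseline, healthy))   # False: within tolerance
print(should_rollback(baseline, degraded))  # True: both thresholds breached
```

The same gate runs at each step of the gradual rollout (5% → 25% → 50% → 100%); a True verdict at any step reverts traffic to the previous version.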

Phase 6: Monitoring and Operations

[ ] Model performance metrics dashboarded (accuracy, quality scores)
[ ] Data drift monitoring active (feature distribution tracking)
[ ] Serving metrics dashboarded (latency p50/p95/p99, error rate, QPS)
[ ] Cost monitoring active (GPU utilization, cost per request, daily spend)
[ ] Business metrics tracked (CTR, conversion, revenue impact)
[ ] Alerting configured for all critical metrics
[ ] Incident response playbook written for model degradation
[ ] Model retraining trigger defined (scheduled or drift-based)
[ ] Monthly model review process established
[ ] Cost optimization review quarterly
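For the drift-monitoring item above, one widely used score is the Population Stability Index (PSI) over a binned feature histogram. A self-contained sketch; the conventional (not standardized) reading is PSI < 0.1 stable, 0.1–0.25 moderate drift, > 0.25 significant drift:

```python
import math

# Population Stability Index over pre-binned feature counts.
# Compares the live feature distribution against the training baseline.

def psi(expected_counts, actual_counts, eps=1e-6):
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # clamp to avoid log(0)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

train_bins = [100, 300, 400, 150, 50]    # training-time histogram
stable_bins = [95, 310, 390, 155, 50]    # live traffic, similar shape
shifted_bins = [20, 100, 300, 380, 200]  # live traffic, shifted right

print(round(psi(train_bins, stable_bins), 4))   # small: no drift
print(round(psi(train_bins, shifted_bins), 4))  # large: alert-worthy
```

Computing this per feature on a schedule, and alerting when any feature crosses the drift threshold, covers the "drift-based retraining trigger" item as well.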

Common Mistakes and How to Avoid Them

| Mistake | Why It Happens | How to Avoid It |
| --- | --- | --- |
| No fallback when model is down | Team assumes the model will always be available | Implement a fallback chain on day 1. Test it monthly with chaos engineering. |
| Training-serving skew | Features computed differently in training (Python) vs. serving (Java) | Use a feature store with a single feature definition for both. |
| No data monitoring | Team monitors server metrics but not data quality | Monitor feature distributions and set drift alerts before launch. |
| Over-engineering early | Team builds a Lambda architecture for 100 QPS | Start simple: batch features + a single model. Add complexity when needed. |
| Underestimating costs | Budget covers GPUs only, not data pipelines, storage, monitoring, or engineering | Multiply raw GPU cost by 3–5x for total cost of ownership. |
| Deploying without shadow mode | Pressure to ship fast | Shadow mode takes 2 days to set up and catches 90% of deployment issues. |
| Not versioning data | "We'll figure it out later" | Set up DVC or Delta Lake before writing the first training script. |
| Ignoring cold start | Model works for existing users but fails for new ones | Design a cold-start strategy: popular items, demographic defaults, or content-based fallback. |
| Single model, no routing | Simplicity preference | Add a complexity classifier that routes easy queries to a cheap model. Saves 50%+ on inference costs. |
| No offline evaluation pipeline | Team relies on A/B tests for everything | Build an offline evaluation pipeline that runs on every model change. A/B tests validate, not discover. |
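The routing fix described above (a complexity classifier in front of a cheap and an expensive model) can start as a trivial heuristic. A sketch; the heuristic, model names, and per-request costs are all hypothetical:

```python
# Sketch of a complexity router: send easy queries to a cheap model,
# hard ones to the expensive one. Heuristic and costs are illustrative.

CHEAP_COST, EXPENSIVE_COST = 0.0002, 0.002  # assumed $ per request

def is_easy(query: str) -> bool:
    """Toy heuristic: short queries without 'why' count as easy."""
    return len(query.split()) <= 5 and "why" not in query.lower()

def route(query: str) -> str:
    return "cheap-model" if is_easy(query) else "expensive-model"

queries = [
    "red shoes",
    "laptop under 500",
    "why is my order delayed and who do I contact",
]
routed = [route(q) for q in queries]
print(routed)

blended = sum(CHEAP_COST if m == "cheap-model" else EXPENSIVE_COST for m in routed)
print(f"blended: ${blended:.4f} vs all-expensive: ${EXPENSIVE_COST * len(queries):.4f}")
```

In practice the heuristic is usually replaced by a small trained classifier, but the routing structure stays the same.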

Architecture Review Template

Use this template when reviewing an AI system design (your own or someone else's). Each section should have a clear answer before the design is approved.

# AI Architecture Review Template

## System Overview
- What problem does this system solve?
- What is the expected business impact?
- What are the key metrics for success?

## Data Architecture
- Where does training data come from?
- How is training-serving skew prevented?
- What is the data freshness requirement?
- How is data versioned?
- What happens when data quality degrades?

## Model Architecture
- What model type/architecture is used and why?
- What are the offline evaluation results?
- How does the model handle edge cases and out-of-distribution inputs?
- What is the model size and inference latency?
- How often will the model be retrained?

## Serving Architecture
- Synchronous or asynchronous inference?
- What inference server is used and why?
- What is the dynamic batching configuration?
- What is the auto-scaling strategy?
- What is the deployment strategy (shadow → canary → rollout)?

## Reliability
- What is the fallback chain? (List all tiers)
- Is there a circuit breaker?
- What are the health check levels?
- What is the target SLA?
- What is the disaster recovery plan?

## Cost
- What is the estimated monthly cost?
- What is the cost per request?
- Are there cost optimization opportunities (caching, batching, quantization)?
- What is the cost monitoring strategy?
- What is the break-even analysis (build vs buy)?

## Risks and Mitigations
- What are the top 3 technical risks?
- What are the top 3 operational risks?
- What is the rollback plan?
- Who is on-call for this system?

Team Structure for AI Projects

The right team structure prevents the most common organizational failure: ML engineers build a great model that never makes it to production because nobody owns the serving infrastructure.

Small Team (3–5 people)

Roles: 1–2 ML engineers (model + data), 1 backend engineer (serving + API), 1 platform/DevOps (infra + monitoring). ML engineers own model quality. Backend engineer owns serving reliability. Everyone shares on-call. Best for: Startups, first AI project.

Medium Team (6–12 people)

Roles: 2–3 ML engineers, 1–2 data engineers (pipelines + feature store), 2–3 backend engineers (serving + API + integration), 1 MLOps engineer (training infra + CI/CD + monitoring), 1 PM. Best for: Growth-stage companies, critical AI products.

Large Team (12+ people)

Roles: Dedicated ML research team, ML platform team (feature store, training infra, model registry), serving team (inference, scaling, reliability), data engineering team (pipelines, quality, governance), applied ML team (feature development, A/B testing). Best for: Large companies with multiple AI products.

💡 The most critical hire: If you can only add one person to your AI team, hire an ML engineer who can also do backend engineering. The rarest and most valuable skill in AI is the ability to take a model from notebook to production. Pure researchers who cannot deploy, or pure backend engineers who do not understand ML, create handoff bottlenecks.

Frequently Asked Questions

When should you use batch inference vs. real-time inference?

Use batch inference when: predictions can be stale (recommendations updated daily), the prediction space is bounded (all products for all users), or cost is the primary constraint. Use real-time inference when: the input depends on the current moment (search query, fraud transaction, conversation), the prediction space is unbounded, or freshness directly impacts quality. Rule of thumb: Start with batch. Switch to real-time only for features where staleness measurably hurts business metrics.

Do you actually need a feature store?

You need a feature store when: (1) multiple models share the same features, (2) you have training-serving skew problems, (3) you need both real-time and batch features, or (4) you have more than 50 features. If you have a single model with fewer than 20 features and no training-serving skew, a PostgreSQL table + Redis cache is fine. Do not over-engineer. Many successful ML systems at smaller companies run without a formal feature store.

How do you justify infrastructure investment to leadership?

Frame it in business terms: (1) "We are spending $X/month on GPU infrastructure with Y% utilization. With proper auto-scaling and caching, we can cut this to $X/2." (2) "Our model deployment takes 2 weeks. With CI/CD for ML, it takes 2 days. That means we iterate 5x faster." (3) "We had Z hours of model downtime last quarter because we had no fallback. That cost us $W in lost revenue." Always connect infrastructure investment to revenue, cost savings, or velocity.

What should your first monitoring dashboard contain?

Start with these 6 panels: (1) Request rate (QPS over time), (2) Latency (p50, p95, p99 over time), (3) Error rate (% of failed requests), (4) GPU utilization (% over time), (5) Model quality (your primary metric, e.g., accuracy or CTR, over time), (6) Fallback rate (% of requests served by non-primary tiers). These 6 metrics catch 90% of issues. Add data drift monitoring in month 2.

How do you handle model rollbacks?

Store every model version in a model registry (MLflow, Vertex AI, or even S3 with a naming convention like s3://models/rec-model/v2.3.1/model.pt). Keep the last 5–10 versions. Your deployment config should reference a specific version, and rollback is just updating the config to point to the previous version and redeploying. Automate rollback: if canary metrics degrade beyond threshold, automatically revert to the previous version. Average rollback time should be under 5 minutes.
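Rollback-as-config-change, as described above, can be sketched in a few lines: the deployment config pins one registered version, and rolling back just repoints it at the previous entry. The registry contents and helper are illustrative:

```python
# Rollback = repoint the deployment config at the previous version.
# Registry entries follow the naming convention from the text above.

REGISTRY = [
    "s3://models/rec-model/v2.2.0/model.pt",
    "s3://models/rec-model/v2.3.0/model.pt",
    "s3://models/rec-model/v2.3.1/model.pt",  # currently deployed
]

config = {"model_uri": REGISTRY[-1]}

def rollback(config: dict, registry: list[str]) -> dict:
    """Return a new config pinned to the previous registered version."""
    idx = registry.index(config["model_uri"])
    if idx == 0:
        raise RuntimeError("no earlier version to roll back to")
    return {**config, "model_uri": registry[idx - 1]}

config = rollback(config, REGISTRY)
print(config["model_uri"])  # now pinned to v2.3.0
```

The automated canary gate then becomes: if metrics degrade, call this and redeploy, which is why the whole operation can finish in minutes.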

Should you run on Kubernetes or a managed ML platform?

Use Kubernetes when: you need fine-grained control over GPU scheduling, you are running multiple models with different resource requirements, your team has Kubernetes expertise, or you want to avoid vendor lock-in. Use managed platforms (SageMaker, Vertex AI) when: your team is small (<5 ML engineers), you want to focus on model development rather than infrastructure, or you are already deep in that cloud provider's ecosystem. Managed platforms cost 20–40% more but save 1–2 full-time engineer equivalents in operations.

What is the minimum viable monitoring setup?

At absolute minimum, before going to production: (1) Alert when error rate exceeds 1%. (2) Alert when p99 latency exceeds your SLA. (3) Alert when fallback rate exceeds 10%. (4) Daily email with model quality metrics vs. baseline. That is 4 alerts and 1 daily report. You can set this up in 2 hours with Prometheus + Grafana or Datadog. There is no excuse for launching without at least this level of monitoring.
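The three threshold alerts above (the fourth item, the daily quality email, is a scheduled report rather than an alert) reduce to one evaluation function. A sketch; the SLA value is a placeholder for your own:

```python
# Minimum viable alerting from the answer above: error rate > 1%,
# p99 over SLA, fallback rate > 10%. SLA value is a placeholder.

P99_SLA_MS = 500

def evaluate_alerts(metrics: dict) -> list[str]:
    """Return the list of alert conditions currently firing."""
    alerts = []
    if metrics["error_rate"] > 0.01:
        alerts.append("error rate > 1%")
    if metrics["p99_ms"] > P99_SLA_MS:
        alerts.append(f"p99 latency > {P99_SLA_MS}ms SLA")
    if metrics["fallback_rate"] > 0.10:
        alerts.append("fallback rate > 10%")
    return alerts

healthy = {"error_rate": 0.004, "p99_ms": 320, "fallback_rate": 0.02}
degraded = {"error_rate": 0.03, "p99_ms": 800, "fallback_rate": 0.25}
print(evaluate_alerts(healthy))   # nothing firing
print(evaluate_alerts(degraded))  # all three firing
```

In Prometheus or Datadog these become three alert rules with the same thresholds; the function above is just the logic made explicit.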

Course Summary

💡 You have completed AI System Design Fundamentals. Here is what you can now do:

  • Lesson 1: Explain why AI systems require different architecture patterns than traditional software
  • Lesson 2: Write a complete requirements document for an AI system with real latency, throughput, and cost numbers
  • Lesson 3: Choose between Lambda/Kappa architectures and design a feature store strategy
  • Lesson 4: Select the right inference server, design canary deployments, and write production Kubernetes YAML
  • Lesson 5: Implement caching, batching, and parallelism strategies with real cost/latency trade-off analysis
  • Lesson 6: Build fault-tolerant AI services with fallback chains, circuit breakers, and proper SLAs
  • Lesson 7: Model GPU costs accurately, optimize inference spend, and make build vs. buy decisions
  • Lesson 8: Use the design checklist and architecture review template for every AI project

Next step: Pick one AI system at your company (or one you want to build) and walk through the complete checklist. The gap between where you are today and what the checklist requires is your roadmap for the next 3–6 months.