Cost-Aware AI Architecture
AI infrastructure costs can spiral from $1K/month to $100K/month without deliberate architecture decisions. This lesson gives you the frameworks and real numbers to design systems that deliver results without bankrupting your organization.
GPU Cost Modeling
For the same GPU, pricing varies 3–4x depending on how you purchase it, and the spread across GPU types is far wider. Understanding the pricing model is the first step to cost optimization.
| GPU | VRAM | On-Demand ($/hr) | Spot/Preemptible ($/hr) | 1-Year Reserved ($/hr) | 3-Year Reserved ($/hr) |
|---|---|---|---|---|---|
| T4 | 16GB | $0.53 | $0.16 | $0.33 | $0.22 |
| A10G | 24GB | $1.01 | $0.30 | $0.63 | $0.42 |
| L4 | 24GB | $0.81 | $0.24 | $0.51 | $0.34 |
| A100 40GB | 40GB | $3.67 | $1.10 | $2.29 | $1.53 |
| A100 80GB | 80GB | $4.90 | $1.47 | $3.06 | $2.04 |
| H100 | 80GB | $8.80 | $2.64 | $5.50 | $3.67 |
Prices are approximate AWS/GCP rates as of early 2026 and vary by region and availability.
Monthly Cost Calculator
```python
# GPU monthly cost calculator
def estimate_monthly_cost(
    gpu_type: str,
    num_gpus: int,
    pricing: str = "on_demand",  # on_demand, spot, reserved_1y, reserved_3y
    utilization_hours_per_day: float = 24,  # 24 = always on
    days_per_month: int = 30,
) -> float:
    prices = {
        "T4": {"on_demand": 0.53, "spot": 0.16, "reserved_1y": 0.33, "reserved_3y": 0.22},
        "A10G": {"on_demand": 1.01, "spot": 0.30, "reserved_1y": 0.63, "reserved_3y": 0.42},
        "L4": {"on_demand": 0.81, "spot": 0.24, "reserved_1y": 0.51, "reserved_3y": 0.34},
        "A100_40": {"on_demand": 3.67, "spot": 1.10, "reserved_1y": 2.29, "reserved_3y": 1.53},
        "A100_80": {"on_demand": 4.90, "spot": 1.47, "reserved_1y": 3.06, "reserved_3y": 2.04},
        "H100": {"on_demand": 8.80, "spot": 2.64, "reserved_1y": 5.50, "reserved_3y": 3.67},
    }
    hourly = prices[gpu_type][pricing]
    return hourly * utilization_hours_per_day * days_per_month * num_gpus

# Example: 8 A10G GPUs, on-demand, 24/7
print(estimate_monthly_cost("A10G", 8, "on_demand"))  # $5,818/month

# Same setup with spot instances (for training)
print(estimate_monthly_cost("A10G", 8, "spot"))  # $1,728/month (70% savings)

# Same setup with 1-year reserved (for serving)
print(estimate_monthly_cost("A10G", 8, "reserved_1y"))  # $3,629/month (38% savings)
```
Inference Cost Optimization
These four techniques can reduce your inference cost by 2–10x with minimal quality loss.
Quantization
- What: Reduce model precision from FP32 to FP16/INT8/INT4.
- Impact: FP16 = 2x faster, 50% less memory. INT8 = 2–4x faster, 75% less memory. INT4 = 4–8x faster, but quality degrades.
- Quality loss: FP16: none. INT8: <1%. INT4: 1–5%.
- Tools: GPTQ, AWQ, and bitsandbytes for LLMs; TensorRT for CNNs.
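The memory arithmetic behind these numbers is simple: bytes per parameter times parameter count. A minimal sketch of the savings (weight memory only; activations and KV cache add overhead on top):

```python
# Bytes per weight at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight memory in GB for a given parameter count."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 7B-parameter model:
for p in ("fp32", "fp16", "int8", "int4"):
    print(f"{p}: {weight_memory_gb(7e9, p):.1f} GB")
# fp32: 28.0 GB → fp16: 14.0 GB → int8: 7.0 GB → int4: 3.5 GB
```

This is why INT8 quantization often lets a model drop to a GPU tier with half the VRAM, compounding the per-hour savings from the pricing table.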
Knowledge Distillation
- What: Train a small "student" model to mimic a large "teacher" model.
- Impact: 3–10x smaller model with 90–95% of the quality. Example: DistilBERT is 40% smaller and 60% faster than BERT while retaining 97% of its accuracy.
- When: You have a working large model and need to reduce serving costs.
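The core training signal is a divergence between temperature-softened teacher and student outputs. A minimal pure-Python sketch of the Hinton-style distillation loss (function names are ours; real training would compute this over batches of logits in PyTorch):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

In practice this term is blended with the ordinary cross-entropy loss on the true labels.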
Caching
- What: Store and reuse previous inference results.
- Impact: 30–60% cost reduction at typical hit rates.
- Cost: Redis at $0.10/GB/month vs. a GPU at $1+/hr; a 10GB Redis cache costs $1/month and eliminates thousands of GPU inference calls.
- See: Lesson 5 for implementation details.
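A minimal in-memory sketch of the idea, keyed on a hash of the normalized query (a production deployment would use Redis with a TTL; `InferenceCache` and its methods are illustrative names, not a library API):

```python
import hashlib

class InferenceCache:
    """In-memory stand-in for a Redis result cache."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Normalize so trivial variants ("Hi " vs "hi") share one entry
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get_or_compute(self, query, compute_fn):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute_fn(query)  # the expensive GPU call
        self._store[key] = result
        return result
```

Every hit replaces a GPU inference costing orders of magnitude more than the lookup, which is where the 30–60% reduction comes from.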
Model Pruning
- What: Remove weights/neurons that contribute least to model output.
- Impact: 20–50% reduction in model size with <1% quality loss.
- When: Combined with quantization for maximum compression.
- Tools: torch.nn.utils.prune, Neural Magic SparseML.
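The heart of unstructured magnitude pruning fits in a few lines: zero the weights with the smallest absolute values. A toy sketch on a plain list (real models would apply torch.nn.utils.prune to weight tensors):

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with the smallest
    absolute value (unstructured magnitude pruning)."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[: int(len(weights) * sparsity)]:
        pruned[i] = 0.0
    return pruned

print(magnitude_prune([0.1, -2.0, 0.05, 3.0], sparsity=0.5))
# [0.0, -2.0, 0.0, 3.0] — the two smallest-magnitude weights are zeroed
```

Note that the size reduction only materializes if the serving stack stores and executes the weights sparsely; dense storage of zeros saves nothing.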
Build vs. Buy Decision Framework
One of the most consequential architecture decisions: build your own AI infrastructure or use managed services/APIs?
| Factor | Build (Self-Host) | Buy (API/Managed Service) |
|---|---|---|
| Upfront cost | High ($50K–$500K engineering time) | Low (pay per API call) |
| Marginal cost at scale | Low ($0.001–$0.01 per request) | High ($0.01–$0.10 per request) |
| Break-even point | Wins above ~100K–1M requests/day | Wins below ~100K–1M requests/day |
| Time to production | 3–6 months | 1–4 weeks |
| Team required | ML engineers + MLOps + infra | Application developers only |
| Data privacy | Full control, data stays internal | Data sent to third party |
| Customization | Full control over model, features, pipeline | Limited to API parameters |
| Maintenance burden | High (GPU management, model updates, monitoring) | Low (provider handles infrastructure) |
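The break-even volume can be computed directly: self-hosting wins once its fixed monthly cost is amortized over the per-request savings. A sketch with illustrative numbers (assumptions, not vendor quotes):

```python
def break_even_requests_per_day(
    self_host_fixed_monthly: float,  # GPU reservations + amortized engineering time
    self_host_per_request: float,
    api_per_request: float,
) -> float:
    """Daily request volume above which self-hosting beats the API."""
    per_request_saving = api_per_request - self_host_per_request
    if per_request_saving <= 0:
        return float("inf")  # the API is never beaten on marginal cost
    return self_host_fixed_monthly / per_request_saving / 30

# Illustrative: $20K/month fixed cost vs. a $0.01 API call
# with a $0.003 marginal self-host cost
print(f"{break_even_requests_per_day(20_000, 0.003, 0.01):,.0f} req/day")  # 95,238 req/day
```

Plugging in your own fixed costs and per-request prices shows why the break-even lands in the 100K–1M requests/day range for typical LLM workloads.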
Multi-Tier Serving: Expensive Model for Hard Queries, Cheap for Easy
Not all queries are equally difficult. Route easy queries to cheap models and reserve expensive models for hard queries.
```python
# Multi-tier serving architecture
# Saves 40-60% on inference costs by routing intelligently
import hashlib

def query_hash(query: str) -> str:
    # Normalize before hashing so trivial variants hit the same cache entry
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

class MultiTierRouter:
    def __init__(self, cache, complexity_classifier, small_model, medium_model, large_model):
        # Injected dependencies: a cache client plus one model per tier
        self.cache = cache
        self.complexity_classifier = complexity_classifier
        self.small_model = small_model
        self.medium_model = medium_model
        self.large_model = large_model
        self.tiers = {
            "tier1_cache": {"cost_per_req": 0.00001, "latency_ms": 2},
            "tier2_small": {"cost_per_req": 0.0005, "latency_ms": 10, "model": "distilbert"},
            "tier3_medium": {"cost_per_req": 0.003, "latency_ms": 50, "model": "gpt-4o-mini"},
            "tier4_large": {"cost_per_req": 0.03, "latency_ms": 500, "model": "gpt-4o"},
        }

    async def route_and_predict(self, query: str) -> dict:
        # Tier 1: Exact cache match (handles ~30% of traffic)
        cached = await self.cache.get(query_hash(query))
        if cached:
            return {"result": cached, "tier": 1, "cost": 0.00001}

        # Tier 2: Small model for easy queries (handles ~40% of traffic)
        # Examples: "What's your return policy?" → FAQ lookup
        complexity = self.complexity_classifier.predict(query)
        if complexity == "simple":
            result = await self.small_model.predict(query)
            if result.confidence > 0.95:
                return {"result": result, "tier": 2, "cost": 0.0005}
            # Low confidence falls through to the next tier

        # Tier 3: Medium model for moderate queries (handles ~20% of traffic)
        result = await self.medium_model.predict(query)
        if result.confidence > 0.85:
            return {"result": result, "tier": 3, "cost": 0.003}

        # Tier 4: Large model for complex queries (handles ~10% of traffic)
        result = await self.large_model.predict(query)
        return {"result": result, "tier": 4, "cost": 0.03}

# Cost comparison at 1M requests/day:
# Without multi-tier: 1M × $0.03 = $30,000/day
# With multi-tier:
#   300K × $0.00001 (cache)  =     $3/day
#   400K × $0.0005  (small)  =   $200/day
#   200K × $0.003   (medium) =   $600/day
#   100K × $0.03    (large)  = $3,000/day
# Total: $3,803/day (87% savings)
```
Cost Monitoring and Budgeting
```python
# Cost monitoring dashboard metrics (export to Grafana/Datadog)
cost_metrics = {
    # Per-request metrics
    "inference_cost_per_request": {
        "description": "Cost of each inference call",
        "calculation": "gpu_hour_cost / requests_served_per_hour",
        "alert": "Avg cost per request > $0.05 for 1 hour",
    },
    # GPU utilization
    "gpu_utilization_percent": {
        "description": "Percentage of GPU compute being used",
        "target": "65-80%",
        "alert_low": "< 40% for 30 min → consider scaling down",
        "alert_high": "> 90% for 10 min → consider scaling up",
    },
    # Daily/monthly spend
    "daily_gpu_spend": {
        "description": "Total GPU cost for the day",
        "calculation": "sum(gpu_hours × hourly_rate) across all instances",
        "alert": "Daily spend > 120% of budget",
    },
    # Cost by model tier
    "cost_by_tier": {
        "description": "Spending breakdown by model tier",
        "alert": "Tier 4 (expensive) usage > 15% of total requests",
    },
    # Cost per business outcome
    "cost_per_conversion": {
        "description": "AI infrastructure cost per user conversion",
        "calculation": "total_ai_cost / conversions_attributed_to_ai",
        "target": "< 5% of revenue per conversion",
    },
    # Waste detection
    "idle_gpu_hours": {
        "description": "GPU hours with < 10% utilization",
        "alert": "Any GPU idle > 10% for 2+ hours",
    },
}

# Budget guardrails
budget_config = {
    "daily_hard_limit": 5000,           # Kill non-essential jobs above this
    "daily_soft_limit": 4000,           # Alert and reduce batch jobs
    "monthly_budget": 120000,           # Total monthly budget
    "emergency_shutdown_percent": 150,  # Auto-shutdown at 150% of budget
}
```
Real Cost Comparison: Self-Hosted vs. API
| Scenario | Volume | OpenAI API Cost | Self-Hosted Cost | Savings | Recommendation |
|---|---|---|---|---|---|
| Chatbot (startup) | 1K req/day | $90/month | $800/month (1 GPU + eng time) | -$710 | Use API |
| Chatbot (growth) | 50K req/day | $4,500/month | $2,500/month (4 GPUs) | +$2,000 | Consider self-host |
| Chatbot (scale) | 500K req/day | $45,000/month | $8,000/month (12 GPUs) | +$37,000 | Self-host |
| Classification | 1M req/day | $15,000/month | $1,500/month (2 GPUs) | +$13,500 | Self-host |
| Embeddings | 10M req/day | $3,000/month | $2,000/month (3 GPUs) | +$1,000 | Either (API is competitive) |
| Image generation | 10K req/day | $12,000/month | $4,000/month (6 GPUs) | +$8,000 | Self-host |
Lilly Tech Systems