Advanced

Cost-Aware AI Architecture

AI infrastructure costs can spiral from $1K/month to $100K/month without deliberate architecture decisions. This lesson gives you the frameworks and real numbers to design systems that deliver results without bankrupting your organization.

GPU Cost Modeling

GPU pricing varies 5–10x depending on how you purchase it. Understanding the pricing model is the first step to cost optimization.

| GPU | VRAM | On-Demand ($/hr) | Spot/Preemptible ($/hr) | 1-Year Reserved ($/hr) | 3-Year Reserved ($/hr) |
|---|---|---|---|---|---|
| T4 | 16GB | $0.53 | $0.16 | $0.33 | $0.22 |
| A10G | 24GB | $1.01 | $0.30 | $0.63 | $0.42 |
| L4 | 24GB | $0.81 | $0.24 | $0.51 | $0.34 |
| A100 40GB | 40GB | $3.67 | $1.10 | $2.29 | $1.53 |
| A100 80GB | 80GB | $4.90 | $1.47 | $3.06 | $2.04 |
| H100 | 80GB | $8.80 | $2.64 | $5.50 | $3.67 |

Prices are approximate AWS/GCP rates as of early 2026 and vary by region and availability.

Spot instance caveat: Spot/preemptible instances save 60–70% but can be terminated with 30–120 seconds notice. Use them for: training jobs (with checkpointing), batch inference, non-latency-critical workers. Do NOT use them for: latency-sensitive serving endpoints where a 2-minute interruption is unacceptable.
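The checkpointing pattern that makes spot instances safe for training can be sketched as follows. This is a minimal illustration using JSON state and an atomic rename; the file path, save interval, and state format are illustrative assumptions, not a specific framework's API:

```python
import json
import os

def save_checkpoint(step: int, state: dict, path: str) -> None:
    # Write to a temp file and rename so a mid-write termination
    # never leaves a corrupt checkpoint behind
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path: str) -> dict:
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "state": {}}

def train(total_steps: int, path: str, checkpoint_every: int = 100) -> dict:
    ckpt = load_checkpoint(path)  # resume from last checkpoint after preemption
    state = ckpt["state"]
    for step in range(ckpt["step"], total_steps):
        state["last_loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        if (step + 1) % checkpoint_every == 0:
            save_checkpoint(step + 1, state, path)
    return state
```

If the instance is preempted, the next spot instance calls `train()` with the same checkpoint path and loses at most `checkpoint_every` steps of work.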

Monthly Cost Calculator

# GPU monthly cost calculator

def estimate_monthly_cost(
    gpu_type: str,
    num_gpus: int,
    pricing: str = "on_demand",  # on_demand, spot, reserved_1y, reserved_3y
    utilization_hours_per_day: float = 24,  # 24 = always on
    days_per_month: int = 30
):
    prices = {
        "T4":       {"on_demand": 0.53, "spot": 0.16, "reserved_1y": 0.33, "reserved_3y": 0.22},
        "A10G":     {"on_demand": 1.01, "spot": 0.30, "reserved_1y": 0.63, "reserved_3y": 0.42},
        "L4":       {"on_demand": 0.81, "spot": 0.24, "reserved_1y": 0.51, "reserved_3y": 0.34},
        "A100_40":  {"on_demand": 3.67, "spot": 1.10, "reserved_1y": 2.29, "reserved_3y": 1.53},
        "A100_80":  {"on_demand": 4.90, "spot": 1.47, "reserved_1y": 3.06, "reserved_3y": 2.04},
        "H100":     {"on_demand": 8.80, "spot": 2.64, "reserved_1y": 5.50, "reserved_3y": 3.67},
    }

    hourly = prices[gpu_type][pricing]
    monthly = hourly * utilization_hours_per_day * days_per_month * num_gpus
    return monthly

# Example: 8 A10G GPUs, on-demand, 24/7
print(estimate_monthly_cost("A10G", 8, "on_demand"))    # $5,818/month

# Same setup with spot instances (for training)
print(estimate_monthly_cost("A10G", 8, "spot"))          # $1,728/month (70% savings)

# Same setup with 1-year reserved (for serving)
print(estimate_monthly_cost("A10G", 8, "reserved_1y"))   # $3,629/month (38% savings)

Inference Cost Optimization

These four techniques can reduce your inference cost by 2–10x with minimal quality loss.

Quantization

What: Reduce model precision from FP32 to FP16/INT8/INT4. Impact: FP16 = 2x faster, 50% less memory. INT8 = 2–4x faster, 75% less memory. INT4 = 4–8x faster but quality degrades. Quality loss: FP16: none. INT8: <1%. INT4: 1–5%. Tool: GPTQ, AWQ, bitsandbytes for LLMs. TensorRT for CNNs.
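To make the memory arithmetic concrete, here is a symmetric INT8 quantize/dequantize roundtrip in numpy. This is a sketch of the basic idea only; production tools like GPTQ and AWQ use far more sophisticated calibration:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Symmetric quantization: map [-max|w|, +max|w|] onto [-127, 127]
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).mean()

print(f"FP32 size: {w.nbytes} bytes, INT8 size: {q.nbytes} bytes")
print(f"Mean absolute roundtrip error: {error:.5f}")
```

The INT8 tensor is exactly 4x smaller than FP32, and the roundtrip error is bounded by half the quantization step, which is where the "<1% quality loss" figure comes from in practice.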

Knowledge Distillation

What: Train a small "student" model to mimic a large "teacher" model. Impact: 3–10x smaller model with 90–95% of the quality. Example: DistilBERT is 60% smaller than BERT with 97% of its accuracy. When: You have a working large model and need to reduce serving costs.
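The standard distillation objective (temperature-softened KL divergence between teacher and student outputs, after Hinton et al.) can be sketched in numpy; the logit shapes and temperature value here are illustrative:

```python
import numpy as np

def softmax(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      T: float = 2.0) -> float:
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradients stay comparable across temperatures
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum(axis=-1)
    return float(kl.mean() * T ** 2)
```

During training this term is typically mixed with the ordinary cross-entropy loss on hard labels, so the student learns both the ground truth and the teacher's "soft" preferences.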

Caching

What: Store and reuse previous inference results. Impact: 30–60% cost reduction at typical hit rates. Cost: Redis at $0.10/GB/month vs. GPU at $1+/hr. A 10GB Redis cache costs $1/month and eliminates thousands of GPU inference calls. See: Lesson 5 for implementation details.
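A minimal in-memory stand-in for that Redis cache, sketching the hash-keyed lookup and LRU eviction; the class name, capacity, and hit/miss counters are illustrative:

```python
import hashlib
from collections import OrderedDict

class InferenceCache:
    """In-memory stand-in for a Redis inference cache, with LRU eviction."""

    def __init__(self, max_entries: int = 10_000):
        self.store: OrderedDict[str, object] = OrderedDict()
        self.max_entries = max_entries
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(query: str) -> str:
        # Stable hash of the query text; normalize casing/whitespace upstream
        return hashlib.sha256(query.encode()).hexdigest()

    def get(self, query: str):
        k = self.key(query)
        if k in self.store:
            self.hits += 1
            self.store.move_to_end(k)  # mark as most recently used
            return self.store[k]
        self.misses += 1
        return None

    def put(self, query: str, result) -> None:
        self.store[self.key(query)] = result
        if len(self.store) > self.max_entries:
            self.store.popitem(last=False)  # evict least recently used
```

Swapping the `OrderedDict` for a Redis client with a TTL gives the production version; the hit/miss counters feed directly into the cost dashboards discussed later.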

Model Pruning

What: Remove weights/neurons that contribute least to model output. Impact: 20–50% reduction in model size with <1% quality loss. When: Combined with quantization for maximum compression. Tool: torch.nn.utils.prune, Neural Magic SparseML.
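Unstructured magnitude pruning can be sketched in a few lines of numpy; real tools like SparseML add structured-sparsity constraints and fine-tuning after pruning, which this sketch omits:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights (0 <= sparsity < 1)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned
```

Sparse weights only reduce serving cost if the runtime exploits them (e.g. sparse kernels); otherwise pruning mainly helps as a pre-step to quantization and distillation.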

Build vs. Buy Decision Framework

One of the most consequential architecture decisions: build your own AI infrastructure or use managed services/APIs?

| Factor | Build (Self-Host) | Buy (API/Managed Service) |
|---|---|---|
| Upfront cost | High ($50K–$500K engineering time) | Low (pay per API call) |
| Marginal cost at scale | Low ($0.001–$0.01 per request) | High ($0.01–$0.10 per request) |
| Break-even point | Cheaper above ~100K–1M requests/day | Cheaper below ~100K–1M requests/day |
| Time to production | 3–6 months | 1–4 weeks |
| Team required | ML engineers + MLOps + infra | Application developers only |
| Data privacy | Full control, data stays internal | Data sent to third party |
| Customization | Full control over model, features, pipeline | Limited to API parameters |
| Maintenance burden | High (GPU management, model updates, monitoring) | Low (provider handles infrastructure) |
💡 The hybrid approach: Start with APIs (OpenAI, Anthropic, Google) to validate the product idea. Once you have proven demand and understand your traffic patterns, selectively move the highest-volume endpoints to self-hosted models. Keep low-volume or complex tasks on APIs. This is what most successful AI startups do.
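The break-even point can be estimated with a small helper. The example figures ($2,500/month fixed self-hosting cost, $0.0005 vs. $0.003 per request) are illustrative assumptions, not quotes from any provider:

```python
def break_even_requests_per_day(
    self_host_fixed_monthly: float,     # GPUs + amortized engineering time
    self_host_cost_per_request: float,  # marginal cost once the cluster exists
    api_cost_per_request: float,
) -> float:
    """Daily volume above which self-hosting becomes cheaper than the API."""
    saving_per_request = api_cost_per_request - self_host_cost_per_request
    if saving_per_request <= 0:
        return float("inf")  # API is cheaper at any volume
    return self_host_fixed_monthly / 30 / saving_per_request

# Example: $2,500/month in fixed costs, saving $0.0025 on each request
print(break_even_requests_per_day(2500, 0.0005, 0.003))
```

Plugging in your own fixed costs and per-request prices tells you which side of the table's break-even row you are on.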

Multi-Tier Serving: Expensive Model for Hard Queries, Cheap for Easy

Not all queries are equally difficult. Route easy queries to cheap models and reserve expensive models for hard queries.

# Multi-tier serving architecture
# Saves 40-60% on inference costs by routing intelligently

class MultiTierRouter:
    def __init__(self, cache, complexity_classifier, small_model, medium_model, large_model):
        # Cache client, classifier, and model clients are injected;
        # each model client must expose an async .predict() method
        self.cache = cache
        self.complexity_classifier = complexity_classifier
        self.small_model = small_model
        self.medium_model = medium_model
        self.large_model = large_model
        self.tiers = {
            "tier1_cache": {"cost_per_req": 0.00001, "latency_ms": 2},
            "tier2_small": {"cost_per_req": 0.0005,  "latency_ms": 10,  "model": "distilbert"},
            "tier3_medium": {"cost_per_req": 0.003,  "latency_ms": 50,  "model": "gpt-4o-mini"},
            "tier4_large": {"cost_per_req": 0.03,    "latency_ms": 500, "model": "gpt-4o"},
        }

    async def route_and_predict(self, query: str) -> dict:
        # Tier 1: Exact cache match (handles ~30% of traffic);
        # query_hash is a stable hash of the normalized query text
        cached = await self.cache.get(query_hash(query))
        if cached:
            return {"result": cached, "tier": 1, "cost": 0.00001}

        # Tier 2: Simple classifier for easy queries (handles ~40% of traffic)
        # Examples: "What's your return policy?" → FAQ lookup
        complexity = self.complexity_classifier.predict(query)
        if complexity == "simple":
            result = await self.small_model.predict(query)
            if result.confidence > 0.95:
                return {"result": result, "tier": 2, "cost": 0.0005}

        # Tier 3: Medium model for moderate queries (handles ~20% of traffic)
        result = await self.medium_model.predict(query)
        if result.confidence > 0.85:
            return {"result": result, "tier": 3, "cost": 0.003}

        # Tier 4: Large model for complex queries (handles ~10% of traffic)
        result = await self.large_model.predict(query)
        return {"result": result, "tier": 4, "cost": 0.03}

# Cost comparison at 1M requests/day:
# Without multi-tier: 1M × $0.03 = $30,000/day
# With multi-tier:
#   300K × $0.00001 (cache)  =     $3/day
#   400K × $0.0005 (small)   =   $200/day
#   200K × $0.003 (medium)   =   $600/day
#   100K × $0.03 (large)     = $3,000/day
#   Total:                    = $3,803/day (87% savings)
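The blended-cost arithmetic in those comments can be reproduced with a small helper:

```python
def blended_daily_cost(mix) -> float:
    # mix: list of (requests_per_day, cost_per_request) pairs, one per tier
    return sum(reqs * cost for reqs, cost in mix)

multi_tier = [
    (300_000, 0.00001),  # tier 1: cache
    (400_000, 0.0005),   # tier 2: small model
    (200_000, 0.003),    # tier 3: medium model
    (100_000, 0.03),     # tier 4: large model
]
single_tier = [(1_000_000, 0.03)]  # everything on the large model

savings = 1 - blended_daily_cost(multi_tier) / blended_daily_cost(single_tier)
print(f"${blended_daily_cost(multi_tier):,.0f}/day vs "
      f"${blended_daily_cost(single_tier):,.0f}/day ({savings:.0%} savings)")
```

Re-running this with your own measured tier mix (from the routing logs) keeps the savings estimate honest as traffic patterns shift.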

Cost Monitoring and Budgeting

# Cost monitoring dashboard metrics (export to Grafana/Datadog)

cost_metrics = {
    # Per-request metrics
    "inference_cost_per_request": {
        "description": "Cost of each inference call",
        "calculation": "gpu_hour_cost / requests_served_per_hour",
        "alert": "Avg cost per request > $0.05 for 1 hour"
    },

    # GPU utilization
    "gpu_utilization_percent": {
        "description": "Percentage of GPU compute being used",
        "target": "65-80%",
        "alert_low": "< 40% for 30 min → consider scaling down",
        "alert_high": "> 90% for 10 min → consider scaling up"
    },

    # Daily/monthly spend
    "daily_gpu_spend": {
        "description": "Total GPU cost for the day",
        "calculation": "sum(gpu_hours × hourly_rate) across all instances",
        "alert": "Daily spend > 120% of budget"
    },

    # Cost by model tier
    "cost_by_tier": {
        "description": "Spending breakdown by model tier",
        "alert": "Tier 4 (expensive) usage > 15% of total requests"
    },

    # Cost per business outcome
    "cost_per_conversion": {
        "description": "AI infrastructure cost per user conversion",
        "calculation": "total_ai_cost / conversions_attributed_to_ai",
        "target": "< 5% of revenue per conversion"
    },

    # Waste detection
    "idle_gpu_hours": {
        "description": "GPU hours with < 10% utilization",
        "alert": "Any GPU idle > 10% for 2+ hours"
    }
}

# Budget guardrails
budget_config = {
    "daily_hard_limit": 5000,       # Kill non-essential jobs above this
    "daily_soft_limit": 4000,       # Alert and reduce batch jobs
    "monthly_budget": 120000,       # Total monthly budget
    "emergency_shutdown_percent": 150  # Auto-shutdown at 150% of budget
}
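The guardrail config above can drive an escalation check like this sketch; the thresholds come from `budget_config`, while the function and action names are illustrative:

```python
budget_config = {
    "daily_hard_limit": 5000,
    "daily_soft_limit": 4000,
    "monthly_budget": 120000,
    "emergency_shutdown_percent": 150,
}

def budget_action(spend_today: float, config: dict) -> str:
    """Map today's spend onto an escalation action, most severe first."""
    daily_budget = config["monthly_budget"] / 30
    if spend_today >= daily_budget * config["emergency_shutdown_percent"] / 100:
        return "emergency_shutdown"       # 150% of daily budget
    if spend_today >= config["daily_hard_limit"]:
        return "kill_non_essential_jobs"
    if spend_today >= config["daily_soft_limit"]:
        return "alert_and_reduce_batch"
    return "ok"
```

In production this check would run on a schedule (e.g. every 15 minutes) against the billing API, with the resulting action wired into your job scheduler and alerting.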

Real Cost Comparison: Self-Hosted vs. API

| Scenario | Volume | OpenAI API Cost | Self-Hosted Cost | Monthly Savings | Recommendation |
|---|---|---|---|---|---|
| Chatbot (startup) | 1K req/day | $90/month | $800/month (1 GPU + eng time) | -$710 | Use API |
| Chatbot (growth) | 50K req/day | $4,500/month | $2,500/month (4 GPUs) | +$2,000 | Consider self-host |
| Chatbot (scale) | 500K req/day | $45,000/month | $8,000/month (12 GPUs) | +$37,000 | Self-host |
| Classification | 1M req/day | $15,000/month | $1,500/month (2 GPUs) | +$13,500 | Self-host |
| Embeddings | 10M req/day | $3,000/month | $2,000/month (3 GPUs) | +$1,000 | Either (API is competitive) |
| Image generation | 10K req/day | $12,000/month | $4,000/month (6 GPUs) | +$8,000 | Self-host |
💡 Apply at work tomorrow: Calculate your current cost per inference request. If you do not know this number, you cannot optimize it. Formula: (total GPU cost + infrastructure cost) / total requests served. Then compare it to API pricing for the same task. The result often surprises teams — sometimes self-hosting is more expensive due to low utilization.
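The formula from that tip as a one-liner. The $1,000/month infrastructure figure and 1.5M requests/month in the example are illustrative assumptions; the GPU figure reuses the A10G cluster cost computed earlier:

```python
def cost_per_request(gpu_cost_monthly: float,
                     infra_cost_monthly: float,
                     requests_per_month: int) -> float:
    # (total GPU cost + infrastructure cost) / total requests served
    return (gpu_cost_monthly + infra_cost_monthly) / requests_per_month

# Example: the $5,818/month A10G cluster from above, plus an assumed
# $1,000/month for networking/storage/monitoring, serving 1.5M requests
print(f"${cost_per_request(5818, 1000, 1_500_000):.4f} per request")
```

Compare the printed number against the API price per request for the same task; that single comparison usually settles the build-vs-buy question for a given endpoint.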