Best Practices & Checklist
This final lesson consolidates everything into an actionable deployment checklist, a step-by-step migration guide for moving teams from direct API calls to your gateway, multi-region deployment patterns, and answers to the most common questions platform engineers ask when building AI gateways.
Gateway Deployment Checklist
Use this checklist before going live. Each item maps to a lesson in this course:
Infrastructure (Lesson 1)
- Gateway deployed behind a load balancer — minimum 2 instances for high availability
- Redis deployed for state — rate limits, budgets, exact-match cache, with persistence enabled
- PostgreSQL deployed for analytics — cost records, audit logs, usage dashboards
- Health check endpoint — /health returns status of all dependencies (Redis, DB, providers)
- Provider API keys in vault — AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault
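The health-check item above can be sketched as a small aggregator: each dependency exposes a probe callable, and the endpoint reports per-dependency status plus an overall verdict. This is a minimal in-process sketch (the function name and the stub probes are illustrative, not from a specific framework):

```python
def check_health(deps: dict) -> dict:
    """Ping every dependency; overall status is 'ok' only if all pass."""
    results = {}
    for name, ping in deps.items():
        try:
            ping()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
    overall = "ok" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": overall, "dependencies": results}

def failing_probe():
    raise TimeoutError("timed out")

# Stub probes standing in for real Redis / PostgreSQL / provider checks:
report = check_health({
    "redis": lambda: None,       # stub: reachable
    "postgres": lambda: None,    # stub: reachable
    "openai": failing_probe,     # stub: provider timing out
})
```

In production the probes would be real pings (Redis `PING`, a trivial SQL query, a provider `/models` call), and the HTTP layer would map "degraded" to a non-200 status so the load balancer can react.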
Routing (Lesson 2)
- At least 2 providers configured — primary + fallback from different vendors
- Fallback chains defined for every model — e.g., GPT-4o falls back to Claude Sonnet
- Circuit breaker enabled — 3 consecutive failures open the circuit for 30s+
- Health check probe for each provider — periodic GET to /models endpoint
- Request format translation tested — OpenAI to Anthropic and Google formats verified
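The circuit-breaker rule from the checklist (three consecutive failures open the circuit for at least 30 seconds) can be sketched in a few lines. The class and method names here are illustrative, not from a specific library; the clock is injectable so the cooldown can be tested without sleeping:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            self.opened_at = None   # half-open: let one request probe
            self.failures = 0
            return True
        return False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0

# Demo with a fake clock:
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
blocked = breaker.allow()       # circuit open: reject immediately
now[0] = 31.0
reopened = breaker.allow()      # cooldown elapsed: probe allowed
```

When `allow()` returns False, the router skips straight to the next provider in the fallback chain instead of waiting on a known-bad one.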
Rate Limiting (Lesson 3)
- Three-level limits configured — organization, team, and per-app
- Token-based limits enabled — not just requests per minute
- Rate limit headers returned — X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After
- Burst policy defined — reject, queue, or borrow from org pool
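As a reference for the items above, here is an in-memory sketch of a sliding-window request limiter; the production version (Lesson 3) keeps this state in Redis so all gateway instances share it. Names are illustrative:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, limit: int, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits = deque()          # timestamps of accepted requests

    def allow(self) -> bool:
        now = self.clock()
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()      # drop requests outside the window
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False

    def remaining(self) -> int:      # feeds the X-RateLimit-Remaining header
        return max(0, self.limit - len(self.hits))

limiter = SlidingWindowLimiter(limit=3, window=60.0)
results = [limiter.allow() for _ in range(4)]   # fourth call exceeds the limit
```

The same shape extends to token-based limits: instead of appending one timestamp per request, append (timestamp, token_count) pairs and compare the window's token sum to the quota.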
Cost Control (Lesson 4)
- Pricing table updated — matches current provider pricing as of this month
- Daily and monthly budgets set per team — warning at 80%, hard limit at 100%
- Cost tracking recording every request — team, user, model, tokens, cost in PostgreSQL
- Spending dashboard deployed — daily/weekly cost by team, model, user
- Anomaly detection query running — alert when team spends 2x their daily average
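Per-request cost attribution (the record written to PostgreSQL) reduces to a lookup against the pricing table. The prices below are placeholders, not current provider rates; the checklist's point is exactly that this table must be kept in sync with published pricing:

```python
PRICING_PER_1M = {  # USD per 1M tokens: (input, output) -- illustrative values
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICING_PER_1M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# 1,000 input + 500 output tokens on the placeholder gpt-4o rates:
cost = request_cost("gpt-4o", input_tokens=1_000, output_tokens=500)
```

This value, tagged with team, user, and model, is what the anomaly query aggregates into daily averages.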
Security (Lesson 5)
- Gateway keys issued to every team — no team has direct provider keys
- Key rotation policy set — 90-day expiry with 24-hour grace period
- PII filter enabled — blocks SSNs, credit cards, API keys; redacts emails, phones
- Audit logging enabled — every request logged to immutable storage (S3 Object Lock)
- Data residency rules configured — EU data to EU endpoints, PHI to BAA providers
- TLS 1.3 on all connections — including internal services
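The block/redact split in the PII item can be sketched with two pattern sets: hard-block anything that looks like an SSN or card number, redact emails in place. These regexes are deliberately simplified for illustration; a real filter needs more patterns (phone numbers, API-key shapes) plus Luhn validation for card numbers:

```python
import re

BLOCK_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
REDACT_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def filter_prompt(text: str):
    for name, pat in BLOCK_PATTERNS.items():
        if pat.search(text):
            return None, f"blocked: {name}"        # reject the request
    for name, pat in REDACT_PATTERNS.items():
        text = pat.sub(f"[{name.upper()}]", text)  # redact and continue
    return text, "ok"

blocked, reason = filter_prompt("My SSN is 123-45-6789")
redacted, status = filter_prompt("Contact alice@example.com please")
```

Blocked requests should also be written to the audit log with the pattern name (never the matched value itself).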
Performance (Lesson 6)
- Exact match cache enabled — Redis, 1-hour default TTL
- Connection pooling configured — persistent HTTP/2 connections to providers
- Streaming support verified — SSE forwarding tested with all providers
- Cache hit rate monitoring — daily report, alert if drops below 10%
- Gateway latency under 15ms — measured at p95 excluding provider response time
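For the exact-match cache item, the key step is canonicalising the request before hashing so that identical requests always map to one Redis key regardless of JSON field order. A minimal sketch, assuming OpenAI-style request bodies:

```python
import hashlib
import json

def cache_key(request: dict) -> str:
    # sort_keys + compact separators make the serialisation canonical
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return "llmcache:" + hashlib.sha256(canonical.encode()).hexdigest()

a = cache_key({"model": "gpt-4o",
               "messages": [{"role": "user", "content": "hi"}],
               "temperature": 0})
b = cache_key({"temperature": 0,
               "model": "gpt-4o",
               "messages": [{"role": "user", "content": "hi"}]})
# a == b: field order does not change the key
```

Only deterministic requests (temperature 0, no sampling) should be cached this way; anything else belongs to the semantic cache from Lesson 6.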
Migration Guide: Direct API Calls to Gateway
Migrating teams from direct API calls to the gateway is the hardest part. Here is a proven 4-week migration plan:
| Week | Actions | Success Criteria |
|---|---|---|
| Week 1 | Deploy gateway in shadow mode (proxy all requests but also log direct API calls), issue gateway keys to pilot team, set up monitoring dashboard | Pilot team running through gateway with zero errors |
| Week 2 | Enable rate limiting and cost tracking for pilot team, gather feedback, tune limits, fix edge cases (streaming, tool use, multimodal) | Rate limiting and cost tracking accurate within 5% |
| Week 3 | Migrate 2-3 more teams, enable PII filtering, deploy spending dashboard, configure alerts, test failover by blocking primary provider | 50% of API traffic through gateway, failover works |
| Week 4 | Migrate remaining teams, revoke direct provider API keys, enable caching, send weekly cost reports to team leads | 100% of traffic through gateway, direct keys revoked |
The single most important rule: require only a two-line code change, base_url and api_key. Everything else (model names, request format, response format) stays identical. If you require teams to change their code beyond these two lines, adoption will stall.
Multi-Region Deployment
For organizations with global users or strict data residency requirements, deploy the gateway in multiple regions:
```python
# Multi-region gateway architecture
#
# US Users ──> [Gateway US-East] ──> OpenAI (US)
#                                ──> Anthropic (US)
#                                ──> Local LLM (US datacenter)
#
# EU Users ──> [Gateway EU-West] ──> Azure OpenAI (EU)
#                                ──> Anthropic (EU endpoint)
#                                ──> Local LLM (EU datacenter)
#
# Shared state:
# - Redis Cluster (cross-region replication for rate limits)
# - PostgreSQL (primary US, read replica EU for dashboards)
# - S3 (audit logs in region-specific buckets)

REGION_CONFIG = {
    "us-east-1": {
        "providers": ["openai", "anthropic", "google", "local-us"],
        "redis": "redis://redis-us.internal:6379",
        "database": "postgresql://db-us.internal/gateway",
        "audit_bucket": "s3://gateway-audit-us",
    },
    "eu-west-1": {
        "providers": ["azure-openai-eu", "anthropic-eu", "local-eu"],
        "redis": "redis://redis-eu.internal:6379",
        "database": "postgresql://db-eu.internal/gateway",
        "audit_bucket": "s3://gateway-audit-eu",
    },
}

# DNS-based routing: ai-gateway.company.com
#   GeoDNS resolves to nearest region
# OR: explicit region headers from clients
#   X-Gateway-Region: eu-west-1
```
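Region resolution for a request can then be sketched as: honour an explicit X-Gateway-Region header when it names a known region, otherwise fall back to the GeoDNS-assigned default. The region names mirror the config above; the function name and default are illustrative:

```python
REGIONS = {"us-east-1", "eu-west-1"}   # keys of REGION_CONFIG
DEFAULT_REGION = "us-east-1"           # what GeoDNS resolved for this instance

def resolve_region(headers: dict) -> str:
    requested = headers.get("X-Gateway-Region")
    if requested in REGIONS:
        return requested               # explicit client pin wins
    return DEFAULT_REGION              # unknown or absent: GeoDNS default

r1 = resolve_region({"X-Gateway-Region": "eu-west-1"})
r2 = resolve_region({})
r3 = resolve_region({"X-Gateway-Region": "ap-south-1"})  # unknown: fall back
```

For data-residency enforcement (Lesson 5), the resolved region also constrains which provider list and audit bucket the request may use.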
Frequently Asked Questions
Should we build a custom gateway or use an existing product?
Start with LiteLLM Proxy if you need multi-provider routing and want to self-host. It covers 80% of gateway needs (routing, fallback, basic cost tracking) and takes 30 minutes to deploy. Build custom only if you have specific compliance requirements (data residency, custom PII rules) or need deep integration with internal systems (SSO, internal billing, custom audit formats). Portkey is best if you want zero infrastructure management and are comfortable with your data passing through a third party.
How much latency does the gateway add?
A well-built gateway adds 5-15ms of overhead per request. For cache hits, responses return in under 5ms (vs 500-2000ms from the LLM provider). The latency comes from key validation (0.5ms), rate limit check (0.5ms), PII scan (2ms), cache lookup (1-3ms), and routing logic (0.1ms). This is negligible compared to LLM response times. Use connection pooling, HTTP/2, async Redis, and parallel pre-checks to stay under 15ms.
How should the gateway handle streaming responses?
Forward Server-Sent Events (SSE) chunks immediately as they arrive from the provider. Do not buffer the entire response. Track token usage by accumulating content chunks during the stream and recording the cost after the stream completes. Some providers include usage data in the final SSE chunk; for others, estimate output tokens from the accumulated content length (divide by 4 for English text). Set Cache-Control: no-cache and Connection: keep-alive headers.
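The usage-accounting rule above can be sketched as: prefer the provider's own count from the final chunk, else estimate from accumulated characters at roughly 4 characters per token for English text. The function name is illustrative:

```python
def account_stream(chunks, provider_usage=None):
    """Return output-token count for a completed SSE stream."""
    content = "".join(chunks)          # accumulated while forwarding chunks
    if provider_usage is not None:
        return provider_usage          # provider's own count is authoritative
    return max(1, len(content) // 4)   # rough estimate for English text

est = account_stream(["Hello, ", "how can I ", "help you today?"])
exact = account_stream(["..."], provider_usage=12)
```

The estimate is deliberately coarse; if exact billing matters, run the accumulated text through the provider's tokenizer offline and reconcile.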
How should API keys be rotated without downtime?
Set gateway keys to expire every 90 days. When rotating, create the new key first, then set the old key to expire in 24 hours (grace period). Both keys work during the transition window, so teams can update their config without downtime. Send Slack reminders 14 days, 7 days, and 1 day before expiry. For provider keys (OpenAI, Anthropic), rotate quarterly and store in a secrets manager with automatic injection into the gateway config.
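The grace-period rule reduces to one comparison at validation time: a key is accepted until its expiry plus the 24-hour grace window. A minimal sketch with illustrative timestamps:

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=24)

def key_is_valid(expires_at: datetime, now: datetime) -> bool:
    # Hard cutoff = nominal expiry + grace window
    return now < expires_at + GRACE

now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
old_key_expiry = datetime(2025, 6, 1, 0, 0, tzinfo=timezone.utc)
still_ok = key_is_valid(old_key_expiry, now)            # 12h into grace
too_late = key_is_valid(old_key_expiry, now + GRACE)    # past the window
```

Requests accepted inside the grace window should still emit a deprecation warning header so teams notice before the cutoff.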
How do we convince teams to migrate to the gateway?
Make migration a 2-line code change (base_url and api_key). Show teams the cost dashboard after week 1 — most are shocked at what they are spending. Offer benefits they cannot get with direct API calls: automatic failover (no more OpenAI outage incidents), cost visibility (see exactly which features cost what), and free caching (30% fewer API calls with zero effort). Start with a friendly pilot team, get their testimonial, then use social proof to convince the rest.
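To make the 2-line claim concrete, here is the client-side diff shown as plain config dicts (so the two changed lines are explicit); the gateway URL and key placeholders are illustrative:

```python
# Before: SDK pointed directly at the provider, team-held key
before = {
    "base_url": "https://api.openai.com/v1",
    "api_key": "sk-...team-held provider key...",
}

# After: same SDK, same request/response format, gateway endpoint and key
after = {
    "base_url": "https://ai-gateway.company.com/v1",   # changed line 1
    "api_key": "gw-...gateway-issued key...",          # changed line 2
}

changed = sorted(k for k in before if before[k] != after[k])
```

Everything else in the client (model names, message format, streaming) stays exactly as it was, which is what keeps adoption friction near zero.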
Should streaming responses be cached?
Generally no. Streaming requests are interactive and conversational, where users expect unique responses. Cache non-streaming requests (batch classification, document processing, templated prompts). The one exception: if you detect the same streaming request repeated within seconds (user refreshing the page), serve the complete cached response as a non-streaming response with a header indicating the cache hit.
Does caching work for multi-turn conversations?
Semantic caching works best for single-turn queries (classification, Q&A, summarization). For multi-turn conversations, the context changes with every message, making cache hits unlikely. Use exact-match caching for the system prompt portion (which often stays the same) and skip caching for the user message portion. Some organizations cache at the conversation-step level: if the first 3 messages match, the 4th response might be cacheable.
What infrastructure does the gateway require, and what does it cost?
Minimum: 2 gateway instances (for HA) behind a load balancer, 1 Redis instance (for rate limits and cache), 1 PostgreSQL instance (for cost tracking and audit logs). This handles up to 10,000 requests/minute. At scale (50K+ RPM), add Redis Cluster (3 nodes), PostgreSQL with read replicas, and 4-8 gateway instances. Total infrastructure cost for a medium organization: $200-500/month on AWS, which pays for itself through 5-10% cost savings on AI API spend.
Course Summary
You now have everything you need to build and deploy a production AI gateway:
| Lesson | Component Built | Key Outcome |
|---|---|---|
| 1. Gateway Architecture | Request lifecycle, minimal gateway | Understand the 9-step request flow and build-vs-buy decision |
| 2. Multi-Provider Routing | Router with fallback chains | 99.99% uptime through automatic provider failover |
| 3. Rate Limiting | Redis sliding window limiter | Fair quota distribution across teams with burst protection |
| 4. Cost Control | Budget enforcer, spending dashboard | Per-team budgets with alerts, zero surprise bills |
| 5. Security | PII filter, audit logger, key manager | SOC2/HIPAA compliance with data residency routing |
| 6. Caching | Exact match + semantic cache | 30-60% cost reduction, sub-5ms cache hit latency |
| 7. Best Practices | Deployment checklist, migration guide | 4-week migration plan from direct API calls to gateway |
Lilly Tech Systems