Best Practices & Checklist
This final lesson consolidates everything into an actionable deployment checklist, a step-by-step migration guide for moving teams from direct API calls to your gateway, multi-region deployment patterns, and answers to the most common questions platform engineers ask when building AI gateways.
Gateway Deployment Checklist
Use this checklist before going live. Each item maps to a lesson in this course:
Infrastructure (Lesson 1)
- Gateway deployed behind a load balancer — minimum 2 instances for high availability
- Redis deployed for state — rate limits, budgets, exact-match cache, with persistence enabled
- PostgreSQL deployed for analytics — cost records, audit logs, usage dashboards
- Health check endpoint — /health returns status of all dependencies (Redis, DB, providers)
- Provider API keys in vault — AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault
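The health-check item above can be sketched as a small aggregator: each dependency exposes a probe callable, and the endpoint reports per-dependency status plus an overall verdict. This is a minimal in-process sketch (the function name and the stub probes are illustrative, not from a specific framework):

```python
def check_health(deps: dict) -> dict:
    """Ping every dependency; overall status is 'ok' only if all pass."""
    results = {}
    for name, ping in deps.items():
        try:
            ping()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
    overall = "ok" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": overall, "dependencies": results}

def failing_probe():
    raise TimeoutError("timed out")

# Stub probes standing in for real Redis / PostgreSQL / provider checks:
report = check_health({
    "redis": lambda: None,       # stub: reachable
    "postgres": lambda: None,    # stub: reachable
    "openai": failing_probe,     # stub: provider timing out
})
```

In production the probes would be real pings (Redis `PING`, a trivial SQL query, a provider `/models` call), and the HTTP layer would map "degraded" to a non-200 status so the load balancer can react.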
Routing (Lesson 2)
- At least 2 providers configured — primary + fallback from different vendors
- Fallback chains defined for every model — e.g., GPT-4o falls back to Claude Sonnet
- Circuit breaker enabled — 3 consecutive failures open the circuit for 30s+
- Health check probe for each provider — periodic GET to /models endpoint
- Request format translation tested — OpenAI to Anthropic and Google formats verified
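The circuit-breaker rule from the checklist (three consecutive failures open the circuit for at least 30 seconds) can be sketched in a few lines. The class and method names here are illustrative, not from a specific library; the clock is injectable so the cooldown can be tested without sleeping:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            self.opened_at = None   # half-open: let one request probe
            self.failures = 0
            return True
        return False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0

# Demo with a fake clock:
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
blocked = breaker.allow()       # circuit open: reject immediately
now[0] = 31.0
reopened = breaker.allow()      # cooldown elapsed: probe allowed
```

When `allow()` returns False, the router skips straight to the next provider in the fallback chain instead of waiting on a known-bad one.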
Rate Limiting (Lesson 3)
- Three-level limits configured — organization, team, and per-app
- Token-based limits enabled — not just requests per minute
- Rate limit headers returned — X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After
- Burst policy defined — reject, queue, or borrow from org pool
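As a reference for the items above, here is an in-memory sketch of a sliding-window request limiter; the production version (Lesson 3) keeps this state in Redis so all gateway instances share it. Names are illustrative:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, limit: int, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits = deque()          # timestamps of accepted requests

    def allow(self) -> bool:
        now = self.clock()
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()      # drop requests outside the window
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False

    def remaining(self) -> int:      # feeds the X-RateLimit-Remaining header
        return max(0, self.limit - len(self.hits))

limiter = SlidingWindowLimiter(limit=3, window=60.0)
results = [limiter.allow() for _ in range(4)]   # fourth call exceeds the limit
```

The same shape extends to token-based limits: instead of appending one timestamp per request, append (timestamp, token_count) pairs and compare the window's token sum to the quota.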
Cost Control (Lesson 4)
- Pricing table updated — matches current provider pricing as of this month
- Daily and monthly budgets set per team — warning at 80%, hard limit at 100%
- Cost tracking recording every request — team, user, model, tokens, cost in PostgreSQL
- Spending dashboard deployed — daily/weekly cost by team, model, user
- Anomaly detection query running — alert when team spends 2x their daily average
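Per-request cost attribution (the record written to PostgreSQL) reduces to a lookup against the pricing table. The prices below are placeholders, not current provider rates; the checklist's point is exactly that this table must be kept in sync with published pricing:

```python
PRICING_PER_1M = {  # USD per 1M tokens: (input, output) -- illustrative values
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICING_PER_1M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# 1,000 input + 500 output tokens on the placeholder gpt-4o rates:
cost = request_cost("gpt-4o", input_tokens=1_000, output_tokens=500)
```

This value, tagged with team, user, and model, is what the anomaly query aggregates into daily averages.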
Security (Lesson 5)
- Gateway keys issued to every team — no team has direct provider keys
- Key rotation policy set — 90-day expiry with 24-hour grace period
- PII filter enabled — blocks SSNs, credit cards, API keys; redacts emails, phones
- Audit logging enabled — every request logged to immutable storage (S3 Object Lock)
- Data residency rules configured — EU data to EU endpoints, PHI to BAA providers
- TLS 1.3 on all connections — including internal services
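The block/redact split in the PII item can be sketched with two pattern sets: hard-block anything that looks like an SSN or card number, redact emails in place. These regexes are deliberately simplified for illustration; a real filter needs more patterns (phone numbers, API-key shapes) plus Luhn validation for card numbers:

```python
import re

BLOCK_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
REDACT_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def filter_prompt(text: str):
    for name, pat in BLOCK_PATTERNS.items():
        if pat.search(text):
            return None, f"blocked: {name}"        # reject the request
    for name, pat in REDACT_PATTERNS.items():
        text = pat.sub(f"[{name.upper()}]", text)  # redact and continue
    return text, "ok"

blocked, reason = filter_prompt("My SSN is 123-45-6789")
redacted, status = filter_prompt("Contact alice@example.com please")
```

Blocked requests should also be written to the audit log with the pattern name (never the matched value itself).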
Performance (Lesson 6)
- Exact match cache enabled — Redis, 1-hour default TTL
- Connection pooling configured — persistent HTTP/2 connections to providers
- Streaming support verified — SSE forwarding tested with all providers
- Cache hit rate monitoring — daily report, alert if drops below 10%
- Gateway latency under 15ms — measured at p95 excluding provider response time
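For the exact-match cache item, the key step is canonicalising the request before hashing so that identical requests always map to one Redis key regardless of JSON field order. A minimal sketch, assuming OpenAI-style request bodies:

```python
import hashlib
import json

def cache_key(request: dict) -> str:
    # sort_keys + compact separators make the serialisation canonical
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return "llmcache:" + hashlib.sha256(canonical.encode()).hexdigest()

a = cache_key({"model": "gpt-4o",
               "messages": [{"role": "user", "content": "hi"}],
               "temperature": 0})
b = cache_key({"temperature": 0,
               "model": "gpt-4o",
               "messages": [{"role": "user", "content": "hi"}]})
# a == b: field order does not change the key
```

Only deterministic requests (temperature 0, no sampling) should be cached this way; anything else belongs to the semantic cache from Lesson 6.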
Migration Guide: Direct API Calls to Gateway
Migrating teams from direct API calls to the gateway is the hardest part. Here is a proven 4-week migration plan:
| Week | Actions | Success Criteria |
|---|---|---|
| Week 1 | Deploy gateway in shadow mode (proxy all requests but also log direct API calls), issue gateway keys to pilot team, set up monitoring dashboard | Pilot team running through gateway with zero errors |
| Week 2 | Enable rate limiting and cost tracking for pilot team, gather feedback, tune limits, fix edge cases (streaming, tool use, multimodal) | Rate limiting and cost tracking accurate within 5% |
| Week 3 | Migrate 2-3 more teams, enable PII filtering, deploy spending dashboard, configure alerts, test failover by blocking primary provider | 50% of API traffic through gateway, failover works |
| Week 4 | Migrate remaining teams, revoke direct provider API keys, enable caching, send weekly cost reports to team leads | 100% of traffic through gateway, direct keys revoked |
The single most important rule: require only a two-line code change, base_url and api_key. Everything else (model names, request format, response format) stays identical. If you require teams to change their code beyond these two lines, adoption will stall.
Multi-Region Deployment
For organizations with global users or strict data residency requirements, deploy the gateway in multiple regions:
```python
# Multi-region gateway architecture
#
# US Users ──> [Gateway US-East] ──> OpenAI (US)
#                                ──> Anthropic (US)
#                                ──> Local LLM (US datacenter)
#
# EU Users ──> [Gateway EU-West] ──> Azure OpenAI (EU)
#                                ──> Anthropic (EU endpoint)
#                                ──> Local LLM (EU datacenter)
#
# Shared state:
# - Redis Cluster (cross-region replication for rate limits)
# - PostgreSQL (primary US, read replica EU for dashboards)
# - S3 (audit logs in region-specific buckets)

REGION_CONFIG = {
    "us-east-1": {
        "providers": ["openai", "anthropic", "google", "local-us"],
        "redis": "redis://redis-us.internal:6379",
        "database": "postgresql://db-us.internal/gateway",
        "audit_bucket": "s3://gateway-audit-us",
    },
    "eu-west-1": {
        "providers": ["azure-openai-eu", "anthropic-eu", "local-eu"],
        "redis": "redis://redis-eu.internal:6379",
        "database": "postgresql://db-eu.internal/gateway",
        "audit_bucket": "s3://gateway-audit-eu",
    },
}

# DNS-based routing: ai-gateway.company.com
#   GeoDNS resolves to nearest region
# OR: explicit region headers from clients
#   X-Gateway-Region: eu-west-1
```
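Region resolution for a request can then be sketched as: honour an explicit X-Gateway-Region header when it names a known region, otherwise fall back to the GeoDNS-assigned default. The region names mirror the config above; the function name and default are illustrative:

```python
REGIONS = {"us-east-1", "eu-west-1"}   # keys of REGION_CONFIG
DEFAULT_REGION = "us-east-1"           # what GeoDNS resolved for this instance

def resolve_region(headers: dict) -> str:
    requested = headers.get("X-Gateway-Region")
    if requested in REGIONS:
        return requested               # explicit client pin wins
    return DEFAULT_REGION              # unknown or absent: GeoDNS default

r1 = resolve_region({"X-Gateway-Region": "eu-west-1"})
r2 = resolve_region({})
r3 = resolve_region({"X-Gateway-Region": "ap-south-1"})  # unknown: fall back
```

For data-residency enforcement (Lesson 5), the resolved region also constrains which provider list and audit bucket the request may use.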
Frequently Asked Questions
Should we build a custom gateway or use an existing product?
Start with LiteLLM Proxy if you need multi-provider routing and want to self-host. It covers 80% of gateway needs (routing, fallback, basic cost tracking) and takes 30 minutes to deploy. Build custom only if you have specific compliance requirements (data residency, custom PII rules) or need deep integration with internal systems (SSO, internal billing, custom audit formats). Portkey is best if you want zero infrastructure management and are comfortable with your data passing through a third party.
How much latency does the gateway add?
A well-built gateway adds 5-15ms of overhead per request. For cache hits, responses return in under 5ms (vs 500-2000ms from the LLM provider). The latency comes from key validation (0.5ms), rate limit check (0.5ms), PII scan (2ms), cache lookup (1-3ms), and routing logic (0.1ms). This is negligible compared to LLM response times. Use connection pooling, HTTP/2, async Redis, and parallel pre-checks to stay under 15ms.
How should the gateway handle streaming responses?
Forward Server-Sent Events (SSE) chunks immediately as they arrive from the provider. Do not buffer the entire response. Track token usage by accumulating content chunks during the stream and recording the cost after the stream completes. Some providers include usage data in the final SSE chunk; for others, estimate output tokens from the accumulated content length (divide by 4 for English text). Set Cache-Control: no-cache and Connection: keep-alive headers.
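The usage-accounting rule above can be sketched as: prefer the provider's own count from the final chunk, else estimate from accumulated characters at roughly 4 characters per token for English text. The function name is illustrative:

```python
def account_stream(chunks, provider_usage=None):
    """Return output-token count for a completed SSE stream."""
    content = "".join(chunks)          # accumulated while forwarding chunks
    if provider_usage is not None:
        return provider_usage          # provider's own count is authoritative
    return max(1, len(content) // 4)   # rough estimate for English text

est = account_stream(["Hello, ", "how can I ", "help you today?"])
exact = account_stream(["..."], provider_usage=12)
```

The estimate is deliberately coarse; if exact billing matters, run the accumulated text through the provider's tokenizer offline and reconcile.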
How should API keys be rotated without downtime?
Set gateway keys to expire every 90 days. When rotating, create the new key first, then set the old key to expire in 24 hours (grace period). Both keys work during the transition window, so teams can update their config without downtime. Send Slack reminders 14 days, 7 days, and 1 day before expiry. For provider keys (OpenAI, Anthropic), rotate quarterly and store in a secrets manager with automatic injection into the gateway config.
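The grace-period rule reduces to one comparison at validation time: a key is accepted until its expiry plus the 24-hour grace window. A minimal sketch with illustrative timestamps:

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=24)

def key_is_valid(expires_at: datetime, now: datetime) -> bool:
    # Hard cutoff = nominal expiry + grace window
    return now < expires_at + GRACE

now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
old_key_expiry = datetime(2025, 6, 1, 0, 0, tzinfo=timezone.utc)
still_ok = key_is_valid(old_key_expiry, now)            # 12h into grace
too_late = key_is_valid(old_key_expiry, now + GRACE)    # past the window
```

Requests accepted inside the grace window should still emit a deprecation warning header so teams notice before the cutoff.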
How do we convince teams to migrate to the gateway?
Make migration a 2-line code change (base_url and api_key). Show teams the cost dashboard after week 1 — most are shocked at what they are spending. Offer benefits they cannot get with direct API calls: automatic failover (no more OpenAI outage incidents), cost visibility (see exactly which features cost what), and free caching (30% fewer API calls with zero effort). Start with a friendly pilot team, get their testimonial, then use social proof to convince the rest.
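To make the 2-line claim concrete, here is the client-side diff shown as plain config dicts (so the two changed lines are explicit); the gateway URL and key placeholders are illustrative:

```python
# Before: SDK pointed directly at the provider, team-held key
before = {
    "base_url": "https://api.openai.com/v1",
    "api_key": "sk-...team-held provider key...",
}

# After: same SDK, same request/response format, gateway endpoint and key
after = {
    "base_url": "https://ai-gateway.company.com/v1",   # changed line 1
    "api_key": "gw-...gateway-issued key...",          # changed line 2
}

changed = sorted(k for k in before if before[k] != after[k])
```

Everything else in the client (model names, message format, streaming) stays exactly as it was, which is what keeps adoption friction near zero.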
Should streaming responses be cached?
Generally no. Streaming requests are interactive and conversational, where users expect unique responses. Cache non-streaming requests (batch classification, document processing, templated prompts). The one exception: if you detect the same streaming request repeated within seconds (user refreshing the page), serve the complete cached response as a non-streaming response with a header indicating the cache hit.
Does caching work for multi-turn conversations?
Semantic caching works best for single-turn queries (classification, Q&A, summarization). For multi-turn conversations, the context changes with every message, making cache hits unlikely. Use exact-match caching for the system prompt portion (which often stays the same) and skip caching for the user message portion. Some organizations cache at the conversation-step level: if the first 3 messages match, the 4th response might be cacheable.
What infrastructure does the gateway require, and what does it cost?
Minimum: 2 gateway instances (for HA) behind a load balancer, 1 Redis instance (for rate limits and cache), 1 PostgreSQL instance (for cost tracking and audit logs). This handles up to 10,000 requests/minute. At scale (50K+ RPM), add Redis Cluster (3 nodes), PostgreSQL with read replicas, and 4-8 gateway instances. Total infrastructure cost for a medium organization: $200-500/month on AWS, which pays for itself through 5-10% cost savings on AI API spend.
Course Summary
You now have everything you need to build and deploy a production AI gateway:
| Lesson | Component Built | Key Outcome |
|---|---|---|
| 1. Gateway Architecture | Request lifecycle, minimal gateway | Understand the 9-step request flow and build-vs-buy decision |
| 2. Multi-Provider Routing | Router with fallback chains | 99.99% uptime through automatic provider failover |
| 3. Rate Limiting | Redis sliding window limiter | Fair quota distribution across teams with burst protection |
| 4. Cost Control | Budget enforcer, spending dashboard | Per-team budgets with alerts, zero surprise bills |
| 5. Security | PII filter, audit logger, key manager | SOC2/HIPAA compliance with data residency routing |
| 6. Caching | Exact match + semantic cache | 30-60% cost reduction, sub-5ms cache hit latency |
| 7. Best Practices | Deployment checklist, migration guide | 4-week migration plan from direct API calls to gateway |
Lilly Tech Systems