AI Gateway Architecture
Organizations that adopt AI APIs tend to hit the same problem: dozens of teams calling OpenAI, Anthropic, and Google directly, with no visibility into costs, no shared rate limiting, and no way to enforce security policies. An AI gateway addresses this by putting a single control plane in front of every AI API call in your organization.
Why You Need an AI Gateway
Without a gateway, every team manages its own API keys, tracks its own costs, and builds its own retry logic. The resulting problems compound with every team you add:
| Problem | Without Gateway | With Gateway |
|---|---|---|
| API Key Management | Keys scattered across repos, .env files, Slack messages | Single vault, rotated automatically, never exposed to apps |
| Cost Visibility | Monthly surprise bills, no per-team attribution | Real-time dashboards, per-team budgets, alerts at 80% |
| Rate Limiting | One team's batch job exhausts org-wide quotas | Per-team limits, priority queues, burst protection |
| Compliance | No audit trail, PII sent to external APIs unchecked | Every request logged, PII filtered, data residency enforced |
| Reliability | Each app builds its own retry/fallback logic | Automatic failover across providers, zero app changes |
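To make the "per-team budgets, alerts at 80%" cell concrete, here is a minimal sketch of threshold alerting. The `TeamBudget` class, its field names, and the dollar figures are illustrative assumptions, not the API of any particular gateway product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TeamBudget:
    """Tracks month-to-date spend for one team against a fixed budget (hypothetical helper)."""
    team: str
    monthly_limit_usd: float
    spent_usd: float = 0.0
    alert_threshold: float = 0.80  # fraction of the budget that triggers an alert
    alerted: bool = False

    def record(self, cost_usd: float) -> Optional[str]:
        """Add a request's cost; return a message the first time the threshold is crossed."""
        self.spent_usd += cost_usd
        used = self.spent_usd / self.monthly_limit_usd
        if used >= self.alert_threshold and not self.alerted:
            self.alerted = True
            return f"{self.team} has used {used:.0%} of its ${self.monthly_limit_usd:.2f} budget"
        return None


budget = TeamBudget(team="frontend", monthly_limit_usd=100.0)
first = budget.record(50.0)   # 50% used, below the threshold
second = budget.record(35.0)  # 85% used, crosses the threshold
third = budget.record(5.0)    # 90% used, but the alert was already sent
```

A real gateway would persist the running total and reset it each billing period; the in-memory counter here only shows the threshold check.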
Gateway Components
A production AI gateway has six core components. Each handles a specific concern in the request lifecycle:
# AI Gateway request lifecycle (pseudocode)
async def handle_request(request):
    # 1. AUTHENTICATION - Verify the caller
    app = authenticate(request.headers.get("X-Gateway-Key"))
    if not app:
        return Response(401, "Invalid gateway key")

    # 2. RATE LIMITING - Check quotas before processing
    # check() returns a decision object rather than a bare bool,
    # so retry_after is available when the request is rejected
    decision = await rate_limiter.check(
        app_id=app.id,
        team_id=app.team_id,
        tokens_requested=estimate_tokens(request.body),
    )
    if not decision.allowed:
        return Response(429, "Rate limit exceeded", retry_after=decision.retry_after)

    # 3. SECURITY - Scrub sensitive data from the request
    sanitized_body = pii_filter.scrub(request.body)
    audit_logger.log_request(app, sanitized_body)

    # 4. ROUTING - Pick the best provider for this request
    provider = router.select(
        model=request.body["model"],
        priority=app.priority,
        requirements=request.body.get("requirements", {}),
    )

    # 5. CACHING - Check if we have a cached response
    cache_key = cache.compute_key(sanitized_body)
    cached = await cache.get(cache_key)
    if cached:
        cost_tracker.record(app, cached.tokens, cached.cost, source="cache")
        return cached.response

    # 6. PROXY - Forward to provider and track costs
    response = await provider.complete(sanitized_body)
    cost = cost_tracker.calculate(provider, response.usage)
    cost_tracker.record(app, response.usage, cost, source="api")
    await cache.set(cache_key, response, ttl=3600)
    return response
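The lifecycle above leaves the rate limiter in step 2 undefined. One common way to implement such a check is a token bucket, sketched below under stated assumptions: the `TokenBucket` and `Decision` names, the capacity, and the refill rate are all hypothetical, and the bucket counts estimated LLM tokens rather than requests. The optional `now` argument exists only so the example runs deterministically.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    allowed: bool
    retry_after: float = 0.0  # seconds to wait before retrying


class TokenBucket:
    """Holds up to `capacity` LLM tokens and refills at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float, now: Optional[float] = None):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def check(self, requested: float, now: Optional[float] = None) -> Decision:
        now = time.monotonic() if now is None else now
        # Refill for the time elapsed since the last check, capped at capacity
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if requested > self.capacity:
            # Can never be satisfied, no matter how long the caller waits
            return Decision(allowed=False, retry_after=float("inf"))
        if requested <= self.tokens:
            self.tokens -= requested
            return Decision(allowed=True)
        # Tell the caller how long the missing tokens take to refill
        return Decision(allowed=False, retry_after=(requested - self.tokens) / self.rate)


# One bucket per team; explicit `now` values keep the example deterministic
buckets = {"frontend": TokenBucket(capacity=1000, rate=100, now=0.0)}
first = buckets["frontend"].check(800, now=0.0)   # allowed, 200 tokens left
second = buckets["frontend"].check(500, now=0.0)  # denied, 300 tokens short
third = buckets["frontend"].check(500, now=3.0)   # allowed after 3 s of refill
```

Keeping one bucket per team is what stops a single batch job from draining an org-wide quota. A multi-instance gateway would need to hold this state in a shared store rather than in process memory.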
Build vs Buy Decision
The gateway space has matured significantly. Here is how the main options compare for production use:
| Option | Best For | Pros | Cons |
|---|---|---|---|
| LiteLLM Proxy | Teams wanting open-source with full control | 100+ model providers, OpenAI-compatible API, self-hosted, Python | You own ops, scaling, and uptime |
| Portkey | Teams wanting managed service with analytics | Real-time analytics, guardrails, caching built-in, low latency | SaaS dependency, data leaves your network |
| Amazon Bedrock | AWS-native organizations | IAM integration, VPC endpoints, no API key management | Limited to Bedrock models, AWS lock-in |
| Custom Gateway | Unique compliance or routing requirements | Total control over every component | 6-12 months to build properly, ongoing maintenance |
Gateway vs Direct API Calls
The most common objection to gateways is "it adds latency." Here is the real comparison:
# Direct API call - what most teams start with
import openai

prompt = "Hello"  # example input

client = openai.OpenAI(api_key="sk-proj-xxx")  # Key in code/env
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
# No cost tracking, no rate limiting, no audit log, no fallback

# -------------------------------------------------------

# Through AI gateway - same OpenAI SDK, different base URL
import openai

client = openai.OpenAI(
    api_key="gw-team-frontend-abc123",  # Gateway key, not provider key
    base_url="https://ai-gateway.internal.company.com/v1",
)
response = client.chat.completions.create(
    model="gpt-4o",  # Gateway resolves to best available provider
    messages=[{"role": "user", "content": prompt}],
)
# Cost tracked, rate limited, PII filtered, audit logged, cached
The gateway adds roughly 5-15 ms of latency, which is negligible next to LLM response times of 500-2000 ms. In exchange, teams avoid weeks of building infrastructure code, and the organization gains the controls it needs to stop wasted API spend before it reaches thousands of dollars.
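As a quick sanity check on "negligible", the snippet below works out the gateway's share of total round-trip time at both ends of the ranges quoted above. The `overhead_fraction` helper is hypothetical, and the inputs are simply the figures from the text, not new measurements.

```python
def overhead_fraction(gateway_ms: float, llm_ms: float) -> float:
    """Gateway latency as a fraction of the total round trip."""
    return gateway_ms / (gateway_ms + llm_ms)


best = overhead_fraction(5, 2000)   # small gateway hop, slow completion
worst = overhead_fraction(15, 500)  # large gateway hop, fast completion
print(f"{best:.2%} to {worst:.2%}")  # prints "0.25% to 2.91%"
```

Even in the worst case of those ranges, the gateway accounts for under 3% of the time the caller waits.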
Minimal Gateway in 50 Lines
Here is the simplest possible AI gateway using FastAPI. This is not production-ready, but it shows the core pattern every gateway follows:
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
from httpx import AsyncClient
import time

app = FastAPI()
http = AsyncClient()

# In production: load from database or config service
API_KEYS = {
    "gw-team-a-key": {"team": "frontend", "provider_key": "sk-..."},
    "gw-team-b-key": {"team": "backend", "provider_key": "sk-..."},
}
request_log = []  # In-memory only; lost on restart


@app.post("/v1/chat/completions")
async def proxy_completion(request: Request):
    # 1. Authenticate
    gw_key = request.headers.get("Authorization", "").removeprefix("Bearer ")
    config = API_KEYS.get(gw_key)
    if not config:
        raise HTTPException(401, "Invalid gateway key")

    body = await request.json()
    start = time.perf_counter()  # monotonic clock for measuring durations

    # 2. Forward to OpenAI (swap provider_key)
    # Note: streaming requests ("stream": true) are not handled here
    response = await http.post(
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {config['provider_key']}",
            "Content-Type": "application/json",
        },
        json=body,
        timeout=120.0,
    )
    result = response.json()
    latency = time.perf_counter() - start

    # 3. Log everything
    request_log.append({
        "team": config["team"],
        "model": body.get("model"),
        "tokens": result.get("usage", {}),
        "latency_ms": round(latency * 1000),
        "timestamp": time.time(),
    })

    # Pass the provider's status code through so clients see upstream errors
    return JSONResponse(content=result, status_code=response.status_code)
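Once every call lands in `request_log`, per-team attribution is a simple aggregation. The sketch below assumes each entry's `tokens` field holds the `usage` object from an OpenAI chat completion, with `prompt_tokens` and `completion_tokens`. The `tokens_by_team` helper and the sample entries are made up for illustration.

```python
from collections import defaultdict


def tokens_by_team(log):
    """Sum request counts and token usage per team from gateway log entries."""
    totals = defaultdict(
        lambda: {"requests": 0, "prompt_tokens": 0, "completion_tokens": 0}
    )
    for entry in log:
        usage = entry.get("tokens") or {}  # failed calls may carry no usage data
        row = totals[entry["team"]]
        row["requests"] += 1
        row["prompt_tokens"] += usage.get("prompt_tokens", 0)
        row["completion_tokens"] += usage.get("completion_tokens", 0)
    return dict(totals)


# Hand-written entries shaped like the ones the gateway appends
sample_log = [
    {"team": "frontend", "model": "gpt-4o",
     "tokens": {"prompt_tokens": 120, "completion_tokens": 80}},
    {"team": "frontend", "model": "gpt-4o",
     "tokens": {"prompt_tokens": 30, "completion_tokens": 20}},
    {"team": "backend", "model": "gpt-4o", "tokens": {}},  # e.g. an upstream error
]
report = tokens_by_team(sample_log)
```

Multiplying these totals by each model's per-token price would turn the report into the cost dashboard described earlier; prices are left out here because they vary by provider and change over time.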
Key Takeaways
- An AI gateway is a reverse proxy that centralizes authentication, routing, rate limiting, cost tracking, and compliance for all AI API calls.
- Without a gateway, organizations lose visibility into costs, scatter API keys across codebases, and duplicate infrastructure logic in every team.
- The gateway adds 5-15ms of latency but saves weeks of engineering time and prevents cost overruns.
- Start with LiteLLM Proxy if you want an open-source, self-hosted gateway, or Portkey if you want a managed service. Build a custom gateway only when you have unique compliance or routing requirements.
- Applications keep using the OpenAI SDK. The only changes are the base_url and the API key.
What Is Next
In the next lesson, we will build the multi-provider routing engine — the brain of the gateway. You will implement load balancing across providers, fallback chains for reliability, cost-based routing to minimize spend, and model capability matching to send requests to the right model.