Beginner

AI Gateway Architecture

Every organization that uses AI APIs eventually faces the same problem: dozens of teams calling OpenAI, Anthropic, and Google directly, with no visibility into costs, no rate limiting, and no way to enforce security policies. An AI gateway solves all of this by creating a single control plane for every AI API call in your organization.

Why You Need an AI Gateway

Without a gateway, each team manages its own API keys, tracks its own costs, and builds its own retry logic. The resulting problems compound as the number of teams grows:

| Problem | Without Gateway | With Gateway |
|---|---|---|
| API Key Management | Keys scattered across repos, .env files, Slack messages | Single vault, rotated automatically, never exposed to apps |
| Cost Visibility | Monthly surprise bills, no per-team attribution | Real-time dashboards, per-team budgets, alerts at 80% |
| Rate Limiting | One team's batch job exhausts org-wide quotas | Per-team limits, priority queues, burst protection |
| Compliance | No audit trail, PII sent to external APIs unchecked | Every request logged, PII filtered, data residency enforced |
| Reliability | Each app builds its own retry/fallback logic | Automatic failover across providers, zero app changes |
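
The "per-team budgets, alerts at 80%" row can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular gateway implements it: the `BudgetTracker` class, its method names, and the dollar figures are all hypothetical, and a real gateway would persist spend in a database.

```python
from collections import defaultdict


class BudgetTracker:
    """Tracks spend per team and reports when a team crosses an alert threshold."""

    def __init__(self, budgets_usd, alert_ratio=0.8):
        self.budgets_usd = budgets_usd        # e.g. {"frontend": 100.0} (made-up figures)
        self.alert_ratio = alert_ratio        # 0.8 mirrors the "alerts at 80%" row
        self.spent_usd = defaultdict(float)
        self.alerted = set()                  # teams that have already triggered an alert

    def record(self, team, cost_usd):
        """Add a cost; return an alert message the first time the threshold is crossed."""
        self.spent_usd[team] += cost_usd
        budget = self.budgets_usd[team]
        crossed = self.spent_usd[team] >= budget * self.alert_ratio
        if crossed and team not in self.alerted:
            self.alerted.add(team)
            return f"{team} has used {self.spent_usd[team] / budget:.0%} of its budget"
        return None


tracker = BudgetTracker({"frontend": 100.0, "backend": 50.0})
first = tracker.record("frontend", 60.0)   # 60% of budget: no alert
second = tracker.record("frontend", 25.0)  # 85% of budget: alert fires once
third = tracker.record("frontend", 5.0)    # already alerted: no repeat
```

Because every call passes through the gateway, a single `record` call per request is enough to attribute spend to a team; without a gateway each application would have to report its own usage.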

Gateway Components

A production AI gateway has six core components. Each handles a specific concern in the request lifecycle:

# AI Gateway request lifecycle (pseudocode)

async def handle_request(request):
    # 1. AUTHENTICATION - Verify the caller
    app = authenticate(request.headers["X-Gateway-Key"])
    if not app:
        return Response(401, "Invalid gateway key")

    # 2. RATE LIMITING - Check quotas before processing
    decision = await rate_limiter.check(
        app_id=app.id,
        team_id=app.team_id,
        tokens_requested=estimate_tokens(request.body)
    )
    if not decision.allowed:
        return Response(429, "Rate limit exceeded", retry_after=decision.retry_after)

    # 3. SECURITY - Scrub sensitive data from the request
    sanitized_body = pii_filter.scrub(request.body)
    audit_logger.log_request(app, sanitized_body)

    # 4. ROUTING - Pick the best provider for this request
    provider = router.select(
        model=request.body["model"],
        priority=app.priority,
        requirements=request.body.get("requirements", {})
    )

    # 5. CACHING - Check if we have a cached response
    cache_key = cache.compute_key(sanitized_body)
    cached = await cache.get(cache_key)
    if cached:
        cost_tracker.record(app, cached.tokens, cached.cost, source="cache")
        return cached.response

    # 6. PROXY - Forward to provider and track costs
    response = await provider.complete(sanitized_body)
    cost = cost_tracker.calculate(provider, response.usage)
    cost_tracker.record(app, response.usage, cost, source="api")

    await cache.set(cache_key, response, ttl=3600)
    return response
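
The rate-limiting step is the least obvious of the six, so here is one way it could work. This sketch assumes a token-bucket algorithm, which the lesson does not prescribe; `Decision`, `TokenBucket`, and the injectable `clock` parameter are hypothetical names chosen for illustration. The bucket returns both a verdict and a `retry_after` hint, matching the 429 response in the lifecycle above.

```python
import time
from dataclasses import dataclass


@dataclass
class Decision:
    allowed: bool
    retry_after: float = 0.0  # seconds until enough tokens will be available


class TokenBucket:
    """Refills at a steady rate and allows bursts up to `capacity` tokens."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock                 # injectable so the example is deterministic
        self.tokens = float(capacity)
        self.updated = clock()

    def check(self, tokens_requested):
        now = self.clock()
        # Refill for the time elapsed since the last check, capped at capacity
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if tokens_requested <= self.tokens:
            self.tokens -= tokens_requested
            return Decision(allowed=True)
        shortfall = tokens_requested - self.tokens
        return Decision(allowed=False, retry_after=shortfall / self.refill_per_sec)


# Usage with a fake clock: one bucket per team in a real gateway
fake_now = [0.0]
bucket = TokenBucket(capacity=1000, refill_per_sec=100, clock=lambda: fake_now[0])
d1 = bucket.check(800)   # allowed, 200 tokens remain
d2 = bucket.check(500)   # denied, 300 tokens short
fake_now[0] = 3.0        # three seconds later the bucket has refilled by 300
d3 = bucket.check(500)   # allowed
```

Note that a request larger than `capacity` can never succeed in this sketch; a production limiter would reject such requests outright rather than return a `retry_after` that cannot be honored.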

Build vs Buy Decision

The gateway space has matured significantly. Here is how the main options compare for production use:

| Option | Best For | Pros | Cons |
|---|---|---|---|
| LiteLLM Proxy | Teams wanting open-source with full control | 100+ model providers, OpenAI-compatible API, self-hosted, Python | You own ops, scaling, and uptime |
| Portkey | Teams wanting managed service with analytics | Real-time analytics, guardrails, caching built-in, low latency | SaaS dependency, data leaves your network |
| AWS Bedrock | AWS-native organizations | IAM integration, VPC endpoints, no API key management | Limited to Bedrock models, AWS lock-in |
| Custom Gateway | Unique compliance or routing requirements | Total control over every component | 6-12 months to build properly, ongoing maintenance |
💡 Apply at work: Start with LiteLLM Proxy if you need multi-provider support and want to self-host. It takes about 30 minutes to deploy with Docker and gives you an OpenAI-compatible endpoint that routes to any provider. You can always build custom components on top of it.

Gateway vs Direct API Calls

The most common objection to gateways is "it adds latency." Before weighing that cost, here is what actually changes in application code:

# Direct API call - what most teams start with
import openai

prompt = "Summarize this support ticket."  # Placeholder input for the example

client = openai.OpenAI(api_key="sk-proj-xxx")  # Key in code/env

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
# No cost tracking, no rate limiting, no audit log, no fallback

# -------------------------------------------------------

# Through AI gateway - same OpenAI SDK, different base URL
import openai

client = openai.OpenAI(
    api_key="gw-team-frontend-abc123",  # Gateway key, not provider key
    base_url="https://ai-gateway.internal.company.com/v1"
)

response = client.chat.completions.create(
    model="gpt-4o",  # Gateway resolves to best available provider
    messages=[{"role": "user", "content": prompt}]
)
# Cost tracked, rate limited, PII filtered, audit logged, cached

The extra network hop through the gateway adds roughly 5-15ms of latency, which is negligible next to LLM response times of 500-2000ms. In exchange, teams avoid weeks of building infrastructure code and the organization can prevent thousands of dollars in wasted API costs.
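
To put the latency figures in proportion, the overhead can be expressed as a share of the provider's response time. The helper below is a hypothetical one-liner that uses only the ranges quoted above; it is arithmetic on those figures, not a measurement.

```python
def overhead_pct(gateway_ms, llm_ms):
    """Gateway latency as a percentage of the provider's response time."""
    return gateway_ms * 100 / llm_ms


best_case = overhead_pct(5, 2000)    # smallest overhead against the slowest response
worst_case = overhead_pct(15, 500)   # largest overhead against the fastest response
print(f"{best_case:.2f}% to {worst_case:.2f}%")  # prints "0.25% to 3.00%"
```

Even in the least favorable pairing, the gateway accounts for about 3% of the total time a caller waits.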

Minimal Gateway in 50 Lines

Here is the simplest possible AI gateway using FastAPI. This is not production-ready, but it shows the core pattern every gateway follows:

from fastapi import FastAPI, Request, HTTPException
from httpx import AsyncClient
import time

app = FastAPI()
http = AsyncClient()

# In production: load from database or config service
API_KEYS = {
    "gw-team-a-key": {"team": "frontend", "provider_key": "sk-..."},
    "gw-team-b-key": {"team": "backend", "provider_key": "sk-..."},
}
request_log = []

@app.post("/v1/chat/completions")
async def proxy_completion(request: Request):
    # 1. Authenticate
    gw_key = request.headers.get("Authorization", "").removeprefix("Bearer ")  # Python 3.9+
    config = API_KEYS.get(gw_key)
    if not config:
        raise HTTPException(401, "Invalid gateway key")

    body = await request.json()
    start = time.time()

    # 2. Forward to OpenAI (swap provider_key)
    response = await http.post(
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {config['provider_key']}",
            "Content-Type": "application/json",
        },
        json=body,
        timeout=120.0
    )

    result = response.json()
    latency = time.time() - start

    # 3. Log everything
    request_log.append({
        "team": config["team"],
        "model": body.get("model"),
        "tokens": result.get("usage", {}),
        "latency_ms": round(latency * 1000),
        "timestamp": time.time(),
    })

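    # Pass provider errors through with their status code instead of returning 200
    if response.status_code >= 400:
        raise HTTPException(response.status_code, detail=result)
    return result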
📝 Production reality: The minimal gateway above handles the core proxy pattern, but production gateways need persistent storage for logs, Redis for rate limiting, proper error handling with retries, streaming support for SSE responses, and health checks. We will build all of these components in the following lessons.
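
As a taste of the "error handling with retries" item, here is a minimal sketch of retrying a provider call with exponential backoff and jitter. It is an illustration under assumptions, not the implementation the later lessons build: `TransientError`, `with_retries`, and the `sleep` parameter are hypothetical names, and a real gateway would map HTTP 429 and 5xx responses onto the retryable error.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. HTTP 429 or 5xx)."""


async def with_retries(call, max_attempts=3, base_delay=0.5, sleep=asyncio.sleep):
    """Run `call` until it succeeds or attempts run out, backing off exponentially."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # 0.5s, 1s, 2s, ... plus up to 10% jitter so clients do not retry in lockstep
            delay = base_delay * 2 ** (attempt - 1)
            await sleep(delay + random.uniform(0, delay * 0.1))


# Usage: a fake provider that fails twice, then succeeds
attempts = []


async def flaky_provider():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("provider returned 503")
    return {"status": "ok"}


async def no_sleep(_seconds):
    """Stand-in for asyncio.sleep so the example finishes instantly."""
    return None


result = asyncio.run(with_retries(flaky_provider, sleep=no_sleep))
```

In the minimal gateway, the `http.post` call is the piece that would be wrapped this way, with non-retryable errors such as 400 or 401 passed straight through.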

Key Takeaways

  • An AI gateway is a reverse proxy that centralizes authentication, routing, rate limiting, cost tracking, and compliance for all AI API calls.
  • Without a gateway, organizations lose visibility into costs, scatter API keys across codebases, and duplicate infrastructure logic in every team.
  • The gateway adds 5-15ms of latency but saves weeks of engineering time and prevents cost overruns.
  • Start with LiteLLM Proxy for an open-source, self-hosted gateway or Portkey for a managed service; build a custom gateway only when you have unique compliance or routing requirements.
  • Applications use the OpenAI SDK with a different base_url — no code changes needed beyond swapping the endpoint and API key.

What Is Next

In the next lesson, we will build the multi-provider routing engine — the brain of the gateway. You will implement load balancing across providers, fallback chains for reliability, cost-based routing to minimize spend, and model capability matching to send requests to the right model.