Beginner

AI Gateway Architecture

Every organization that uses AI APIs eventually faces the same problem: dozens of teams calling OpenAI, Anthropic, and Google directly, with no visibility into costs, no rate limiting, and no way to enforce security policies. An AI gateway addresses these problems by putting a single control plane in front of every AI API call in your organization.

Why You Need an AI Gateway

Without a gateway, every team manages its own API keys, tracks its own costs, and builds its own retry logic. The resulting problems compound as the number of teams grows:

Problem	Without Gateway	With Gateway
API Key Management	Keys scattered across repos, .env files, Slack messages	Single vault, rotated automatically, never exposed to apps
Cost Visibility	Monthly surprise bills, no per-team attribution	Real-time dashboards, per-team budgets, alerts at 80%
Rate Limiting	One team's batch job exhausts org-wide quotas	Per-team limits, priority queues, burst protection
Compliance	No audit trail, PII sent to external APIs unchecked	Every request logged, PII filtered, data residency enforced
Reliability	Each app builds its own retry/fallback logic	Automatic failover across providers, zero app changes
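To make the "per-team budgets, alerts at 80%" row concrete, here is a minimal sketch of budget tracking. The `TeamBudget` class, its field names, and the dollar amounts are hypothetical and chosen only for illustration; a real gateway would persist spend in a database rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class TeamBudget:
    """Tracks month-to-date spend for one team against a fixed budget (hypothetical helper)."""
    team: str
    monthly_budget_usd: float
    alert_threshold: float = 0.80  # alert once 80% of the budget is spent
    spent_usd: float = 0.0
    alerts: list = field(default_factory=list)

    def record(self, cost_usd: float) -> None:
        """Add one request's cost and emit a single alert when the threshold is crossed."""
        before = self.spent_usd
        self.spent_usd += cost_usd
        limit = self.monthly_budget_usd * self.alert_threshold
        # Fire only on the request that crosses the threshold, not on every later one.
        if before < limit <= self.spent_usd:
            used = self.spent_usd / self.monthly_budget_usd
            self.alerts.append(f"{self.team} has used {used:.0%} of its budget")

    def over_budget(self) -> bool:
        return self.spent_usd >= self.monthly_budget_usd


budget = TeamBudget(team="frontend", monthly_budget_usd=100.0)
for cost in (50.0, 25.0, 10.0):
    budget.record(cost)

print(budget.alerts)         # one alert, raised by the third request
print(budget.over_budget())  # False: 85 of 100 dollars spent
```

Because the gateway sees every request, this check can run in one place instead of in each application.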

Gateway Components

A production AI gateway has six core components. Each handles a specific concern in the request lifecycle:

# AI Gateway request lifecycle (pseudocode)

async def handle_request(request):
    # 1. AUTHENTICATION - Verify the caller
    app = authenticate(request.headers["X-Gateway-Key"])
    if not app:
        return Response(401, "Invalid gateway key")

    # 2. RATE LIMITING - Check quotas before processing
    # check() returns a decision object so a denial can carry retry_after
    decision = await rate_limiter.check(
        app_id=app.id,
        team_id=app.team_id,
        tokens_requested=estimate_tokens(request.body)
    )
    if not decision.allowed:
        return Response(429, "Rate limit exceeded", retry_after=decision.retry_after)

    # 3. SECURITY - Scrub sensitive data from the request
    sanitized_body = pii_filter.scrub(request.body)
    audit_logger.log_request(app, sanitized_body)

    # 4. ROUTING - Pick the best provider for this request
    provider = router.select(
        model=request.body["model"],
        priority=app.priority,
        requirements=request.body.get("requirements", {})
    )

    # 5. CACHING - Check if we have a cached response
    cache_key = cache.compute_key(sanitized_body)
    cached = await cache.get(cache_key)
    if cached:
        cost_tracker.record(app, cached.tokens, cached.cost, source="cache")
        return cached.response

    # 6. PROXY - Forward to provider and track costs
    response = await provider.complete(sanitized_body)
    cost = cost_tracker.calculate(provider, response.usage)
    cost_tracker.record(app, response.usage, cost, source="api")

    await cache.set(cache_key, response, ttl=3600)
    return response
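Step 2 of the lifecycle needs a rate limiter that reports both whether a request may proceed and how long the caller should wait if it may not. One common way to do that is a token bucket, sketched below. `RateDecision`, `TokenBucket`, and the capacity and refill numbers are assumptions for illustration, not part of any specific gateway; the clock is injectable so the example runs deterministically.

```python
import time
from dataclasses import dataclass


@dataclass
class RateDecision:
    allowed: bool
    retry_after: float = 0.0  # seconds until enough tokens will be available


class TokenBucket:
    """Bucket that refills at a steady rate up to a fixed capacity (illustrative)."""

    def __init__(self, capacity: float, refill_per_sec: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self._now = now
        self._last = now()

    def check(self, tokens_requested: float) -> RateDecision:
        # Refill based on elapsed time, capped at capacity.
        current = self._now()
        elapsed = current - self._last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self._last = current

        if tokens_requested <= self.tokens:
            self.tokens -= tokens_requested
            return RateDecision(allowed=True)

        # Denied: report how long the bucket needs to cover the shortfall.
        shortfall = tokens_requested - self.tokens
        return RateDecision(allowed=False, retry_after=shortfall / self.refill_per_sec)


# Usage with a fake clock so the result does not depend on wall time.
clock = [0.0]
bucket = TokenBucket(capacity=1000, refill_per_sec=100, now=lambda: clock[0])

first = bucket.check(800)    # allowed, 200 tokens remain
second = bucket.check(500)   # denied, 300 tokens short, so retry_after is 3.0
clock[0] = 3.0               # three seconds later the bucket holds 500 tokens
third = bucket.check(500)    # allowed
```

This sketch keeps state in one process. A request larger than `capacity` can never succeed, so a real limiter should reject it outright, and a multi-instance gateway needs shared state for the counters.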

Build vs Buy Decision

The gateway space has matured significantly. Here is how the main options compare for production use:

Option	Best For	Pros	Cons
LiteLLM Proxy	Teams wanting open-source with full control	100+ model providers, OpenAI-compatible API, self-hosted, Python	You own ops, scaling, and uptime
Portkey	Teams wanting managed service with analytics	Real-time analytics, guardrails, caching built-in, low latency	SaaS dependency, data leaves your network
AWS Bedrock	AWS-native organizations	IAM integration, VPC endpoints, no API key management	Limited to Bedrock models, AWS lock-in
Custom Gateway	Unique compliance or routing requirements	Total control over every component	6-12 months to build properly, ongoing maintenance

💡

Apply at work: Start with LiteLLM Proxy if you need multi-provider support and want to self-host. It takes about 30 minutes to deploy with Docker and gives you an OpenAI-compatible endpoint that routes to any provider. You can always build custom components on top of it.

Gateway vs Direct API Calls

The most common objection to gateways is "it adds latency." Here is the real comparison:

# Direct API call - what most teams start with
import openai

client = openai.OpenAI(api_key="sk-proj-xxx")  # Key in code/env

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
# No cost tracking, no rate limiting, no audit log, no fallback

# -------------------------------------------------------

# Through AI gateway - same OpenAI SDK, different base URL
import openai

client = openai.OpenAI(
    api_key="gw-team-frontend-abc123",  # Gateway key, not provider key
    base_url="https://ai-gateway.internal.company.com/v1"
)

response = client.chat.completions.create(
    model="gpt-4o",  # Gateway resolves to best available provider
    messages=[{"role": "user", "content": prompt}]
)
# Cost tracked, rate limited, PII filtered, audit logged, cached

The gateway typically adds 5-15ms of latency, which is negligible next to the 500-2000ms an LLM response usually takes. In exchange, teams avoid weeks of building infrastructure code and gain the cost controls that prevent runaway API spend.
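To put those numbers in proportion, the short calculation below works out the gateway overhead as a percentage of the LLM response time, using only the ranges quoted above. The function name `overhead_pct` is hypothetical.

```python
def overhead_pct(gateway_ms: float, llm_ms: float) -> float:
    """Gateway latency as a percentage of the LLM response time."""
    return 100.0 * gateway_ms / llm_ms


best_case = overhead_pct(gateway_ms=5, llm_ms=2000)   # fast gateway, slow model
worst_case = overhead_pct(gateway_ms=15, llm_ms=500)  # slow gateway, fast model

print(f"best case:  {best_case:.2f}%")   # 0.25%
print(f"worst case: {worst_case:.2f}%")  # 3.00%
```

Even in the worst case within these ranges, the gateway accounts for about 3% of the total request time.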

Minimal Gateway in 50 Lines

Here is the simplest possible AI gateway using FastAPI. This is not production-ready, but it shows the core pattern every gateway follows:

from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
from httpx import AsyncClient
import time

app = FastAPI()
http = AsyncClient()

# In production: load from database or config service
API_KEYS = {
    "gw-team-a-key": {"team": "frontend", "provider_key": "sk-..."},
    "gw-team-b-key": {"team": "backend", "provider_key": "sk-..."},
}
request_log = []  # In-memory only; lost on restart

@app.post("/v1/chat/completions")
async def proxy_completion(request: Request):
    # 1. Authenticate: strip only a leading "Bearer " prefix (Python 3.9+)
    gw_key = request.headers.get("Authorization", "").removeprefix("Bearer ")
    config = API_KEYS.get(gw_key)
    if not config:
        raise HTTPException(401, "Invalid gateway key")

    body = await request.json()
    start = time.perf_counter()  # monotonic clock for measuring latency

    # 2. Forward to OpenAI (swap provider_key)
    response = await http.post(
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {config['provider_key']}",
            "Content-Type": "application/json",
        },
        json=body,
        timeout=120.0
    )

    result = response.json()
    latency = time.perf_counter() - start

    # 3. Log everything
    request_log.append({
        "team": config["team"],
        "model": body.get("model"),
        "tokens": result.get("usage", {}),
        "latency_ms": round(latency * 1000),
        "timestamp": time.time(),
    })

    # 4. Return the provider's body with its original status code,
    #    so upstream errors (429, 500, ...) are not reported as 200
    return JSONResponse(content=result, status_code=response.status_code)

📝

Production reality: The minimal gateway above handles the core proxy pattern, but production gateways need persistent storage for logs, Redis for rate limiting, proper error handling with retries, streaming support for SSE responses, and health checks. We will build all of these components in the following lessons.
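As one example of the "proper error handling with retries" mentioned above, the sketch below retries a failing provider with exponential backoff and then falls back to the next provider in the list. `ProviderError`, `complete_with_fallback`, and the two fake providers are hypothetical stand-ins; real provider calls would go through an HTTP client, and the retry counts and delays are assumptions.

```python
import asyncio


class ProviderError(Exception):
    """Raised by a provider call that failed in a retryable way (illustrative)."""


async def complete_with_fallback(providers, body, max_attempts=3,
                                 base_delay=0.5, sleep=asyncio.sleep):
    """Try each provider in order, retrying with exponential backoff before moving on."""
    last_error = None
    for name, call in providers:
        for attempt in range(max_attempts):
            try:
                return name, await call(body)
            except ProviderError as exc:
                last_error = exc
                # Back off 0.5s, 1s, 2s, ... but do not wait after the final attempt.
                if attempt < max_attempts - 1:
                    await sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all providers failed") from last_error


# Fake providers for demonstration only.
async def flaky_primary(body):
    raise ProviderError("primary is down")


async def healthy_secondary(body):
    return {"echo": body["model"]}


async def no_sleep(_seconds):
    return None  # skip real waiting so the example finishes instantly


winner, result = asyncio.run(
    complete_with_fallback(
        [("primary", flaky_primary), ("secondary", healthy_secondary)],
        {"model": "gpt-4o"},
        sleep=no_sleep,
    )
)
print(winner, result)  # secondary {'echo': 'gpt-4o'}
```

Because this logic lives in the gateway, applications get failover without changing their own code.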

Key Takeaways

An AI gateway is a reverse proxy that centralizes authentication, routing, rate limiting, cost tracking, and compliance for all AI API calls.
Without a gateway, organizations lose visibility into costs, scatter API keys across codebases, and duplicate infrastructure logic in every team.
The gateway adds 5-15ms of latency but saves weeks of engineering time and prevents cost overruns.
Start with LiteLLM Proxy for open-source self-hosted, Portkey for managed service, or build custom only when you have unique compliance requirements.
Applications keep using the OpenAI SDK and only swap the base_url and API key; no other code changes are needed.

What Is Next

In the next lesson, we will build the multi-provider routing engine, the brain of the gateway. You will implement load balancing across providers, fallback chains for reliability, cost-based routing to minimize spend, and model capability matching to send requests to the right model.
