Intermediate

Rate Limiting for AI APIs

Implement token-aware rate limiting systems that manage AI API consumption fairly across tenants, prevent cost overruns, and ensure service availability under high demand.

Token-Based Rate Limiting

Unlike traditional APIs where requests are roughly equal in cost, AI API requests vary dramatically based on input and output token counts. Rate limiting must account for this variability:

| Dimension | Metric | Typical Limit |
| --- | --- | --- |
| Tokens Per Minute | Input + output tokens consumed | 100K-1M TPM per tenant |
| Requests Per Minute | Total API calls regardless of size | 60-500 RPM per key |
| Concurrent Requests | Simultaneous in-flight requests | 10-50 per tenant |
| Daily Token Budget | Total daily token consumption | 1M-100M per tenant |

Critical Challenge: Output tokens are unknown at request time. Use a token estimate based on the request's max_tokens parameter for pre-request limiting, then reconcile with actual usage once the response arrives.
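The estimate-then-reconcile flow can be sketched as follows. The class and method names here are illustrative, not a real API; the sketch reserves the pessimistic worst case (input tokens plus the max_tokens cap) and swaps it for actual usage afterwards:

```python
class TokenBudget:
    """Tracks a tenant's per-minute token allowance (illustrative sketch)."""

    def __init__(self, limit_tpm: int):
        self.limit_tpm = limit_tpm
        self.reserved = 0   # tokens held for in-flight requests
        self.consumed = 0   # tokens confirmed by completed responses

    def try_reserve(self, input_tokens: int, max_tokens: int) -> bool:
        """Reserve the worst case: input tokens plus the max_tokens cap."""
        estimate = input_tokens + max_tokens
        if self.consumed + self.reserved + estimate > self.limit_tpm:
            return False
        self.reserved += estimate
        return True

    def reconcile(self, input_tokens: int, max_tokens: int, actual_output: int):
        """Replace the pessimistic estimate with actual usage."""
        self.reserved -= input_tokens + max_tokens
        self.consumed += input_tokens + actual_output


budget = TokenBudget(limit_tpm=10_000)
assert budget.try_reserve(input_tokens=500, max_tokens=2_000)
budget.reconcile(input_tokens=500, max_tokens=2_000, actual_output=300)
print(budget.consumed)  # 800: only actual usage counts against the window
```

Reserving against max_tokens over-counts while requests are in flight, which is the safe direction: a tenant can never exceed its limit, only under-use it briefly.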

Rate Limiting Algorithms

  1. Token Bucket

    Tokens refill at a steady rate. Each AI request consumes tokens proportional to its size. Allows controlled bursting while maintaining average throughput limits.

  2. Sliding Window

    Track token consumption over a rolling time window. Provides smoother limiting than fixed windows and prevents boundary-burst attacks.

  3. Weighted Fair Queuing

    Queue requests from multiple tenants and schedule them based on priority weights and remaining quota, ensuring fair access during contention.

  4. Adaptive Limiting

    Dynamically adjust limits based on backend capacity, error rates, and latency. Tighten limits when the system is stressed and relax when capacity is available.
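A token bucket (algorithm 1) maps naturally onto AI APIs because the bucket's tokens can be literal API tokens. A minimal sketch, with illustrative capacity and refill numbers:

```python
import time


class TokenBucket:
    """Token bucket where bucket tokens are literal API tokens (input + output)."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity            # burst ceiling
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost: int, now=None) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = time.monotonic() if now is None else now
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Roughly 100K TPM: refill ~1,667 tokens/sec, allow bursts up to 100K.
bucket = TokenBucket(capacity=100_000, refill_per_sec=100_000 / 60)
```

The `now` parameter is injectable so the limiter can be tested deterministically without sleeping.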
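A sliding window (algorithm 2) can be kept exact by storing per-request timestamps and token costs, evicting anything older than the window. A sketch with illustrative limits:

```python
import collections


class SlidingWindowLimiter:
    """Exact rolling-window token limiter backed by a deque of (time, cost)."""

    def __init__(self, limit_tokens: int, window_sec: float):
        self.limit = limit_tokens
        self.window = window_sec
        self.events = collections.deque()   # (timestamp, tokens) pairs
        self.total = 0                      # running sum of tokens in the window

    def allow(self, tokens: int, now: float) -> bool:
        # Evict events that have fallen out of the rolling window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.total -= old
        if self.total + tokens > self.limit:
            return False
        self.events.append((now, tokens))
        self.total += tokens
        return True
```

Because the window rolls continuously, a burst just before a minute boundary still counts against the next minute, which is exactly the boundary-burst attack fixed windows allow.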
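Weighted fair queuing (algorithm 3) can be sketched with virtual finish times: each tenant's request finishes at its cost divided by its weight, and the scheduler always serves the earliest finish time. Names and weights here are illustrative:

```python
import heapq
import itertools


class WeightedFairQueue:
    """Serve per-tenant requests by virtual finish time (higher weight = sooner)."""

    def __init__(self):
        self.heap = []
        self.vtime = {}                      # per-tenant virtual clock
        self.counter = itertools.count()     # FIFO tie-breaker

    def enqueue(self, tenant: str, weight: float, cost: int, request):
        start = self.vtime.get(tenant, 0.0)
        finish = start + cost / weight       # heavy costs / light weights finish later
        self.vtime[tenant] = finish
        heapq.heappush(self.heap, (finish, next(self.counter), tenant, request))

    def dequeue(self):
        _, _, tenant, request = heapq.heappop(self.heap)
        return tenant, request
```

Under contention, a tenant with weight 2.0 is served roughly twice as often as one with weight 1.0, while an idle tenant's unused share is redistributed automatically.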
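Adaptive limiting (algorithm 4) is often implemented as multiplicative decrease under stress and slow additive recovery. The thresholds and factors below are illustrative defaults, not recommendations:

```python
def adapt_limit(current_limit: float, base_limit: float,
                error_rate: float, p95_latency_ms: float,
                error_threshold: float = 0.05,
                latency_threshold_ms: float = 2_000,
                min_fraction: float = 0.1) -> float:
    """Tighten the effective limit when the backend is stressed, relax it slowly."""
    floor = base_limit * min_fraction
    if error_rate > error_threshold or p95_latency_ms > latency_threshold_ms:
        return max(floor, current_limit * 0.5)               # back off quickly
    return min(base_limit, current_limit + base_limit * 0.05)  # recover slowly
```

Halving on stress and recovering 5% of base per healthy interval mirrors AIMD congestion control: the limit converges to what the backend can actually sustain.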

Quota Management

Implement hierarchical quota systems for enterprise AI API management:

  • Organization Quotas: Total token budget allocated to an organization across all teams and applications
  • Team Quotas: Sub-allocations within an organization that can be managed by team leads
  • Application Quotas: Per-application limits to prevent any single app from consuming the entire budget
  • User Quotas: Individual developer limits for development and testing environments
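The hierarchy above implies that a charge succeeds only if every level, from the individual quota up to the organization, still has headroom. A sketch with hypothetical names and limits:

```python
class Quota:
    """One node in a quota hierarchy: org -> team -> application -> user."""

    def __init__(self, name: str, limit: int, parent=None):
        self.name, self.limit, self.used, self.parent = name, limit, 0, parent

    def chain(self):
        """Yield this quota and every ancestor up to the organization."""
        node = self
        while node:
            yield node
            node = node.parent

    def try_charge(self, tokens: int) -> bool:
        """Charge all levels atomically, or none if any level would overflow."""
        levels = list(self.chain())
        if any(q.used + tokens > q.limit for q in levels):
            return False
        for q in levels:
            q.used += tokens
        return True


org = Quota("acme", limit=1_000_000)
team = Quota("ml-platform", limit=400_000, parent=org)
app = Quota("chat-backend", limit=100_000, parent=team)
```

Checking all levels before charging any keeps the counters consistent: a request rejected at the application level leaves the team and organization totals untouched.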

Implementation Patterns

Redis-Based Limiting

Use Redis sorted sets and Lua scripts for distributed, atomic rate limiting with sub-millisecond performance.
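One way this might look: a Lua script that evicts, sums, and records in a single atomic step, held in a sorted set keyed per tenant. The key layout, member encoding, and client wiring below are illustrative assumptions, and actually running it requires a Redis server and the redis-py package:

```python
# Sketch only: sliding-window token limiting as one atomic Redis Lua script.
SLIDING_WINDOW_LUA = """
local key    = KEYS[1]             -- e.g. ratelimit:{tenant} (assumed layout)
local now_ms = tonumber(ARGV[1])
local window = tonumber(ARGV[2])   -- window length in ms
local limit  = tonumber(ARGV[3])   -- max tokens per window
local cost   = tonumber(ARGV[4])   -- tokens this request will consume

-- Drop events outside the window, then sum the costs that remain.
redis.call('ZREMRANGEBYSCORE', key, 0, now_ms - window)
local used = 0
for _, member in ipairs(redis.call('ZRANGE', key, 0, -1)) do
    used = used + tonumber(string.match(member, '^(%d+):'))
end
if used + cost > limit then
    return 0                       -- rejected
end
-- Member encodes the cost plus a timestamp suffix; score is the timestamp.
redis.call('ZADD', key, now_ms, cost .. ':' .. now_ms)
redis.call('PEXPIRE', key, window)
return 1                           -- allowed
"""

# With redis-py the script would be registered once and invoked atomically:
#   limiter = redis.Redis().register_script(SLIDING_WINDOW_LUA)
#   allowed = limiter(keys=["ratelimit:acme"], args=[now_ms, 60_000, 100_000, cost])
```

Because Redis executes a Lua script as a single atomic operation, concurrent API gateways cannot race between the eviction, the sum, and the write.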

Budget Alerts

Proactive notifications at 50%, 75%, and 90% quota consumption with automatic throttling at 100%.
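The threshold logic is small enough to sketch directly; tracking which alerts were already sent prevents re-firing on every request:

```python
ALERT_THRESHOLDS = (0.50, 0.75, 0.90)


def check_budget(used: int, quota: int, already_sent: set):
    """Return (alerts to fire now, whether to throttle at 100%)."""
    fraction = used / quota
    alerts = [t for t in ALERT_THRESHOLDS
              if fraction >= t and t not in already_sent]
    already_sent.update(alerts)
    return alerts, fraction >= 1.0
```

A tenant that jumps straight from 40% to 80% consumption receives both the 50% and 75% alerts in one pass, so no threshold is ever silently skipped.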

Graceful Degradation

When limits are reached, route requests to smaller, cheaper models instead of returning errors; a degraded response is a better user experience than none.
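The fallback routing might be sketched as a chain walked until a tier with remaining quota is found. The model names here are placeholders, not real model identifiers:

```python
# Illustrative tier names; substitute whatever models your provider offers.
FALLBACK_CHAIN = ["large-model", "medium-model", "small-model"]


def route_request(tokens_needed: int, remaining_quota: dict):
    """Pick the first model in the chain whose per-model quota still fits."""
    for model in FALLBACK_CHAIN:
        if remaining_quota.get(model, 0) >= tokens_needed:
            return model
    return None  # every tier exhausted: queue the request or return 429
```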

Priority Queues

Implement priority lanes for production versus development traffic, ensuring critical workloads are never starved.
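A minimal priority-lane sketch using a heap; lane names and priority values are illustrative. Strict priority guarantees production traffic is never starved, at the cost that development traffic waits whenever production is busy:

```python
import heapq
import itertools

PRIORITY = {"production": 0, "development": 1}  # lower number is served first


class PriorityLane:
    """Strict-priority queue with FIFO ordering inside each lane."""

    def __init__(self):
        self.heap = []
        self.counter = itertools.count()    # preserves arrival order within a lane

    def submit(self, lane: str, request):
        heapq.heappush(self.heap, (PRIORITY[lane], next(self.counter), request))

    def next_request(self):
        _, _, request = heapq.heappop(self.heap)
        return request
```

If development traffic must also make progress under sustained production load, the weighted fair queuing scheduler from earlier in this lesson is the usual alternative.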

💡 Next Up: In the next lesson, we will explore monetization strategies for AI APIs, including pricing models and billing integration.