Intermediate

Rate Limiting for AI APIs

Implement token-aware rate limiting systems that manage AI API consumption fairly across tenants, prevent cost overruns, and ensure service availability under high demand.

Token-Based Rate Limiting

Unlike traditional APIs where requests are roughly equal in cost, AI API requests vary dramatically based on input and output token counts. Rate limiting must account for this variability:

| Dimension | Metric | Typical Limit |
| --- | --- | --- |
| Tokens Per Minute | Input + output tokens consumed | 100K-1M TPM per tenant |
| Requests Per Minute | Total API calls regardless of size | 60-500 RPM per key |
| Concurrent Requests | Simultaneous in-flight requests | 10-50 per tenant |
| Daily Token Budget | Total daily token consumption | 1M-100M per tenant |

Critical Challenge: Output tokens are unknown at request time. Use a token estimate based on the request's max_tokens parameter for pre-request limiting, then reconcile with actual usage once the response arrives.
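The estimate-then-reconcile flow can be sketched as follows. The class and method names here are illustrative, not a real API; the sketch reserves the pessimistic worst case (input tokens plus the max_tokens cap) and swaps it for actual usage afterwards:

```python
class TokenBudget:
    """Tracks a tenant's per-minute token allowance (illustrative sketch)."""

    def __init__(self, limit_tpm: int):
        self.limit_tpm = limit_tpm
        self.reserved = 0   # tokens held for in-flight requests
        self.consumed = 0   # tokens confirmed by completed responses

    def try_reserve(self, input_tokens: int, max_tokens: int) -> bool:
        """Reserve the worst case: input tokens plus the max_tokens cap."""
        estimate = input_tokens + max_tokens
        if self.consumed + self.reserved + estimate > self.limit_tpm:
            return False
        self.reserved += estimate
        return True

    def reconcile(self, input_tokens: int, max_tokens: int, actual_output: int):
        """Replace the pessimistic estimate with actual usage."""
        self.reserved -= input_tokens + max_tokens
        self.consumed += input_tokens + actual_output


budget = TokenBudget(limit_tpm=10_000)
assert budget.try_reserve(input_tokens=500, max_tokens=2_000)
budget.reconcile(input_tokens=500, max_tokens=2_000, actual_output=300)
print(budget.consumed)  # 800: only actual usage counts against the window
```

Reserving against max_tokens over-counts while requests are in flight, which is the safe direction: a tenant can never exceed its limit, only under-use it briefly.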

Rate Limiting Algorithms

  1. Token Bucket

    Tokens refill at a steady rate. Each AI request consumes tokens proportional to its size. Allows controlled bursting while maintaining average throughput limits.

  2. Sliding Window

    Track token consumption over a rolling time window. Provides smoother limiting than fixed windows and prevents boundary-burst attacks.

  3. Weighted Fair Queuing

    Queue requests from multiple tenants and schedule them based on priority weights and remaining quota, ensuring fair access during contention.

  4. Adaptive Limiting

    Dynamically adjust limits based on backend capacity, error rates, and latency. Tighten limits when the system is stressed and relax when capacity is available.
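A token bucket (algorithm 1) maps naturally onto AI APIs because the bucket's tokens can be literal API tokens. A minimal sketch, with illustrative capacity and refill numbers:

```python
import time


class TokenBucket:
    """Token bucket where bucket tokens are literal API tokens (input + output)."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity            # burst ceiling
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost: int, now=None) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = time.monotonic() if now is None else now
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Roughly 100K TPM: refill ~1,667 tokens/sec, allow bursts up to 100K.
bucket = TokenBucket(capacity=100_000, refill_per_sec=100_000 / 60)
```

The `now` parameter is injectable so the limiter can be tested deterministically without sleeping.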
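A sliding window (algorithm 2) can be kept exact by storing per-request timestamps and token costs, evicting anything older than the window. A sketch with illustrative limits:

```python
import collections


class SlidingWindowLimiter:
    """Exact rolling-window token limiter backed by a deque of (time, cost)."""

    def __init__(self, limit_tokens: int, window_sec: float):
        self.limit = limit_tokens
        self.window = window_sec
        self.events = collections.deque()   # (timestamp, tokens) pairs
        self.total = 0                      # running sum of tokens in the window

    def allow(self, tokens: int, now: float) -> bool:
        # Evict events that have fallen out of the rolling window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.total -= old
        if self.total + tokens > self.limit:
            return False
        self.events.append((now, tokens))
        self.total += tokens
        return True
```

Because the window rolls continuously, a burst just before a minute boundary still counts against the next minute, which is exactly the boundary-burst attack fixed windows allow.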
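Weighted fair queuing (algorithm 3) can be sketched with virtual finish times: each tenant's request finishes at its cost divided by its weight, and the scheduler always serves the earliest finish time. Names and weights here are illustrative:

```python
import heapq
import itertools


class WeightedFairQueue:
    """Serve per-tenant requests by virtual finish time (higher weight = sooner)."""

    def __init__(self):
        self.heap = []
        self.vtime = {}                      # per-tenant virtual clock
        self.counter = itertools.count()     # FIFO tie-breaker

    def enqueue(self, tenant: str, weight: float, cost: int, request):
        start = self.vtime.get(tenant, 0.0)
        finish = start + cost / weight       # heavy costs / light weights finish later
        self.vtime[tenant] = finish
        heapq.heappush(self.heap, (finish, next(self.counter), tenant, request))

    def dequeue(self):
        _, _, tenant, request = heapq.heappop(self.heap)
        return tenant, request
```

Under contention, a tenant with weight 2.0 is served roughly twice as often as one with weight 1.0, while an idle tenant's unused share is redistributed automatically.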
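Adaptive limiting (algorithm 4) is often implemented as multiplicative decrease under stress and slow additive recovery. The thresholds and factors below are illustrative defaults, not recommendations:

```python
def adapt_limit(current_limit: float, base_limit: float,
                error_rate: float, p95_latency_ms: float,
                error_threshold: float = 0.05,
                latency_threshold_ms: float = 2_000,
                min_fraction: float = 0.1) -> float:
    """Tighten the effective limit when the backend is stressed, relax it slowly."""
    floor = base_limit * min_fraction
    if error_rate > error_threshold or p95_latency_ms > latency_threshold_ms:
        return max(floor, current_limit * 0.5)               # back off quickly
    return min(base_limit, current_limit + base_limit * 0.05)  # recover slowly
```

Halving on stress and recovering 5% of base per healthy interval mirrors AIMD congestion control: the limit converges to what the backend can actually sustain.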

Quota Management

Implement hierarchical quota systems for enterprise AI API management:

  • Organization Quotas: Total token budget allocated to an organization across all teams and applications
  • Team Quotas: Sub-allocations within an organization that can be managed by team leads
  • Application Quotas: Per-application limits to prevent any single app from consuming the entire budget
  • User Quotas: Individual developer limits for development and testing environments
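The hierarchy above implies that a charge succeeds only if every level, from the individual quota up to the organization, still has headroom. A sketch with hypothetical names and limits:

```python
class Quota:
    """One node in a quota hierarchy: org -> team -> application -> user."""

    def __init__(self, name: str, limit: int, parent=None):
        self.name, self.limit, self.used, self.parent = name, limit, 0, parent

    def chain(self):
        """Yield this quota and every ancestor up to the organization."""
        node = self
        while node:
            yield node
            node = node.parent

    def try_charge(self, tokens: int) -> bool:
        """Charge all levels atomically, or none if any level would overflow."""
        levels = list(self.chain())
        if any(q.used + tokens > q.limit for q in levels):
            return False
        for q in levels:
            q.used += tokens
        return True


org = Quota("acme", limit=1_000_000)
team = Quota("ml-platform", limit=400_000, parent=org)
app = Quota("chat-backend", limit=100_000, parent=team)
```

Checking all levels before charging any keeps the counters consistent: a request rejected at the application level leaves the team and organization totals untouched.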

Implementation Patterns

Redis-Based Limiting

Use Redis sorted sets and Lua scripts for distributed, atomic rate limiting with sub-millisecond performance.
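One way this might look: a Lua script that evicts, sums, and records in a single atomic step, held in a sorted set keyed per tenant. The key layout, member encoding, and client wiring below are illustrative assumptions, and actually running it requires a Redis server and the redis-py package:

```python
# Sketch only: sliding-window token limiting as one atomic Redis Lua script.
SLIDING_WINDOW_LUA = """
local key    = KEYS[1]             -- e.g. ratelimit:{tenant} (assumed layout)
local now_ms = tonumber(ARGV[1])
local window = tonumber(ARGV[2])   -- window length in ms
local limit  = tonumber(ARGV[3])   -- max tokens per window
local cost   = tonumber(ARGV[4])   -- tokens this request will consume

-- Drop events outside the window, then sum the costs that remain.
redis.call('ZREMRANGEBYSCORE', key, 0, now_ms - window)
local used = 0
for _, member in ipairs(redis.call('ZRANGE', key, 0, -1)) do
    used = used + tonumber(string.match(member, '^(%d+):'))
end
if used + cost > limit then
    return 0                       -- rejected
end
-- Member encodes the cost plus a timestamp suffix; score is the timestamp.
redis.call('ZADD', key, now_ms, cost .. ':' .. now_ms)
redis.call('PEXPIRE', key, window)
return 1                           -- allowed
"""

# With redis-py the script would be registered once and invoked atomically:
#   limiter = redis.Redis().register_script(SLIDING_WINDOW_LUA)
#   allowed = limiter(keys=["ratelimit:acme"], args=[now_ms, 60_000, 100_000, cost])
```

Because Redis executes a Lua script as a single atomic operation, concurrent API gateways cannot race between the eviction, the sum, and the write.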

Budget Alerts

Proactive notifications at 50%, 75%, and 90% quota consumption with automatic throttling at 100%.
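The threshold logic is small enough to sketch directly; tracking which alerts were already sent prevents re-firing on every request:

```python
ALERT_THRESHOLDS = (0.50, 0.75, 0.90)


def check_budget(used: int, quota: int, already_sent: set):
    """Return (alerts to fire now, whether to throttle at 100%)."""
    fraction = used / quota
    alerts = [t for t in ALERT_THRESHOLDS
              if fraction >= t and t not in already_sent]
    already_sent.update(alerts)
    return alerts, fraction >= 1.0
```

A tenant that jumps straight from 40% to 80% consumption receives both the 50% and 75% alerts in one pass, so no threshold is ever silently skipped.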

Graceful Degradation

When limits are reached, route requests to smaller, cheaper models instead of returning errors; a degraded response is a better user experience than none.
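The fallback routing might be sketched as a chain walked until a tier with remaining quota is found. The model names here are placeholders, not real model identifiers:

```python
# Illustrative tier names; substitute whatever models your provider offers.
FALLBACK_CHAIN = ["large-model", "medium-model", "small-model"]


def route_request(tokens_needed: int, remaining_quota: dict):
    """Pick the first model in the chain whose per-model quota still fits."""
    for model in FALLBACK_CHAIN:
        if remaining_quota.get(model, 0) >= tokens_needed:
            return model
    return None  # every tier exhausted: queue the request or return 429
```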

Priority Queues

Implement priority lanes for production versus development traffic, ensuring critical workloads are never starved.
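A minimal priority-lane sketch using a heap; lane names and priority values are illustrative. Strict priority guarantees production traffic is never starved, at the cost that development traffic waits whenever production is busy:

```python
import heapq
import itertools

PRIORITY = {"production": 0, "development": 1}  # lower number is served first


class PriorityLane:
    """Strict-priority queue with FIFO ordering inside each lane."""

    def __init__(self):
        self.heap = []
        self.counter = itertools.count()    # preserves arrival order within a lane

    def submit(self, lane: str, request):
        heapq.heappush(self.heap, (PRIORITY[lane], next(self.counter), request))

    def next_request(self):
        _, _, request = heapq.heappop(self.heap)
        return request
```

If development traffic must also make progress under sustained production load, the weighted fair queuing scheduler from earlier in this lesson is the usual alternative.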

💡 Next Up: In the next lesson, we will explore monetization strategies for AI APIs, including pricing models and billing integration.