Rate Limiting for AI APIs
Implement token-aware rate limiting systems that manage AI API consumption fairly across tenants, prevent cost overruns, and ensure service availability under high demand.
Token-Based Rate Limiting
Unlike traditional APIs, where requests cost roughly the same to serve, AI API requests vary dramatically in cost with their input and output token counts. Rate limiting must account for this variability:
| Dimension | Metric | Typical Limit |
|---|---|---|
| Tokens Per Minute | Input + output tokens consumed | 100K-1M TPM per tenant |
| Requests Per Minute | Total API calls regardless of size | 60-500 RPM per key |
| Concurrent Requests | Simultaneous in-flight requests | 10-50 per tenant |
| Daily Token Budget | Total daily token consumption | 1M-100M per tenant |
Rate Limiting Algorithms
Token Bucket
The bucket refills at a steady rate, and each AI request drains tokens proportional to its size. This allows controlled bursts while capping average throughput.
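A minimal in-memory sketch of a token-aware bucket. The class and parameter names are illustrative; a production limiter would persist state externally (e.g. in Redis) rather than per process.

```python
import time

class TokenBucket:
    """Token bucket where each AI request drains tokens equal to its
    estimated prompt + completion token count."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec      # refill rate, tokens/second
        self.capacity = capacity      # burst ceiling
        self.level = capacity         # start full
        self.last = time.monotonic()

    def allow(self, request_tokens: int) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now
        if request_tokens <= self.level:
            self.level -= request_tokens
            return True
        return False

# 100K TPM ≈ 1,667 tokens/second, with a 10K-token burst allowance.
bucket = TokenBucket(rate_per_sec=100_000 / 60, capacity=10_000)
```

Sizing `capacity` above the per-request maximum lets occasional large prompts through while the refill rate enforces the long-run TPM limit.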
Sliding Window
Track token consumption over a rolling time window. Provides smoother limiting than fixed windows and prevents boundary-burst attacks.
Weighted Fair Queuing
Queue requests from multiple tenants and schedule them based on priority weights and remaining quota, ensuring fair access during contention.
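A simplified sketch of the idea, assuming per-tenant weights: each submitted request is stamped with a virtual finish time (tokens served so far divided by weight), and the scheduler always dispatches the smallest stamp. Real WFQ implementations are more involved; this only illustrates the weighting.

```python
import heapq

class WeightedFairQueue:
    """Approximate weighted fair queuing across tenants: higher-weight
    tenants accumulate virtual time more slowly, so they are served more."""

    def __init__(self, weights: dict[str, float]):
        self.weights = weights
        self.served = {t: 0.0 for t in weights}   # tokens served per tenant
        self.heap = []                            # (vtime, seq, tenant, tokens, payload)
        self.seq = 0                              # FIFO tiebreaker

    def submit(self, tenant: str, request_tokens: int, payload) -> None:
        # Virtual finish time: cumulative tokens scaled down by weight.
        vtime = (self.served[tenant] + request_tokens) / self.weights[tenant]
        heapq.heappush(self.heap, (vtime, self.seq, tenant, request_tokens, payload))
        self.seq += 1

    def dispatch(self):
        _, _, tenant, tokens, payload = heapq.heappop(self.heap)
        self.served[tenant] += tokens
        return tenant, payload
```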
Adaptive Limiting
Dynamically adjust limits based on backend capacity, error rates, and latency. Tighten limits when the system is stressed and relax when capacity is available.
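This feedback loop can be sketched as AIMD-style adjustment: multiplicative decrease when the backend looks stressed, additive increase when healthy. The thresholds, the halving factor, and the 5% recovery step are illustrative, not prescriptive.

```python
class AdaptiveLimiter:
    """Adjusts a tenant's TPM limit from observed backend health."""

    def __init__(self, base_tpm: int, floor_tpm: int):
        self.limit = base_tpm   # current effective limit
        self.base = base_tpm    # configured ceiling
        self.floor = floor_tpm  # never throttle below this

    def record(self, error_rate: float, p99_latency_ms: float) -> None:
        if error_rate > 0.05 or p99_latency_ms > 2_000:
            # Backend stressed: halve the limit, but keep a minimum floor.
            self.limit = max(self.floor, self.limit // 2)
        else:
            # Healthy: creep back toward the configured baseline.
            self.limit = min(self.base, int(self.limit * 1.05))
```

The asymmetry (fast decrease, slow increase) keeps the limiter from oscillating when the backend is near capacity.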
Quota Management
Implement hierarchical quota systems for enterprise AI API management:
- Organization Quotas: Total token budget allocated to an organization across all teams and applications
- Team Quotas: Sub-allocations within an organization that can be managed by team leads
- Application Quotas: Per-application limits to prevent any single app from consuming the entire budget
- User Quotas: Individual developer limits for development and testing environments
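The hierarchy above can be modeled as linked quota nodes, where a request is admitted only if every ancestor still has budget. A minimal sketch (class and field names are assumptions):

```python
class QuotaNode:
    """One level in the quota hierarchy (org -> team -> app -> user)."""

    def __init__(self, name: str, budget_tokens: int, parent=None):
        self.name = name
        self.budget = budget_tokens
        self.used = 0
        self.parent = parent

    def try_consume(self, tokens: int) -> bool:
        # First pass: verify headroom at every level before committing,
        # so a rejection leaves no partial charges.
        node = self
        while node:
            if node.used + tokens > node.budget:
                return False
            node = node.parent
        # Second pass: commit the charge up the chain.
        node = self
        while node:
            node.used += tokens
            node = node.parent
        return True
```

With this shape, a team can overcommit its apps (sub-budgets summing past the team budget) while the parent check still enforces the real ceiling.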
Implementation Patterns
Redis-Based Limiting
Use Redis sorted sets and Lua scripts for distributed, atomic rate limiting with sub-millisecond performance.
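A sketch of what such a script might look like, assuming a redis-py client: the Lua body trims expired sorted-set entries, sums the token weights of survivors, and admits-and-records in one atomic `EVAL`. The member encoding (`"<id>:<tokens>"`) and key naming are illustrative choices, not a standard.

```python
import time
import uuid

SLIDING_WINDOW_LUA = """
-- KEYS[1] = tenant key; ARGV = tokens, limit, window_sec, now, unique_id
local cutoff = tonumber(ARGV[4]) - tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', KEYS[1], '-inf', cutoff)
local total = 0
for _, member in ipairs(redis.call('ZRANGE', KEYS[1], 0, -1)) do
  -- member is encoded as "<unique_id>:<tokens>"
  total = total + tonumber(string.match(member, ':(%d+)$'))
end
if total + tonumber(ARGV[1]) > tonumber(ARGV[2]) then
  return 0
end
redis.call('ZADD', KEYS[1], ARGV[4], ARGV[5] .. ':' .. ARGV[1])
redis.call('EXPIRE', KEYS[1], math.ceil(tonumber(ARGV[3])))
return 1
"""

def allow(redis_client, tenant: str, tokens: int, limit: int, window: int = 60) -> bool:
    """Atomic check-and-record; requires a live Redis connection."""
    return redis_client.eval(
        SLIDING_WINDOW_LUA, 1, f"tpm:{tenant}",
        tokens, limit, window, time.time(), uuid.uuid4().hex,
    ) == 1
```

Running the whole check inside Lua avoids the read-modify-write race that a naive `ZRANGE`-then-`ZADD` sequence would have across multiple gateway instances.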
Budget Alerts
Proactive notifications at 50%, 75%, and 90% quota consumption with automatic throttling at 100%.
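The threshold scheme maps naturally to a small pure function; the return values here are illustrative labels a notification service might consume.

```python
ALERT_THRESHOLDS = (0.50, 0.75, 0.90)   # notify at these; throttle at 1.0

def quota_action(used_tokens: int, budget_tokens: int) -> str:
    """Map a tenant's consumption ratio to an action, mirroring the
    50% / 75% / 90% alert and 100% throttle scheme above."""
    ratio = used_tokens / budget_tokens
    if ratio >= 1.0:
        return "throttle"
    # Report the highest threshold crossed so far, if any.
    crossed = [t for t in ALERT_THRESHOLDS if ratio >= t]
    return f"alert:{int(crossed[-1] * 100)}%" if crossed else "ok"
```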
Graceful Degradation
When limits are reached, route requests to smaller, cheaper models instead of returning errors, preserving a usable (if degraded) experience.
Priority Queues
Implement priority lanes for production versus development traffic, ensuring critical workloads are never starved.
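A minimal two-lane sketch: production entries always dequeue before development ones, with a counter preserving FIFO order within each lane. Note that strict priority like this can starve the low lane under sustained production load; real systems typically add aging or a guaranteed minimum share for the lower tier.

```python
import heapq
import itertools

PRIORITY = {"production": 0, "development": 1}  # lower value dispatches first

class PriorityLane:
    """Strict two-lane priority queue for AI API requests."""

    def __init__(self):
        self.heap = []
        self.counter = itertools.count()   # FIFO tiebreaker within a lane

    def put(self, lane: str, request) -> None:
        heapq.heappush(self.heap, (PRIORITY[lane], next(self.counter), request))

    def get(self):
        return heapq.heappop(self.heap)[2]
```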
Lilly Tech Systems