API Best Practices for AI Products

Production-grade patterns for rate limiting, authentication, monitoring, versioning, and scaling AI APIs.

Rate Limiting for AI APIs

AI APIs are expensive to serve. Rate limiting protects your infrastructure and ensures fair usage:

  • Requests per minute (RPM): Limit the number of API calls per time window.
  • Tokens per minute (TPM): Limit total tokens processed, accounting for variable request sizes.
  • Concurrent requests: Limit simultaneous in-flight requests to protect GPU resources.
  • Tiered limits: Offer different rate limits for free, pro, and enterprise tiers.
# Rate limit headers to include in every response
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1700000060
X-RateLimit-Limit-Tokens: 100000
X-RateLimit-Remaining-Tokens: 85000
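The RPM and TPM limits above can both be enforced with a token bucket per user and per limit type. A minimal in-memory sketch in Python (class and method names are illustrative; a production service would typically back this with Redis so limits hold across servers):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: `capacity` units refill evenly over `window` seconds."""

    def __init__(self, capacity: int, window: float = 60.0):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_rate = capacity / window  # units per second
        self.updated = time.monotonic()

    def allow(self, cost: int = 1) -> bool:
        """Charge `cost` units: 1 per request for RPM, the token count for TPM."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    @property
    def remaining(self) -> int:
        """Value to report in the X-RateLimit-Remaining headers."""
        return int(self.tokens)

# One bucket per (user, limit type); 60 RPM and 100k TPM mirror the headers above.
rpm = TokenBucket(capacity=60)
tpm = TokenBucket(capacity=100_000)
allowed = rpm.allow() and tpm.allow(cost=1_500)  # a 1,500-token request
```

Charging the TPM bucket by actual token count is what makes the limit fair across small and large requests.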

Authentication and Security

Method    | Use Case                   | Pros
API Keys  | Server-to-server calls     | Simple, widely understood
OAuth 2.0 | User-delegated access      | Granular scopes, token rotation
JWT       | Stateless authentication   | No server-side session storage
mTLS      | High-security environments | Certificate-based, mutual verification
Never expose API keys in client-side code. AI API keys give access to expensive resources. Always proxy requests through your backend server. Implement key rotation, usage monitoring, and anomaly detection to catch compromised keys quickly.
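On the server side, storing only a digest of each key limits the damage of a database leak, and constant-time comparison avoids leaking digest prefixes through timing. A minimal sketch (`issue_key` and `verify_key` are illustrative names, not a specific library's API):

```python
import hashlib
import hmac
import secrets

def issue_key() -> tuple[str, str]:
    """Return (raw key for the client, SHA-256 digest for server-side storage)."""
    raw = "sk-" + secrets.token_urlsafe(32)
    return raw, hashlib.sha256(raw.encode()).hexdigest()

def verify_key(presented: str, stored_digest: str) -> bool:
    """Hash the presented key and compare in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_digest)
```

The raw key is shown to the user once at creation; afterwards only the digest exists server-side, so a compromised database cannot be replayed against the API.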

Monitoring AI APIs

Latency Metrics

Track P50, P95, and P99 latency. Monitor time-to-first-token (TTFT) separately from total response time for streaming endpoints.
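For streaming endpoints, TTFT can be captured by wrapping the response generator. A sketch, assuming a hypothetical `record_metric` hook in place of a real metrics client (StatsD, Prometheus, etc.):

```python
import time
from typing import Iterable, Iterator

metrics: dict[str, float] = {}

def record_metric(name: str, value: float) -> None:
    """Stand-in for a real metrics backend."""
    metrics[name] = value

def timed_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Pass chunks through, recording TTFT separately from total response time."""
    start = time.monotonic()
    first = True
    for chunk in chunks:
        if first:
            record_metric("ttft_seconds", time.monotonic() - start)
            first = False
        yield chunk
    record_metric("total_seconds", time.monotonic() - start)
```

Because the wrapper is transparent to callers, it can sit in middleware and time every streaming response without touching handler code.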

Error Rates

Track error rates by type (4xx vs 5xx), by model, and by endpoint. Alert on sudden spikes that indicate model or infrastructure issues.

Token Usage

Monitor tokens per request, tokens per user, and total daily consumption. Track cost per request for capacity planning.
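Cost per request follows directly from token counts once per-token rates are known. A sketch with illustrative prices (substitute your actual input and output rates):

```python
# Illustrative per-token rates; input tokens are typically cheaper than output.
INPUT_RATE = 0.01 / 1000    # dollars per input token
OUTPUT_RATE = 0.03 / 1000   # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request under separate input/output rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

cost = request_cost(1200, 400)  # ≈ $0.024
```

Logging this per request, then aggregating per user and per day, gives the consumption views described above.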

Model Quality

Log input/output samples for quality monitoring. Track user feedback, regeneration rates, and content filter triggers.

Scaling Strategies

  • Request queuing: Queue requests when GPU capacity is full rather than rejecting them. Return estimated wait times.
  • Auto-scaling: Scale GPU instances based on queue depth and latency metrics, not just CPU/memory.
  • Caching: Cache identical requests (same prompt + parameters) to avoid redundant inference.
  • Batch inference: Combine multiple requests into GPU batches for higher throughput at the cost of slightly higher latency.
  • Model optimization: Use quantization, distillation, and speculative decoding to reduce per-request compute costs.
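The caching strategy above hinges on a deterministic key over prompt plus sampling parameters, and on caching only requests whose output is itself deterministic. A sketch (the `generate` callable stands in for your inference backend):

```python
import hashlib
import json
from typing import Any, Callable

_cache: dict[str, str] = {}

def cache_key(prompt: str, params: dict[str, Any]) -> str:
    """Deterministic key: identical prompt + parameters always hash the same."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_generate(prompt: str, params: dict[str, Any],
                    generate: Callable[..., str]) -> str:
    """Serve repeats from cache, but only for temperature-0 (deterministic) requests."""
    if params.get("temperature", 1.0) != 0:
        return generate(prompt, **params)
    key = cache_key(prompt, params)
    if key not in _cache:
        _cache[key] = generate(prompt, **params)
    return _cache[key]
```

Skipping the cache for sampled (temperature > 0) requests matters: users expect those outputs to vary, so serving a cached response would silently change behavior.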

Frequently Asked Questions

How should we price an AI API?

The industry standard is per-token pricing with different rates for input and output tokens. Offer a free tier with low limits for experimentation, a pay-as-you-go tier for growing usage, and enterprise contracts for high-volume customers. Track your GPU cost per token to set profitable margins.

How should we handle API and model versioning?

Separate API versioning (endpoint format) from model versioning (AI behavior). Use URL path versioning for API changes (/v1/, /v2/) and model parameters for model versions. Provide model aliases ("gpt-4-latest") that auto-update and pinned versions ("gpt-4-0125") for stability. Give deprecation notices at least 3 months before sunsetting model versions.
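Alias resolution can be a simple server-side lookup. A sketch with hypothetical model names mirroring the examples above:

```python
# Hypothetical alias table; updating one entry repoints every "latest" caller.
MODEL_ALIASES = {
    "gpt-4-latest": "gpt-4-0125",
}
PINNED_MODELS = {"gpt-4-0125", "gpt-4-1106"}

def resolve_model(requested: str) -> str:
    """Map an alias to its current pinned version; pass pinned names through."""
    if requested in MODEL_ALIASES:
        return MODEL_ALIASES[requested]
    if requested in PINNED_MODELS:
        return requested
    raise ValueError(f"unknown model: {requested}")
```

Logging the resolved (pinned) name rather than the alias keeps monitoring and billing unambiguous when the alias is later repointed.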

What SLAs should we offer?

Start with 99.9% uptime (about 8.76 hours of downtime/year) for production tiers. Define latency SLAs separately (e.g., P99 latency under 5 seconds for standard models). Exclude planned maintenance windows. Offer SLA credits for violations. Be transparent about GPU capacity constraints during peak usage.
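The downtime figure quoted for 99.9% falls out of simple arithmetic on the uptime percentage. A quick sketch:

```python
def downtime_budget_hours(uptime_pct: float, days: float = 365.0) -> float:
    """Allowed downtime per period for a given uptime percentage."""
    return days * 24 * (1 - uptime_pct / 100)

downtime_budget_hours(99.9)   # ≈ 8.76 hours/year
downtime_budget_hours(99.99)  # ≈ 0.88 hours/year ("four nines")
```

Each extra nine cuts the budget tenfold, which is why 99.99% commitments usually require multi-region GPU capacity.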

How do we prevent abuse?

Layer multiple defenses: rate limiting, content filtering, anomaly detection on usage patterns, IP reputation checking, and request fingerprinting. Monitor for prompt injection attacks and automated scraping. Implement usage alerts that notify users of unusual spikes. Require payment verification for higher-tier access.