REST API Design for AI Products
Design intuitive RESTful APIs for serving ML models with proper endpoint structure, request/response schemas, versioning, and error handling.
Endpoint Design for AI
AI APIs typically follow a resource-action pattern rather than pure CRUD:
# Prediction endpoints
POST /v1/models/{model_id}/predict
POST /v1/chat/completions
POST /v1/embeddings
POST /v1/images/generations
# Async/batch endpoints
POST /v1/batches
GET /v1/batches/{batch_id}
GET /v1/batches/{batch_id}/results
# Model management
GET /v1/models
GET /v1/models/{model_id}
POST /v1/fine-tuning/jobs
GET /v1/fine-tuning/jobs/{job_id}
Use POST for predictions: Even though predictions are "read-like" operations, use POST because request bodies (prompts, images, parameters) are too complex for query strings. This follows the pattern established by OpenAI, Anthropic, and other major AI API providers.
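To make the body-based POST concrete, a tiny client-side helper can assemble the request (the base URL and model ID here are hypothetical, and nothing is actually sent over the network):

```python
import json

BASE_URL = "https://api.example.com"  # hypothetical host

def build_predict_request(model_id: str, payload: dict) -> tuple[str, bytes]:
    """Build the URL and JSON body for POST /v1/models/{model_id}/predict.

    Prompts, images, and parameters ride in the request body,
    which a query string could not carry.
    """
    url = f"{BASE_URL}/v1/models/{model_id}/predict"
    body = json.dumps(payload).encode("utf-8")
    return url, body

url, body = build_predict_request("sentiment-v2", {"text": "Great product!"})
```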
Request Schema Design
A well-designed AI API request includes the input data and model control parameters:
# Request body for a chat completion API
{
  "model": "gpt-4",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing."}
  ],
  "temperature": 0.7,
  "max_tokens": 1024,
  "top_p": 1.0,
  "stream": false,
  "response_format": {"type": "json_object"}
}
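Server-side, these parameters are worth validating before any inference runs. A rough sketch (the ranges mirror the fields above and common provider limits, but are assumptions, not a standard):

```python
def validate_chat_request(req: dict) -> list[str]:
    """Return a list of human-readable validation errors (empty list = valid)."""
    errors = []
    if not req.get("model"):
        errors.append("model is required")
    if not req.get("messages"):
        errors.append("messages must be a non-empty list")
    temperature = req.get("temperature", 1.0)
    if not (0.0 <= temperature <= 2.0):
        errors.append("temperature must be between 0.0 and 2.0")
    max_tokens = req.get("max_tokens", 1024)
    if max_tokens < 1:
        errors.append("max_tokens must be a positive integer")
    return errors
```

Returning all errors at once, rather than failing on the first, saves clients a round trip per mistake.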
Response Schema Design
Include the prediction result, metadata, and usage information:
# Response body
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "gpt-4-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses quantum bits..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
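A sketch of assembling that envelope server-side (field names follow the response above; `uuid` and `time` stand in for whatever ID and clock source you use, and `total_tokens` is always derived rather than trusted from input):

```python
import time
import uuid

def build_chat_response(model: str, content: str,
                        prompt_tokens: int, completion_tokens: int) -> dict:
    """Wrap a completion in the response envelope shown above."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```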
Error Handling for AI APIs
AI APIs need richer error responses than traditional APIs:
| Status Code | Meaning | AI-Specific Use |
|---|---|---|
| 400 | Bad Request | Invalid prompt format, unsupported parameters |
| 401 | Unauthorized | Invalid or missing API key |
| 413 | Payload Too Large | Input exceeds model's context window |
| 422 | Unprocessable | Content policy violation, unsafe input |
| 429 | Too Many Requests | Rate limit exceeded (include Retry-After) |
| 503 | Service Unavailable | Model loading, GPU capacity exhausted |
# Error response format
{
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded. Please retry after 30 seconds.",
    "code": "rate_limit_exceeded",
    "param": null,
    "retry_after": 30
  }
}
API Versioning Strategies
AI APIs need to handle both API version changes and model version changes:
- URL path versioning: /v1/completions — most common, clear, and cacheable.
- Model parameter versioning: "model": "gpt-4-0125" — pin to specific model snapshots.
- Date-based versioning: Anthropic-Version: 2024-01-01 — header-based API behavior versioning.
Separate API and model versions: API versioning controls the request/response format. Model versioning controls the AI behavior. Allow users to pin both independently for maximum stability.
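Model pinning can be implemented server-side with a simple alias table; a minimal sketch, where the alias names and snapshot dates are illustrative:

```python
# Hypothetical alias table: floating names resolve to the current pinned
# snapshot, while pinned names resolve to themselves, so clients can
# choose either behavior.
MODEL_ALIASES = {
    "gpt-4": "gpt-4-0125",
    "gpt-4-0125": "gpt-4-0125",
}

def resolve_model(requested: str) -> str:
    """Resolve a requested model name to a pinned snapshot, or fail loudly."""
    try:
        return MODEL_ALIASES[requested]
    except KeyError:
        raise ValueError(f"unknown model: {requested}") from None
```

Because responses echo the resolved snapshot name (e.g. "gpt-4-0125"), clients can detect when a floating alias has moved underneath them.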
Building with FastAPI
FastAPI is one of the most popular Python frameworks for building AI APIs:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI(title="AI Prediction API", version="1.0.0")

class PredictionRequest(BaseModel):
    text: str = Field(..., max_length=4096)
    model: str = Field(default="default-v1")
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    max_tokens: int = Field(default=256, ge=1, le=4096)

class PredictionResponse(BaseModel):
    id: str
    prediction: str
    confidence: float
    model: str
    usage: dict

@app.post("/v1/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    # run_inference and generate_id are application-provided helpers,
    # not part of FastAPI.
    result = await run_inference(request)
    return PredictionResponse(
        id=generate_id(),
        prediction=result.text,
        confidence=result.score,
        model=request.model,
        usage={"tokens": result.token_count},
    )