REST API Design for AI Products

Design intuitive RESTful APIs for serving ML models with proper endpoint structure, request/response schemas, versioning, and error handling.

Endpoint Design for AI

AI APIs typically follow a resource-action pattern rather than pure CRUD:

# Prediction endpoints
POST /v1/models/{model_id}/predict
POST /v1/chat/completions
POST /v1/embeddings
POST /v1/images/generations

# Async/batch endpoints
POST /v1/batches
GET  /v1/batches/{batch_id}
GET  /v1/batches/{batch_id}/results

# Model management
GET  /v1/models
GET  /v1/models/{model_id}
POST /v1/fine-tuning/jobs
GET  /v1/fine-tuning/jobs/{job_id}
💡 Use POST for predictions: Even though predictions are "read-like" operations, use POST because request bodies (prompts, images, parameters) are too complex for query strings. This follows the pattern established by OpenAI, Anthropic, and other major AI API providers.
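To illustrate the POST pattern, here is a minimal client-side sketch using only the standard library; the endpoint URL and payload are hypothetical:

```python
import json
import urllib.request

# Hypothetical prediction endpoint and payload
url = "https://api.example.com/v1/models/sentiment-v2/predict"
payload = {
    "input": "The product exceeded my expectations.",
    "temperature": 0.0,
}

# Complex inputs travel in the JSON body, not the query string --
# this is why prediction endpoints use POST rather than GET.
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(req.get_method())  # "POST"
```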

Request Schema Design

A well-designed AI API request includes the input data and model control parameters:

# Request body for a chat completion API
{
  "model": "gpt-4",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing."}
  ],
  "temperature": 0.7,
  "max_tokens": 1024,
  "top_p": 1.0,
  "stream": false,
  "response_format": {"type": "json_object"}
}
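On the server side, control parameters like these should be validated before the request ever reaches the model. A minimal sketch, assuming bounds similar to common provider limits (adjust for your model):

```python
# Validate the control parameters from the request body above.
# Bounds here are illustrative, not authoritative.
def validate_params(body: dict) -> list[str]:
    errors = []
    if not body.get("messages"):
        errors.append("messages must be a non-empty list")
    temperature = body.get("temperature", 1.0)
    if not 0.0 <= temperature <= 2.0:
        errors.append("temperature must be between 0.0 and 2.0")
    max_tokens = body.get("max_tokens", 256)
    if max_tokens < 1:
        errors.append("max_tokens must be at least 1")
    return errors

# A valid request produces no errors; an invalid one lists each problem.
print(validate_params({"messages": [], "temperature": 3.0}))
```

Returning all violations at once (rather than failing on the first) gives clients a single 400 response they can fix in one pass.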

Response Schema Design

Include the prediction result, metadata, and usage information:

# Response body
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "gpt-4-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses quantum bits..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
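The usage block is what lets clients meter cost per request. A small sketch of how a server might assemble it from the token counts reported by its tokenizer or inference layer (counts here are illustrative):

```python
# Build the usage block from per-request token counts.
def build_usage(prompt_tokens: int, completion_tokens: int) -> dict:
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }

usage = build_usage(25, 150)
print(usage["total_tokens"])  # 175
```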

Error Handling for AI APIs

AI APIs need richer error responses than traditional APIs:

Status Code | Meaning               | AI-Specific Use
400         | Bad Request           | Invalid prompt format, unsupported parameters
401         | Unauthorized          | Invalid or missing API key
413         | Payload Too Large     | Input exceeds the model's context window
422         | Unprocessable Content | Content policy violation, unsafe input
429         | Too Many Requests     | Rate limit exceeded (include a Retry-After header)
503         | Service Unavailable   | Model still loading, GPU capacity exhausted

# Error response format
{
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded. Please retry after 30 seconds.",
    "code": "rate_limit_exceeded",
    "param": null,
    "retry_after": 30
  }
}
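A structured error like this lets clients decide how long to wait before retrying. A minimal sketch of that logic, using the field names from the error format above (the fallback backoff policy is an assumption):

```python
import json

def retry_delay(error_body: str, attempt: int, base: float = 1.0) -> float:
    """Prefer the server's retry_after hint when present;
    otherwise fall back to exponential backoff (base * 2^attempt)."""
    err = json.loads(error_body).get("error", {})
    hint = err.get("retry_after")
    if hint is not None:
        return float(hint)
    return base * (2 ** attempt)

body = '{"error": {"type": "rate_limit_error", "retry_after": 30}}'
print(retry_delay(body, attempt=0))  # 30.0
```

Sending a machine-readable retry_after (and the matching Retry-After header) spares clients from guessing and reduces thundering-herd retries.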

API Versioning Strategies

AI APIs need to handle both API version changes and model version changes:

  • URL path versioning: /v1/completions — most common, clear, and cacheable.
  • Model parameter versioning: "model": "gpt-4-0125" — pin to specific model snapshots.
  • Date-based versioning: Anthropic-Version: 2024-01-01 — header-based API behavior versioning.
Separate API and model versions: API versioning controls the request/response format. Model versioning controls the AI behavior. Allow users to pin both independently for maximum stability.
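One way a server can support both floating model names and pinned snapshots is a small resolution step at request time. A sketch, assuming a hypothetical alias table and snapshot IDs:

```python
# Hypothetical alias table: floating names resolve to the current
# snapshot, while pinned snapshot IDs pass through unchanged.
MODEL_ALIASES = {
    "gpt-4": "gpt-4-0125",
}
KNOWN_SNAPSHOTS = {"gpt-4-0125", "gpt-4-1106"}

def resolve_model(requested: str) -> str:
    if requested in MODEL_ALIASES:
        return MODEL_ALIASES[requested]
    if requested in KNOWN_SNAPSHOTS:
        return requested
    raise ValueError(f"unknown model: {requested}")

print(resolve_model("gpt-4"))  # "gpt-4-0125"
```

Echoing the resolved snapshot back in the response (as the "model" field in the earlier example does) tells clients exactly which version served them.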

Building with FastAPI

FastAPI is one of the most popular Python frameworks for building AI APIs, with async support and Pydantic-based request validation built in:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI(title="AI Prediction API", version="1.0.0")

class PredictionRequest(BaseModel):
    text: str = Field(..., max_length=4096)
    model: str = Field(default="default-v1")
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    max_tokens: int = Field(default=256, ge=1, le=4096)

class PredictionResponse(BaseModel):
    id: str
    prediction: str
    confidence: float
    model: str
    usage: dict

@app.post("/v1/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    # run_inference and generate_id are placeholders for your
    # model-serving layer; FastAPI validates the body against
    # PredictionRequest before this handler runs.
    try:
        result = await run_inference(request)
    except Exception:
        # Surface inference failures using the error conventions above
        raise HTTPException(status_code=503, detail="Model temporarily unavailable")
    return PredictionResponse(
        id=generate_id(),
        prediction=result.text,
        confidence=result.score,
        model=request.model,
        usage={"tokens": result.token_count},
    )