Intermediate

Azure OpenAI Deployment

Learn to create model deployments, manage versions and quotas, and architect multi-region deployments for high availability.

Creating a Deployment

# Azure CLI
az cognitiveservices account deployment create \
  --name my-openai-resource \
  --resource-group ai-rg \
  --deployment-name gpt4o-deployment \
  --model-name gpt-4o \
  --model-version "2024-08-06" \
  --model-format OpenAI \
  --sku-capacity 80 \
  --sku-name Standard

# Python SDK
from openai import AzureOpenAI
client = AzureOpenAI(
    azure_endpoint="https://my-openai-resource.openai.azure.com/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-06-01"
)

Deployment Types

Type	Pricing	Latency	Best For
Standard	Pay-per-token	Variable	Development, variable traffic
Provisioned (PTU)	Reserved throughput	Guaranteed low	Production, consistent traffic
Global Standard	Pay-per-token	Optimized routing	Global apps, best availability
Data Zone Standard	Pay-per-token	Variable	Data residency requirements

Multi-Region Architecture

Active-passive: Primary region handles traffic; secondary region for failover with same model deployment
Active-active: Load balance across regions using Azure API Management or Azure Front Door
Quota spreading: Distribute quota across regions to maximize total available throughput
Model availability: Not all models are available in all regions; check regional availability first

Version Management

Pin versions: Always specify a model version to avoid unexpected behavior changes
Auto-upgrade policy: Configure deployments to auto-upgrade on default version changes or require manual upgrade
Testing: Test new model versions in a staging deployment before updating production
Retirement: Monitor model retirement dates and plan migrations well in advance

✅

Pro tip: Deploy the same model across multiple Azure regions and use API Management with a load-balancing policy to distribute traffic. This gives you higher aggregate throughput (quotas are per-region) and built-in disaster recovery.

← PreviousIntroduction Next →Scaling