Step 5: Deploy to Production
Ship your voice assistant. You will containerize the entire application with Docker, configure production settings, optimize end-to-end latency, manage concurrent WebSocket sessions, and set up monitoring and cost tracking.
Production Docker Setup
Create a production-ready Dockerfile with multi-stage build for smaller image size:
# Dockerfile
FROM python:3.11-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.11-slim

WORKDIR /app

# Copy installed packages from builder
COPY --from=builder /install /usr/local

# Copy application code
COPY app/ ./app/
COPY frontend/ ./frontend/

# Create non-root user
RUN useradd --create-home appuser
USER appuser

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import httpx; r = httpx.get('http://localhost:8000/health'); assert r.status_code == 200"

# Run with production settings
CMD ["uvicorn", "app.main:app", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--workers", "1", \
     "--ws-max-size", "16777216", \
     "--timeout-keep-alive", "120"]
Why 1 Worker? WebSocket connections are stateful and tied to a specific worker process. With multiple workers, a client might reconnect to a different worker and lose their conversation state. For voice assistants, use 1 worker per container and scale horizontally with multiple containers behind a load balancer that supports sticky sessions.
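If you do scale out, the load balancer must pin each client to one backend so reconnects land on the same container. With nginx, the `ip_hash` directive is a simple (if coarse) way to get sticky routing; the two-container upstream below is illustrative, not part of the config shown later:

```nginx
# Sketch: sticky routing across two app containers (names are illustrative)
upstream voice_backend {
    ip_hash;  # hash the client IP so each client sticks to one backend
    server voice-assistant-1:8000;
    server voice-assistant-2:8000;
}
```

Note that `ip_hash` breaks down behind another proxy or CDN that collapses client IPs; cookie-based stickiness is the more robust option in that case.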
Docker Compose for Production
# docker-compose.yml
version: "3.8"

services:
  voice-assistant:
    build: .
    container_name: voice-assistant
    ports:
      - "8000:8000"
    env_file:
      - .env
    volumes:
      - ./frontend:/app/frontend:ro
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  # Reverse proxy with SSL termination
  nginx:
    image: nginx:alpine
    container_name: voice-nginx
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/certs:/etc/nginx/certs:ro
    depends_on:
      - voice-assistant
    restart: unless-stopped
Nginx Configuration for WebSockets
WebSocket connections require special nginx configuration for proxying:
# nginx/nginx.conf
events {
    worker_connections 1024;
}

http {
    # WebSocket upgrade map
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    upstream voice_backend {
        server voice-assistant:8000;
    }

    server {
        listen 80;
        server_name your-domain.com;
        return 301 https://$host$request_uri;
    }

    server {
        listen 443 ssl;
        server_name your-domain.com;

        ssl_certificate     /etc/nginx/certs/fullchain.pem;
        ssl_certificate_key /etc/nginx/certs/privkey.pem;

        # WebSocket endpoint
        location /ws/ {
            proxy_pass http://voice_backend;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;

            # Timeouts for long-lived WebSocket connections
            proxy_read_timeout 3600s;
            proxy_send_timeout 3600s;

            # Disable buffering so audio chunks are forwarded immediately
            proxy_buffering off;
            proxy_buffer_size 16k;
        }

        # HTTP endpoints
        location / {
            proxy_pass http://voice_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
Latency Optimization
Voice assistants live and die by latency. Here is a breakdown of where time is spent and how to optimize each stage:
# Latency Budget (target: under 2 seconds total)
#
# Stage | Typical | Optimized | Technique
# -------------------|-----------|-----------|---------------------------
# Audio capture | 800ms | 500ms | Reduce silence threshold
# Network (upload) | 50ms | 50ms | Binary WebSocket frames
# Whisper ASR | 500ms | 300ms | Use whisper-1 API, short audio
# Network (API) | 100ms | 50ms | Keep-alive connections
# LLM (first token) | 500ms | 300ms | GPT-4o-mini, streaming
# TTS (first chunk) | 400ms | 200ms | ElevenLabs Turbo, streaming
# Network (download) | 50ms | 50ms | Binary WebSocket frames
# Audio playback | 0ms | 0ms | Starts immediately
# -------------------|-----------|-----------|---------------------------
# TOTAL | 2400ms | 1450ms |
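One way to verify these numbers against your own deployment is to time each stage as a turn flows through the pipeline. A minimal sketch, assuming nothing beyond the standard library (`LatencyTracker` is a hypothetical helper; the stage names mirror the table above):

```python
import time


class LatencyTracker:
    """Record per-stage durations for one request/response turn."""

    def __init__(self):
        self.stages = {}       # stage name -> duration in seconds
        self._start = None
        self._current = None

    def begin(self, stage: str):
        """Close the previous stage (if any) and start timing a new one."""
        now = time.perf_counter()
        if self._current is not None:
            self.stages[self._current] = now - self._start
        self._current = stage
        self._start = now

    def end(self):
        """Close the final stage."""
        if self._current is not None:
            self.stages[self._current] = time.perf_counter() - self._start
            self._current = None

    def summary_ms(self) -> dict:
        """Durations in milliseconds, keyed by stage name."""
        return {stage: round(seconds * 1000, 1)
                for stage, seconds in self.stages.items()}


# Usage in the pipeline:
tracker = LatencyTracker()
tracker.begin("whisper_asr")
# ... call Whisper here ...
tracker.begin("llm_first_token")
# ... wait for the first streamed LLM token here ...
tracker.end()
print(tracker.summary_ms())  # per-stage durations in milliseconds
```

Logging this summary per turn gives you real distributions to compare against the budget, which matters more than any one-off measurement.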
Key optimization strategies:
# app/optimizations.py
"""Latency optimization utilities."""
from typing import AsyncGenerator

import httpx


async def pipeline_with_overlap(
    llm_generator: AsyncGenerator[str, None],
    tts_speak_func,
    websocket
) -> None:
    """Run LLM generation and TTS synthesis with overlap.

    Instead of waiting for the full LLM response before
    starting TTS, we buffer tokens into sentences and
    start TTS as soon as the first sentence is complete.
    This overlaps LLM generation of sentence N+1 with
    TTS synthesis and playback of sentence N.
    """
    sentence_buffer = ""
    sentence_enders = {'.', '!', '?'}

    async for token in llm_generator:
        sentence_buffer += token

        # Send text to client for display
        await websocket.send_json({
            "type": "response_text",
            "text": token
        })

        # Check if sentence is complete
        if sentence_buffer.strip() and sentence_buffer.strip()[-1] in sentence_enders:
            sentence = sentence_buffer.strip()
            sentence_buffer = ""

            # Synthesize and stream audio for this sentence
            async for audio_chunk in tts_speak_func(sentence):
                await websocket.send_bytes(audio_chunk)

    # Handle remaining text
    if sentence_buffer.strip():
        async for audio_chunk in tts_speak_func(sentence_buffer.strip()):
            await websocket.send_bytes(audio_chunk)


class ConnectionPool:
    """Reusable HTTP connection pool for API calls.

    Creating a new HTTPS connection for every API call adds
    ~100ms of TLS handshake time. Reusing connections eliminates
    this overhead.
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.client = httpx.AsyncClient(
                timeout=30.0,
                limits=httpx.Limits(
                    max_connections=20,
                    max_keepalive_connections=10,
                    keepalive_expiry=300
                ),
                http2=True  # HTTP/2 multiplexing; requires httpx[http2]
            )
        return cls._instance

    async def close(self):
        await self.client.aclose()
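The sentence-buffering rule is easy to verify in isolation, without any real APIs. A standalone sketch of the same splitting logic (`split_stream` is a hypothetical helper for illustration, not part of the app code):

```python
def split_stream(tokens):
    """Yield complete sentences as soon as a terminator arrives,
    mirroring the buffering rule in pipeline_with_overlap."""
    buffer = ""
    enders = {'.', '!', '?'}
    for token in tokens:
        buffer += token
        # Flush the buffer as one sentence when it ends with . ! or ?
        if buffer.strip() and buffer.strip()[-1] in enders:
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never got a terminator
    if buffer.strip():
        yield buffer.strip()


tokens = ["Hel", "lo there.", " How are", " you", " doing"]
print(list(split_stream(tokens)))
# ['Hello there.', 'How are you doing']
```

One known weakness of this rule: abbreviations like "Dr." or "e.g." trigger a premature flush. In practice that usually just means a slightly awkward TTS pause, but a smarter splitter is an easy upgrade later.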
Concurrent Session Management
# app/session_manager.py
"""Manage concurrent voice sessions."""
import asyncio
import logging
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict

logger = logging.getLogger(__name__)

# Maximum concurrent sessions
MAX_SESSIONS = 50


@dataclass
class SessionInfo:
    """Track information about an active session."""
    session_id: str
    connected_at: datetime = field(default_factory=datetime.utcnow)
    message_count: int = 0
    total_audio_bytes: int = 0


class SessionManager:
    """Manage concurrent voice WebSocket sessions.

    Tracks active sessions, enforces limits, and provides
    metrics for monitoring.
    """

    def __init__(self, max_sessions: int = MAX_SESSIONS):
        self.max_sessions = max_sessions
        self._sessions: Dict[str, SessionInfo] = {}
        self._lock = asyncio.Lock()

    async def register(self, session_id: str) -> bool:
        """Register a new session. Returns False if at capacity."""
        async with self._lock:
            if len(self._sessions) >= self.max_sessions:
                logger.warning(
                    f"Session rejected: at capacity ({self.max_sessions})"
                )
                return False
            self._sessions[session_id] = SessionInfo(session_id=session_id)
            logger.info(
                f"Session {session_id[:8]} registered "
                f"({len(self._sessions)}/{self.max_sessions})"
            )
            return True

    async def unregister(self, session_id: str):
        """Remove a session."""
        async with self._lock:
            if session_id in self._sessions:
                info = self._sessions.pop(session_id)
                # total_seconds() is correct past 24h; timedelta.seconds wraps
                duration = int(
                    (datetime.utcnow() - info.connected_at).total_seconds()
                )
                logger.info(
                    f"Session {session_id[:8]} ended "
                    f"(duration={duration}s, messages={info.message_count})"
                )

    async def get_stats(self) -> dict:
        """Get session statistics for monitoring."""
        async with self._lock:
            return {
                "active_sessions": len(self._sessions),
                "max_sessions": self.max_sessions,
                "sessions": [
                    {
                        "id": s.session_id[:8] + "...",
                        "duration_seconds": int(
                            (datetime.utcnow() - s.connected_at).total_seconds()
                        ),
                        "messages": s.message_count
                    }
                    for s in self._sessions.values()
                ]
            }


# Global session manager
session_manager = SessionManager()
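In the WebSocket handler, call register before entering the receive loop and unregister in a finally block; when register returns False, close the socket (close code 1013, "try again later", is a reasonable choice). The capacity behavior is easy to demo with a condensed stand-in for the class above (trimmed for a self-contained run; the real class adds logging and stats):

```python
import asyncio


class SessionManager:
    """Condensed copy of the class above, just enough for a demo."""

    def __init__(self, max_sessions: int = 2):
        self.max_sessions = max_sessions
        self._sessions = {}
        self._lock = asyncio.Lock()

    async def register(self, sid: str) -> bool:
        async with self._lock:
            if len(self._sessions) >= self.max_sessions:
                return False  # at capacity: caller should close the socket
            self._sessions[sid] = object()
            return True

    async def unregister(self, sid: str):
        async with self._lock:
            self._sessions.pop(sid, None)


async def main():
    mgr = SessionManager(max_sessions=2)
    results = [
        await mgr.register("a"),
        await mgr.register("b"),
        await mgr.register("c"),  # rejected: both slots taken
    ]
    await mgr.unregister("a")
    results.append(await mgr.register("c"))  # accepted: a slot freed up
    return results


print(asyncio.run(main()))  # [True, True, False, True]
```

The lock matters: register and unregister race when many clients connect at once, and without it the manager could briefly exceed the cap.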
Monitoring and Cost Tracking
# app/monitoring.py
"""Production monitoring and cost tracking."""
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class UsageMetrics:
    """Track API usage and costs."""
    # Whisper ASR
    whisper_requests: int = 0
    whisper_audio_seconds: float = 0.0
    whisper_cost: float = 0.0
    # LLM
    llm_requests: int = 0
    llm_input_tokens: int = 0
    llm_output_tokens: int = 0
    llm_cost: float = 0.0
    # TTS
    tts_requests: int = 0
    tts_characters: int = 0
    tts_cost: float = 0.0

    @property
    def total_cost(self) -> float:
        return self.whisper_cost + self.llm_cost + self.tts_cost

    def record_whisper(self, audio_seconds: float):
        self.whisper_requests += 1
        self.whisper_audio_seconds += audio_seconds
        self.whisper_cost += audio_seconds * (0.006 / 60)  # $0.006/min

    def record_llm(self, input_tokens: int, output_tokens: int):
        self.llm_requests += 1
        self.llm_input_tokens += input_tokens
        self.llm_output_tokens += output_tokens
        # GPT-4o pricing: $2.50/M input tokens, $10/M output tokens
        self.llm_cost += (input_tokens * 2.50 + output_tokens * 10.0) / 1_000_000

    def record_tts(self, characters: int, provider: str = "elevenlabs"):
        self.tts_requests += 1
        self.tts_characters += characters
        if provider == "openai":
            self.tts_cost += characters * (15.0 / 1_000_000)
        # ElevenLabs free tier covers 10k chars/month, so no cost is recorded

    def get_summary(self) -> dict:
        return {
            "whisper": {
                "requests": self.whisper_requests,
                "audio_minutes": round(self.whisper_audio_seconds / 60, 2),
                "cost_usd": round(self.whisper_cost, 4)
            },
            "llm": {
                "requests": self.llm_requests,
                "input_tokens": self.llm_input_tokens,
                "output_tokens": self.llm_output_tokens,
                "cost_usd": round(self.llm_cost, 4)
            },
            "tts": {
                "requests": self.tts_requests,
                "characters": self.tts_characters,
                "cost_usd": round(self.tts_cost, 4)
            },
            "total_cost_usd": round(self.total_cost, 4)
        }


# Global metrics instance
metrics = UsageMetrics()

# Add monitoring endpoints to FastAPI:
#
# @app.get("/metrics")
# async def get_metrics():
#     return {
#         "usage": metrics.get_summary(),
#         "sessions": await session_manager.get_stats()
#     }
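To sanity-check the per-conversation cost, plug typical numbers into the same pricing used above. The usage figures here (5 seconds of audio, 500 input / 150 output tokens, 300 TTS characters with OpenAI TTS) are illustrative assumptions, not measurements:

```python
# Back-of-envelope cost for one conversational turn (illustrative usage figures)
whisper_cost = 5 * (0.006 / 60)                    # 5 s of audio at $0.006/min
llm_cost = (500 * 2.50 + 150 * 10.0) / 1_000_000   # 500 in / 150 out tokens, GPT-4o
tts_cost = 300 * (15.0 / 1_000_000)                # 300 characters, OpenAI TTS
total = whisper_cost + llm_cost + tts_cost
print(f"${total:.4f} per turn")  # roughly $0.008 -- under a cent per turn
```

At this rate, a thousand turns a day lands around $8/day, which is why the /metrics endpoint and a daily budget alert are worth wiring up before launch.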
Deploy and Verify
# Build and start the production stack
docker-compose build
docker-compose up -d
# Check logs
docker-compose logs -f voice-assistant
# Verify health
curl https://your-domain.com/health
# Check metrics
curl https://your-domain.com/metrics
# Run a quick test conversation through the browser
# Open https://your-domain.com and test the voice pipeline
Production Checklist
- SSL/TLS: WebSocket audio streaming requires HTTPS (wss://) in production browsers. Use Let's Encrypt for free certificates.
- Rate Limiting: Limit WebSocket connections per IP to prevent abuse. The session manager already caps total sessions.
- API Key Security: Never expose API keys to the frontend. All API calls go through the server.
- Error Recovery: The WebSocket handler catches errors and the client auto-reconnects on disconnect.
- Logging: JSON-structured logs with request IDs for tracing through the ASR → LLM → TTS pipeline.
- Cost Alerts: Set up alerts when daily API costs exceed your budget threshold.
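The cost-alert item need not be external tooling: a periodic task can compare metrics.total_cost against a threshold. A minimal sketch (DAILY_BUDGET_USD and check_budget are illustrative names, not part of the app code):

```python
import logging

logging.basicConfig(level=logging.WARNING)

DAILY_BUDGET_USD = 5.00  # assumption: set this to your own daily budget


def check_budget(total_cost_usd: float, budget: float = DAILY_BUDGET_USD) -> bool:
    """Return True (and log a warning) once spend crosses the budget."""
    if total_cost_usd >= budget:
        logging.warning(
            "Daily API spend $%.2f exceeds budget $%.2f",
            total_cost_usd, budget
        )
        return True
    return False


print(check_budget(1.20))  # False -- under budget
print(check_budget(6.80))  # True -- alert fires
```

In production, run this check on a timer (e.g. every few minutes), reset the running total at midnight, and route the warning to wherever your alerts already go.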
Checkpoint: Your voice assistant is now running in Docker with nginx reverse proxy, SSL termination, session management, latency optimization, and cost monitoring. The production stack handles concurrent users and recovers from errors gracefully.
Key Takeaways
- Use 1 uvicorn worker per container for WebSocket statefulness; scale horizontally with multiple containers.
- Nginx must be configured with proxy_set_header Upgrade and long timeouts for WebSocket proxying.
- Sentence-level streaming overlap (generating sentence N+1 while playing sentence N) is the biggest latency win.
- Track costs per API call: a typical conversation costs about $0.01 across all three stages.
What Is Next
In the final lesson, you will explore enhancements and advanced patterns — wake word detection, multi-language support, telephony integration, and a comprehensive FAQ for voice assistant builders.
Lilly Tech Systems