Step 5: Deploy to Production

Ship your voice assistant. You will containerize the entire application with Docker, configure production settings, optimize end-to-end latency, manage concurrent WebSocket sessions, and set up monitoring and cost tracking.

Production Docker Setup

Create a production-ready Dockerfile with multi-stage build for smaller image size:

# Dockerfile
FROM python:3.11-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt


FROM python:3.11-slim

WORKDIR /app

# Copy installed packages from builder
COPY --from=builder /install /usr/local

# Copy application code
COPY app/ ./app/
COPY frontend/ ./frontend/

# Create non-root user
RUN useradd --create-home appuser
USER appuser

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import httpx; r = httpx.get('http://localhost:8000/health'); assert r.status_code == 200"

# Run with production settings
CMD ["uvicorn", "app.main:app", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--workers", "1", \
     "--ws-max-size", "16777216", \
     "--timeout-keep-alive", "120"]
💡
Why 1 Worker? WebSocket connections are stateful and tied to a specific worker process. With multiple workers, a client might reconnect to a different worker and lose their conversation state. For voice assistants, use 1 worker per container and scale horizontally with multiple containers behind a load balancer that supports sticky sessions.
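
When you do scale out, the load balancer must pin each client to one backend. With nginx this can be as simple as `ip_hash` in the upstream block — a sketch, with illustrative container names:

```nginx
# Sketch: sticky sessions across multiple containers via ip_hash
upstream voice_backend {
    ip_hash;  # hash the client IP so reconnects hit the same backend
    server voice-assistant-1:8000;
    server voice-assistant-2:8000;
}
```

Note that `ip_hash` keys on the client IP, so behind another proxy or CDN you would need the real client IP, or a cookie-based method instead.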

Docker Compose for Production

# docker-compose.yml
version: "3.8"

services:
  voice-assistant:
    build: .
    container_name: voice-assistant
    ports:
      - "8000:8000"
    env_file:
      - .env
    volumes:
      - ./frontend:/app/frontend:ro
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  # Reverse proxy with SSL termination
  nginx:
    image: nginx:alpine
    container_name: voice-nginx
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/certs:/etc/nginx/certs:ro
    depends_on:
      - voice-assistant
    restart: unless-stopped

Nginx Configuration for WebSockets

Proxying WebSockets through nginx requires forwarding the HTTP/1.1 Upgrade and Connection headers explicitly:

# nginx/nginx.conf
events {
    worker_connections 1024;
}

http {
    # WebSocket upgrade map
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    upstream voice_backend {
        server voice-assistant:8000;
    }

    server {
        listen 80;
        server_name your-domain.com;
        return 301 https://$host$request_uri;
    }

    server {
        listen 443 ssl;
        server_name your-domain.com;

        ssl_certificate     /etc/nginx/certs/fullchain.pem;
        ssl_certificate_key /etc/nginx/certs/privkey.pem;

        # WebSocket endpoint
        location /ws/ {
            proxy_pass http://voice_backend;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;

            # Timeouts for long-lived WebSocket connections
            proxy_read_timeout 3600s;
            proxy_send_timeout 3600s;

            # Increase buffer sizes for audio data
            proxy_buffering off;
            proxy_buffer_size 16k;
        }

        # HTTP endpoints
        location / {
            proxy_pass http://voice_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}

Latency Optimization

Voice assistants live and die by latency. Here is a breakdown of where time is spent and how to optimize each stage:

# Latency Budget (target: under 2 seconds total)
#
# Stage              | Typical   | Optimized | Technique
# -------------------|-----------|-----------|---------------------------
# Audio capture      | 800ms     | 500ms     | Reduce silence threshold
# Network (upload)   | 50ms      | 50ms      | Binary WebSocket frames
# Whisper ASR        | 500ms     | 300ms     | Use whisper-1 API, short audio
# Network (API)      | 100ms     | 50ms      | Keep-alive connections
# LLM (first token)  | 500ms     | 300ms     | GPT-4o-mini, streaming
# TTS (first chunk)  | 400ms     | 200ms     | ElevenLabs Turbo, streaming
# Network (download) | 50ms      | 50ms      | Binary WebSocket frames
# Audio playback     | 0ms       | 0ms       | Starts immediately
# -------------------|-----------|-----------|---------------------------
# TOTAL              | 2400ms    | 1450ms    |
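
Before optimizing, measure where your deployment actually spends time. A minimal per-stage timer sketch (stage names and the sleep stand-ins are illustrative):

```python
"""Per-stage latency tracking (sketch)."""
import time
from contextlib import contextmanager


class LatencyTracker:
    """Collect wall-clock timings, in milliseconds, per pipeline stage."""

    def __init__(self):
        self.timings: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name] = (time.perf_counter() - start) * 1000

    def summary(self) -> dict:
        total = sum(self.timings.values())
        return {**{k: round(v, 1) for k, v in self.timings.items()},
                "total_ms": round(total, 1)}


tracker = LatencyTracker()
with tracker.stage("asr"):
    time.sleep(0.01)  # stand-in for the Whisper call
with tracker.stage("llm_first_token"):
    time.sleep(0.01)  # stand-in for time-to-first-token
print(tracker.summary())
```

Log the summary per turn and you can compare your real numbers against the budget table above.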

Key optimization strategies:

# app/optimizations.py
"""Latency optimization utilities."""
from typing import AsyncGenerator


async def pipeline_with_overlap(
    llm_generator: AsyncGenerator[str, None],
    tts_speak_func,
    websocket
) -> None:
    """Run LLM generation and TTS synthesis with overlap.

    Instead of waiting for the full LLM response before
    starting TTS, we buffer tokens into sentences and
    start TTS as soon as the first sentence is complete.

    This overlaps LLM generation of sentence N+1 with
    TTS synthesis and playback of sentence N.
    """
    sentence_buffer = ""
    sentence_enders = {'.', '!', '?'}

    async for token in llm_generator:
        sentence_buffer += token

        # Send text to client for display
        await websocket.send_json({
            "type": "response_text",
            "text": token
        })

        # Check if sentence is complete
        if sentence_buffer.strip() and sentence_buffer.strip()[-1] in sentence_enders:
            sentence = sentence_buffer.strip()
            sentence_buffer = ""

            # Synthesize and stream audio for this sentence
            async for audio_chunk in tts_speak_func(sentence):
                await websocket.send_bytes(audio_chunk)

    # Handle remaining text
    if sentence_buffer.strip():
        async for audio_chunk in tts_speak_func(sentence_buffer.strip()):
            await websocket.send_bytes(audio_chunk)


class ConnectionPool:
    """Reusable HTTP connection pool for API calls.

    Creating a new HTTPS connection for every API call adds
    ~100ms of TLS handshake time. Reusing connections eliminates
    this overhead.
    """

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            import httpx
            cls._instance = super().__new__(cls)
            cls._instance.client = httpx.AsyncClient(
                timeout=30.0,
                limits=httpx.Limits(
                    max_connections=20,
                    max_keepalive_connections=10,
                    keepalive_expiry=300
                ),
                http2=True  # HTTP/2 multiplexing; requires the httpx[http2] extra
            )
        return cls._instance

    async def close(self):
        await self.client.aclose()
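
One caveat on the sentence detection above: checking only the last character mis-splits on abbreviations ("e.g. this") and is only saved from decimals ("3.14") because digits follow the period. A slightly more robust splitter, as a sketch (the abbreviation list is illustrative and still a heuristic, not full segmentation):

```python
"""Heuristic sentence splitting for TTS chunking (sketch)."""
import re

# Don't split after these trailing abbreviations.
_ABBREV = re.compile(r"\b(e\.g|i\.e|etc|Dr|Mr|Mrs|Ms|vs)\.$", re.IGNORECASE)
# Split points: ., !, or ? followed by whitespace.
_ENDER = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    out: list[str] = []
    for part in _ENDER.split(text):
        # Re-join if the previous piece ended in a known abbreviation
        if out and _ABBREV.search(out[-1]):
            out[-1] += " " + part
        else:
            out.append(part)
    return [p for p in out if p.strip()]


print(split_sentences("Pi is 3.14 today. See e.g. this one. Bye!"))
```

Swapping this in for the last-character check keeps TTS chunks from starting mid-thought, at the cost of a slightly longer buffer before the first split.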

Concurrent Session Management

# app/session_manager.py
"""Manage concurrent voice sessions."""
import logging
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict
import asyncio

logger = logging.getLogger(__name__)

# Maximum concurrent sessions
MAX_SESSIONS = 50


@dataclass
class SessionInfo:
    """Track information about an active session."""
    session_id: str
    connected_at: datetime = field(default_factory=datetime.utcnow)
    message_count: int = 0
    total_audio_bytes: int = 0


class SessionManager:
    """Manage concurrent voice WebSocket sessions.

    Tracks active sessions, enforces limits, and provides
    metrics for monitoring.
    """

    def __init__(self, max_sessions: int = MAX_SESSIONS):
        self.max_sessions = max_sessions
        self._sessions: Dict[str, SessionInfo] = {}
        self._lock = asyncio.Lock()

    async def register(self, session_id: str) -> bool:
        """Register a new session. Returns False if at capacity."""
        async with self._lock:
            if len(self._sessions) >= self.max_sessions:
                logger.warning(
                    f"Session rejected: at capacity ({self.max_sessions})"
                )
                return False

            self._sessions[session_id] = SessionInfo(session_id=session_id)
            logger.info(
                f"Session {session_id[:8]} registered "
                f"({len(self._sessions)}/{self.max_sessions})"
            )
            return True

    async def unregister(self, session_id: str):
        """Remove a session."""
        async with self._lock:
            if session_id in self._sessions:
                info = self._sessions.pop(session_id)
                duration = int(
                    (datetime.utcnow() - info.connected_at).total_seconds()
                )
                logger.info(
                    f"Session {session_id[:8]} ended "
                    f"(duration={duration}s, messages={info.message_count})"
                )

    async def get_stats(self) -> dict:
        """Get session statistics for monitoring."""
        async with self._lock:
            return {
                "active_sessions": len(self._sessions),
                "max_sessions": self.max_sessions,
                "sessions": [
                    {
                        "id": s.session_id[:8] + "...",
                        "duration_seconds": (
                            datetime.utcnow() - s.connected_at
                        ).seconds,
                        "messages": s.message_count
                    }
                    for s in self._sessions.values()
                ]
            }


# Global session manager
session_manager = SessionManager()

Monitoring and Cost Tracking

# app/monitoring.py
"""Production monitoring and cost tracking."""
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class UsageMetrics:
    """Track API usage and costs."""

    # Whisper ASR
    whisper_requests: int = 0
    whisper_audio_seconds: float = 0.0
    whisper_cost: float = 0.0

    # LLM
    llm_requests: int = 0
    llm_input_tokens: int = 0
    llm_output_tokens: int = 0
    llm_cost: float = 0.0

    # TTS
    tts_requests: int = 0
    tts_characters: int = 0
    tts_cost: float = 0.0

    @property
    def total_cost(self) -> float:
        return self.whisper_cost + self.llm_cost + self.tts_cost

    def record_whisper(self, audio_seconds: float):
        self.whisper_requests += 1
        self.whisper_audio_seconds += audio_seconds
        self.whisper_cost += audio_seconds * (0.006 / 60)  # $0.006/min

    def record_llm(self, input_tokens: int, output_tokens: int):
        self.llm_requests += 1
        self.llm_input_tokens += input_tokens
        self.llm_output_tokens += output_tokens
        # GPT-4o pricing
        self.llm_cost += (input_tokens * 2.50 + output_tokens * 10.0) / 1_000_000

    def record_tts(self, characters: int, provider: str = "elevenlabs"):
        self.tts_requests += 1
        self.tts_characters += characters
        if provider == "openai":
            self.tts_cost += characters * (15.0 / 1_000_000)
        # ElevenLabs free tier covers 10k chars/month

    def get_summary(self) -> dict:
        return {
            "whisper": {
                "requests": self.whisper_requests,
                "audio_minutes": round(self.whisper_audio_seconds / 60, 2),
                "cost_usd": round(self.whisper_cost, 4)
            },
            "llm": {
                "requests": self.llm_requests,
                "input_tokens": self.llm_input_tokens,
                "output_tokens": self.llm_output_tokens,
                "cost_usd": round(self.llm_cost, 4)
            },
            "tts": {
                "requests": self.tts_requests,
                "characters": self.tts_characters,
                "cost_usd": round(self.tts_cost, 4)
            },
            "total_cost_usd": round(self.total_cost, 4)
        }


# Global metrics instance
metrics = UsageMetrics()


# Add monitoring endpoints to FastAPI:
#
# @app.get("/metrics")
# async def get_metrics():
#     return {
#         "usage": metrics.get_summary(),
#         "sessions": await session_manager.get_stats()
#     }
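
As a sanity check on these rates, here is a back-of-envelope for one short conversation, using the same per-unit prices as UsageMetrics above. The turn count, audio length, and token sizes are illustrative assumptions, not measurements:

```python
# Assumed conversation shape: 5 turns, ~5s of speech and a short reply each.
WHISPER_PER_MIN = 0.006                     # whisper-1: $0.006 per audio minute
LLM_IN_PER_M, LLM_OUT_PER_M = 2.50, 10.00   # GPT-4o: $ per 1M tokens

turns = 5
audio_sec = 5                # seconds of user speech per turn
in_tok, out_tok = 400, 100   # prompt + history vs. reply, per turn

asr_cost = turns * audio_sec / 60 * WHISPER_PER_MIN
llm_cost = turns * (in_tok * LLM_IN_PER_M + out_tok * LLM_OUT_PER_M) / 1_000_000
tts_cost = 0.0               # ElevenLabs free tier, per the note above

total = asr_cost + llm_cost + tts_cost
print(f"ASR ${asr_cost:.4f} + LLM ${llm_cost:.4f} + TTS ${tts_cost:.4f} "
      f"= ${total:.4f}")
```

Under these assumptions the conversation lands around a cent; paid TTS or longer prompt history can easily triple it, which is why the /metrics endpoint is worth watching.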

Deploy and Verify

# Build and start the production stack
docker-compose build
docker-compose up -d

# Check logs
docker-compose logs -f voice-assistant

# Verify health
curl https://your-domain.com/health

# Check metrics
curl https://your-domain.com/metrics

# Run a quick test conversation through the browser
# Open https://your-domain.com and test the voice pipeline

Production Checklist

  • SSL/TLS: WebSocket audio streaming requires HTTPS (wss://) in production browsers. Use Let's Encrypt for free certificates.
  • Rate Limiting: Limit WebSocket connections per IP to prevent abuse. The session manager already caps total sessions.
  • API Key Security: Never expose API keys to the frontend. All API calls go through the server.
  • Error Recovery: The WebSocket handler catches errors and the client auto-reconnects on disconnect.
  • Logging: JSON-structured logs with request IDs for tracing through the ASR → LLM → TTS pipeline.
  • Cost Alerts: Set up alerts when daily API costs exceed your budget threshold.
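
The per-IP limit from the checklist can be sketched with a counter behind an asyncio lock (the limit value and class name are illustrative). Behind nginx, remember to key on the X-Real-IP header the proxy sets, not the socket peer address:

```python
"""Sketch: per-IP WebSocket connection limiting."""
import asyncio
from collections import defaultdict


class IPLimiter:
    """Reject new connections once an IP holds too many."""

    def __init__(self, max_per_ip: int = 3):
        self.max_per_ip = max_per_ip
        self._counts: defaultdict[str, int] = defaultdict(int)
        self._lock = asyncio.Lock()

    async def acquire(self, ip: str) -> bool:
        """Call on connect; returns False if the IP is at its limit."""
        async with self._lock:
            if self._counts[ip] >= self.max_per_ip:
                return False
            self._counts[ip] += 1
            return True

    async def release(self, ip: str) -> None:
        """Call on disconnect."""
        async with self._lock:
            self._counts[ip] = max(0, self._counts[ip] - 1)


async def demo() -> None:
    limiter = IPLimiter(max_per_ip=2)
    results = [await limiter.acquire("1.2.3.4") for _ in range(3)]
    print(results)


asyncio.run(demo())
```

Acquire before session_manager.register in the accept path, and release in the same finally block that unregisters the session.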
📝
Checkpoint: Your voice assistant is now running in Docker with nginx reverse proxy, SSL termination, session management, latency optimization, and cost monitoring. The production stack handles concurrent users and recovers from errors gracefully.

Key Takeaways

  • Use 1 uvicorn worker per container for WebSocket statefulness; scale horizontally with multiple containers.
  • Nginx must be configured with proxy_set_header Upgrade and long timeouts for WebSocket proxying.
  • Sentence-level streaming overlap (generating sentence N+1 while playing sentence N) is the biggest latency win.
  • Track costs per API call: a typical conversation costs about $0.01 across all three stages.

What's Next

In the final lesson, you will explore enhancements and advanced patterns — wake word detection, multi-language support, telephony integration, and a comprehensive FAQ for voice assistant builders.