# Step 6: Deploy to Production
Your RAG chatbot works locally. Now let us deploy it properly with Docker, secure environment management, health checks, request logging, and cost tracking. By the end of this lesson, you can run the entire stack with a single command on any server.
## Production Docker Compose
Update docker-compose.yml with production-ready settings including health checks, resource limits, and a reverse proxy:
```yaml
# docker-compose.yml (production)
version: "3.8"

services:
  qdrant:
    image: qdrant/qdrant:v1.12.4
    container_name: rag-qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      QDRANT__SERVICE__GRPC_PORT: 6334
    healthcheck:
      # The Qdrant image ships without curl, so probe the HTTP port via bash's /dev/tcp
      test: ["CMD-SHELL", "bash -c ':> /dev/tcp/localhost/6333' || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "1.0"
    restart: unless-stopped

  api:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: rag-api
    ports:
      - "8000:8000"
    env_file:
      - .env
    environment:
      QDRANT_HOST: qdrant
      QDRANT_PORT: 6333
      LOG_LEVEL: INFO
    depends_on:
      qdrant:
        condition: service_healthy
    volumes:
      - ./data:/app/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: "1.0"
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    container_name: rag-nginx
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      - api
    restart: unless-stopped

volumes:
  qdrant_data:
```
## Nginx Reverse Proxy
Create nginx.conf for rate limiting, security headers, and proper SSE proxying (SSL/TLS termination can be layered on later with certbot or a cloud load balancer):
```nginx
# nginx.conf
upstream api {
    server api:8000;
}

# Rate limiting zone: 10 requests per second per IP
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    listen 80;
    server_name your-domain.com;

    # Security headers
    add_header X-Content-Type-Options nosniff;
    add_header X-Frame-Options DENY;
    add_header X-XSS-Protection "1; mode=block";

    # Max upload size for document ingestion
    client_max_body_size 50M;

    # API endpoints
    location /api/ {
        limit_req zone=api_limit burst=20 nodelay;
        proxy_pass http://api;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # SSE support - critical for streaming
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
    }

    # Health check (no rate limit)
    location /health {
        proxy_pass http://api;
    }

    # Frontend
    location / {
        proxy_pass http://api;
    }
}
```
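For context on what the proxy must pass through untouched: an SSE stream is plain text in which each event is one or more `data:` lines followed by a blank line. A minimal stdlib parser sketch (the payloads shown are made up for illustration):

```python
def parse_sse(stream: str) -> list[str]:
    """Extract the data payload of each event from a raw SSE text stream."""
    events = []
    for block in stream.split("\n\n"):  # events are separated by a blank line
        data_lines = [
            line[len("data:"):].lstrip()
            for line in block.split("\n")
            if line.startswith("data:")
        ]
        if data_lines:
            events.append("\n".join(data_lines))
    return events


raw = "data: Hello\n\ndata: world\n\ndata: [DONE]\n\n"
print(parse_sse(raw))  # ['Hello', 'world', '[DONE]']
```

If Nginx buffers this stream, the client sees all of these events at once at the end, instead of one token at a time.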
The `proxy_buffering off` directive is critical. Without it, Nginx buffers the streaming response and delivers all tokens at once, which defeats the purpose of streaming. The `proxy_read_timeout 300s` prevents Nginx from closing long-running streaming connections.

## Production Dockerfile
Update the Dockerfile with a multi-stage build and a non-root user:
```dockerfile
# Dockerfile (production)
FROM python:3.11-slim AS builder

WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.11-slim

# Create non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app

# Copy installed packages from builder
COPY --from=builder /install /usr/local

# Install runtime system deps only (curl is needed by the container healthcheck)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy application
COPY app/ ./app/
COPY frontend/ ./frontend/

# Set ownership
RUN chown -R appuser:appuser /app
USER appuser

EXPOSE 8000

# Production command with multiple workers
# (uvloop and httptools must be listed in requirements.txt)
CMD ["uvicorn", "app.main:app", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--workers", "2", \
     "--loop", "uvloop", \
     "--http", "httptools"]
```
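It is worth pairing this with a `.dockerignore` so secrets and local artifacts never enter the build context. A minimal sketch (adjust the entries to your repo layout):

```
# .dockerignore
.env
.env.*
.git/
data/
__pycache__/
*.pyc
```

This keeps builds fast and guarantees a stray `COPY` can never bake your API keys into an image layer.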
## Request Logging and Monitoring
Add middleware to log every request with timing and cost tracking:
```python
# Add to app/main.py - request logging middleware
import logging
import time as time_module

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

logger = logging.getLogger(__name__)  # reuse your existing logger if main.py has one


class RequestLoggingMiddleware(BaseHTTPMiddleware):
    """Log every request with timing information."""

    async def dispatch(self, request: Request, call_next):
        start = time_module.perf_counter()  # monotonic clock for interval timing
        response = await call_next(request)
        elapsed = time_module.perf_counter() - start
        client = request.client.host if request.client else "-"
        logger.info(
            f"{request.method} {request.url.path} "
            f"- {response.status_code} "
            f"- {elapsed:.3f}s "
            f"- {client}"
        )
        # Add timing header
        response.headers["X-Response-Time"] = f"{elapsed:.3f}s"
        return response


# Add before routes
app.add_middleware(RequestLoggingMiddleware)
```
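If you later ship these logs to an aggregator, structured JSON lines are easier to query than the formatted string above. A stdlib-only sketch of the same record (the field names are illustrative, not a fixed schema):

```python
import json


def log_record(method: str, path: str, status: int, elapsed_s: float, client_ip: str) -> str:
    """Build one structured JSON log line for a request (illustrative schema)."""
    return json.dumps({
        "method": method,
        "path": path,
        "status": status,
        "elapsed_ms": round(elapsed_s * 1000, 1),
        "client": client_ip,
    })


line = log_record("POST", "/api/chat", 200, 0.4321, "203.0.113.7")
print(line)  # one JSON object per request
```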
## Cost Tracking
Track OpenAI API costs per query to avoid surprise bills:
```python
# app/cost_tracker.py
"""Track OpenAI API costs per query."""

import logging
from dataclasses import dataclass
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

# Pricing per 1M tokens (as of 2025)
PRICING = {
    "text-embedding-3-small": {"input": 0.02},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

MAX_LOG_ENTRIES = 1000  # cap the in-memory log so it cannot grow without bound


@dataclass
class QueryCost:
    """Cost breakdown for a single query."""

    embedding_tokens: int = 0
    chat_input_tokens: int = 0
    chat_output_tokens: int = 0
    embedding_cost: float = 0.0
    chat_cost: float = 0.0
    total_cost: float = 0.0
    timestamp: str = ""


class CostTracker:
    """Track cumulative API costs."""

    def __init__(self):
        self.total_cost: float = 0.0
        self.total_queries: int = 0
        self.query_log: list[QueryCost] = []

    def track_query(
        self,
        embedding_tokens: int,
        chat_input_tokens: int,
        chat_output_tokens: int,
        embedding_model: str = "text-embedding-3-small",
        chat_model: str = "gpt-4o-mini",
    ) -> QueryCost:
        """Record the cost of a query.

        Args:
            embedding_tokens: Tokens used for embeddings.
            chat_input_tokens: Input tokens for chat completion.
            chat_output_tokens: Output tokens for chat completion.
            embedding_model: The embedding model used.
            chat_model: The chat model used.

        Returns:
            QueryCost with the breakdown.
        """
        embed_price = PRICING.get(embedding_model, {})
        chat_price = PRICING.get(chat_model, {})

        embedding_cost = (embedding_tokens / 1_000_000) * embed_price.get("input", 0)
        chat_input_cost = (chat_input_tokens / 1_000_000) * chat_price.get("input", 0)
        chat_output_cost = (chat_output_tokens / 1_000_000) * chat_price.get("output", 0)
        total = embedding_cost + chat_input_cost + chat_output_cost

        cost = QueryCost(
            embedding_tokens=embedding_tokens,
            chat_input_tokens=chat_input_tokens,
            chat_output_tokens=chat_output_tokens,
            embedding_cost=round(embedding_cost, 6),
            chat_cost=round(chat_input_cost + chat_output_cost, 6),
            total_cost=round(total, 6),
            timestamp=datetime.now(timezone.utc).isoformat(),
        )

        self.total_cost += total
        self.total_queries += 1
        self.query_log.append(cost)
        del self.query_log[:-MAX_LOG_ENTRIES]  # keep only the most recent entries

        logger.info(
            f"Query cost: ${total:.6f} "
            f"(embed: {embedding_tokens} tokens, "
            f"chat: {chat_input_tokens}+{chat_output_tokens} tokens)"
        )
        return cost

    def get_summary(self) -> dict:
        """Get cost summary."""
        return {
            "total_cost_usd": round(self.total_cost, 4),
            "total_queries": self.total_queries,
            "avg_cost_per_query": round(
                self.total_cost / max(self.total_queries, 1), 6
            ),
            "last_10_queries": [
                {
                    "total_cost": q.total_cost,
                    "timestamp": q.timestamp,
                }
                for q in self.query_log[-10:]
            ],
        }


# Global tracker instance
cost_tracker = CostTracker()
```
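As a sanity check on the pricing table, here is the arithmetic for one hypothetical query; the token counts are assumptions for a short question answered from 5 retrieved chunks, not measurements:

```python
# Assumed per-query token counts (illustrative)
embedding_tokens = 20
chat_input_tokens = 600
chat_output_tokens = 100

# Prices per 1M tokens, matching the PRICING table above
cost = (
    embedding_tokens / 1_000_000 * 0.02      # text-embedding-3-small input
    + chat_input_tokens / 1_000_000 * 0.15   # gpt-4o-mini input
    + chat_output_tokens / 1_000_000 * 0.60  # gpt-4o-mini output
)
print(f"${cost:.6f} per query")  # $0.000150 -> roughly $0.45/month at 100 queries/day
```

The chat completion dominates: the embedding call contributes well under 1% of the total.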
Add a cost tracking endpoint:
```python
# Add to app/main.py
from app.cost_tracker import cost_tracker


@app.get("/api/costs")
async def get_costs():
    """Get API cost summary."""
    return cost_tracker.get_summary()
```
## Environment Variable Security
Create a `.env.production` template with notes on secure handling (copy it to `.env` on the server, since the Compose file loads `.env`):
```bash
# .env.production - NEVER commit this file
# Use a secrets manager (AWS Secrets Manager, Vault, etc.) in production

# OpenAI - use a dedicated API key with spending limits
OPENAI_API_KEY=sk-prod-your-production-key
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
OPENAI_CHAT_MODEL=gpt-4o-mini

# Qdrant - internal network only
QDRANT_HOST=qdrant
QDRANT_PORT=6333
QDRANT_COLLECTION=rag_chatbot_docs

# Retrieval
CHUNK_SIZE=512
CHUNK_OVERLAP=50
TOP_K=5

# Logging
LOG_LEVEL=WARNING
```
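A cheap safeguard to pair with the template: fail fast at startup if a required variable is missing, instead of failing mid-request. A sketch using the variable names above (`check_env` is a hypothetical helper, not part of the app yet):

```python
REQUIRED_VARS = ["OPENAI_API_KEY", "QDRANT_HOST", "QDRANT_PORT", "QDRANT_COLLECTION"]


def check_env(env: dict) -> list[str]:
    """Return the required variable names that are missing or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# In app startup you would pass dict(os.environ); shown here with a partial config:
missing = check_env({"OPENAI_API_KEY": "sk-test", "QDRANT_HOST": "qdrant"})
print(missing)  # ['QDRANT_PORT', 'QDRANT_COLLECTION']
```

Calling this in FastAPI's startup hook and raising on a non-empty result turns a vague runtime error into a clear one-line failure.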
## Deploy with One Command
```bash
# Build and start all services
docker-compose up -d --build

# Verify all services are healthy
docker-compose ps
# NAME         STATUS          PORTS
# rag-qdrant   Up (healthy)    0.0.0.0:6333->6333/tcp
# rag-api      Up (healthy)    0.0.0.0:8000->8000/tcp
# rag-nginx    Up              0.0.0.0:80->80/tcp

# Check logs
docker-compose logs -f api

# Ingest documents (quote the URL so the shell does not glob-expand `?`)
curl -X POST "http://localhost/api/ingest?directory=data/sample"

# Open the chatbot at http://localhost (or your domain)
```
## Production Checklist
| Item | Status | Notes |
|---|---|---|
| Docker containers with health checks | Done | 30s interval, 3 retries |
| Non-root user in container | Done | appuser in Dockerfile |
| Resource limits (CPU, memory) | Done | 1 CPU, 1-2GB RAM per service |
| Reverse proxy with rate limiting | Done | 10 req/s per IP |
| SSE proxy configuration | Done | proxy_buffering off |
| Request logging with timing | Done | Middleware logs every request |
| Cost tracking per query | Done | /api/costs endpoint |
| Environment variable security | Done | .env never committed |
| Persistent Qdrant storage | Done | Docker volume |
| Automatic restart on failure | Done | restart: unless-stopped |
## Monthly Cost Estimate
| Component | 100 queries/day | 1,000 queries/day |
|---|---|---|
| OpenAI Embeddings | $0.06/mo | $0.60/mo |
| OpenAI Chat (gpt-4o-mini) | $0.45/mo | $4.50/mo |
| Qdrant (self-hosted) | $0 (Docker) | $0 (Docker) |
| Server (VPS) | $5-10/mo | $10-20/mo |
| Total | $5-11/mo | $15-25/mo |
## Key Takeaways
- Docker Compose deploys the entire stack (Qdrant, API, Nginx) with a single `docker-compose up -d` command.
- Health checks ensure services restart automatically if they crash or become unresponsive.
- Nginx handles rate limiting, SSE proxying, and security headers in front of FastAPI.
- Cost tracking at the query level prevents surprise OpenAI bills and helps optimize the pipeline.
- The total cost for a low-traffic RAG chatbot is roughly $5-11/month, including server hosting.
## What's Next
Your RAG chatbot is deployed and running in production. In the final lesson, you will learn about enhancements and next steps: multi-tenant support, authentication, analytics, and advanced patterns to take your chatbot further.