# Step 6: Deploy to Production
Your RAG chatbot works locally. Now let us deploy it properly with Docker, secure environment management, health checks, request logging, and cost tracking. By the end of this lesson, you can run the entire stack with a single command on any server.
## Production Docker Compose
Update docker-compose.yml with production-ready settings including health checks, resource limits, and a reverse proxy:
```yaml
# docker-compose.yml (production)
version: "3.8"

services:
  qdrant:
    image: qdrant/qdrant:v1.12.4
    container_name: rag-qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      QDRANT__SERVICE__GRPC_PORT: 6334
    healthcheck:
      # The Qdrant image ships without curl, so probe the HTTP port via bash's /dev/tcp
      test: ["CMD-SHELL", "bash -c ':> /dev/tcp/localhost/6333' || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "1.0"
    restart: unless-stopped

  api:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: rag-api
    ports:
      - "8000:8000"
    env_file:
      - .env
    environment:
      QDRANT_HOST: qdrant
      QDRANT_PORT: 6333
      LOG_LEVEL: INFO
    depends_on:
      qdrant:
        condition: service_healthy
    volumes:
      - ./data:/app/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: "1.0"
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    container_name: rag-nginx
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      - api
    restart: unless-stopped

volumes:
  qdrant_data:
```
## Nginx Reverse Proxy
Create nginx.conf for rate limiting, security headers, and proper SSE proxying (SSL/TLS termination can be layered on later with certbot or a cloud load balancer):
```nginx
# nginx.conf
upstream api {
    server api:8000;
}

# Rate limiting zone: 10 requests per second per IP
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    listen 80;
    server_name your-domain.com;

    # Security headers
    add_header X-Content-Type-Options nosniff;
    add_header X-Frame-Options DENY;
    add_header X-XSS-Protection "1; mode=block";

    # Max upload size for document ingestion
    client_max_body_size 50M;

    # API endpoints
    location /api/ {
        limit_req zone=api_limit burst=20 nodelay;
        proxy_pass http://api;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # SSE support - critical for streaming
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
    }

    # Health check (no rate limit)
    location /health {
        proxy_pass http://api;
    }

    # Frontend
    location / {
        proxy_pass http://api;
    }
}
```
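For context on what the proxy must pass through untouched: an SSE stream is plain text in which each event is one or more `data:` lines followed by a blank line. A minimal stdlib parser sketch (the payloads shown are made up for illustration):

```python
def parse_sse(stream: str) -> list[str]:
    """Extract the data payload of each event from a raw SSE text stream."""
    events = []
    for block in stream.split("\n\n"):  # events are separated by a blank line
        data_lines = [
            line[len("data:"):].lstrip()
            for line in block.split("\n")
            if line.startswith("data:")
        ]
        if data_lines:
            events.append("\n".join(data_lines))
    return events


raw = "data: Hello\n\ndata: world\n\ndata: [DONE]\n\n"
print(parse_sse(raw))  # ['Hello', 'world', '[DONE]']
```

If Nginx buffers this stream, the client sees all of these events at once at the end, instead of one token at a time.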
The `proxy_buffering off` directive is critical. Without it, Nginx buffers the streaming response and delivers all tokens at once, which defeats the purpose of streaming. The `proxy_read_timeout 300s` prevents Nginx from closing long-running streaming connections.

## Production Dockerfile
Update the Dockerfile with a multi-stage build and a non-root user:
```dockerfile
# Dockerfile (production)
FROM python:3.11-slim AS builder

WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.11-slim

# Create non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app

# Copy installed packages from builder
COPY --from=builder /install /usr/local

# Install runtime system deps only (curl is needed by the container healthcheck)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy application
COPY app/ ./app/
COPY frontend/ ./frontend/

# Set ownership
RUN chown -R appuser:appuser /app
USER appuser

EXPOSE 8000

# Production command with multiple workers
# (uvloop and httptools must be listed in requirements.txt)
CMD ["uvicorn", "app.main:app", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--workers", "2", \
     "--loop", "uvloop", \
     "--http", "httptools"]
```
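It is worth pairing this with a `.dockerignore` so secrets and local artifacts never enter the build context. A minimal sketch (adjust the entries to your repo layout):

```
# .dockerignore
.env
.env.*
.git/
data/
__pycache__/
*.pyc
```

This keeps builds fast and guarantees a stray `COPY` can never bake your API keys into an image layer.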
## Request Logging and Monitoring
Add middleware to log every request with timing and cost tracking:
```python
# Add to app/main.py - request logging middleware
import logging
import time as time_module

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

logger = logging.getLogger(__name__)  # reuse your existing logger if main.py has one


class RequestLoggingMiddleware(BaseHTTPMiddleware):
    """Log every request with timing information."""

    async def dispatch(self, request: Request, call_next):
        start = time_module.perf_counter()  # monotonic clock for interval timing
        response = await call_next(request)
        elapsed = time_module.perf_counter() - start
        client = request.client.host if request.client else "-"
        logger.info(
            f"{request.method} {request.url.path} "
            f"- {response.status_code} "
            f"- {elapsed:.3f}s "
            f"- {client}"
        )
        # Add timing header
        response.headers["X-Response-Time"] = f"{elapsed:.3f}s"
        return response


# Add before routes
app.add_middleware(RequestLoggingMiddleware)
```
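If you later ship these logs to an aggregator, structured JSON lines are easier to query than the formatted string above. A stdlib-only sketch of the same record (the field names are illustrative, not a fixed schema):

```python
import json


def log_record(method: str, path: str, status: int, elapsed_s: float, client_ip: str) -> str:
    """Build one structured JSON log line for a request (illustrative schema)."""
    return json.dumps({
        "method": method,
        "path": path,
        "status": status,
        "elapsed_ms": round(elapsed_s * 1000, 1),
        "client": client_ip,
    })


line = log_record("POST", "/api/chat", 200, 0.4321, "203.0.113.7")
print(line)  # one JSON object per request
```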
## Cost Tracking
Track OpenAI API costs per query to avoid surprise bills:
```python
# app/cost_tracker.py
"""Track OpenAI API costs per query."""

import logging
from dataclasses import dataclass
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

# Pricing per 1M tokens (as of 2025)
PRICING = {
    "text-embedding-3-small": {"input": 0.02},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

MAX_LOG_ENTRIES = 1000  # cap the in-memory log so it cannot grow without bound


@dataclass
class QueryCost:
    """Cost breakdown for a single query."""

    embedding_tokens: int = 0
    chat_input_tokens: int = 0
    chat_output_tokens: int = 0
    embedding_cost: float = 0.0
    chat_cost: float = 0.0
    total_cost: float = 0.0
    timestamp: str = ""


class CostTracker:
    """Track cumulative API costs."""

    def __init__(self):
        self.total_cost: float = 0.0
        self.total_queries: int = 0
        self.query_log: list[QueryCost] = []

    def track_query(
        self,
        embedding_tokens: int,
        chat_input_tokens: int,
        chat_output_tokens: int,
        embedding_model: str = "text-embedding-3-small",
        chat_model: str = "gpt-4o-mini",
    ) -> QueryCost:
        """Record the cost of a query.

        Args:
            embedding_tokens: Tokens used for embeddings.
            chat_input_tokens: Input tokens for chat completion.
            chat_output_tokens: Output tokens for chat completion.
            embedding_model: The embedding model used.
            chat_model: The chat model used.

        Returns:
            QueryCost with the breakdown.
        """
        embed_price = PRICING.get(embedding_model, {})
        chat_price = PRICING.get(chat_model, {})

        embedding_cost = (embedding_tokens / 1_000_000) * embed_price.get("input", 0)
        chat_input_cost = (chat_input_tokens / 1_000_000) * chat_price.get("input", 0)
        chat_output_cost = (chat_output_tokens / 1_000_000) * chat_price.get("output", 0)
        total = embedding_cost + chat_input_cost + chat_output_cost

        cost = QueryCost(
            embedding_tokens=embedding_tokens,
            chat_input_tokens=chat_input_tokens,
            chat_output_tokens=chat_output_tokens,
            embedding_cost=round(embedding_cost, 6),
            chat_cost=round(chat_input_cost + chat_output_cost, 6),
            total_cost=round(total, 6),
            timestamp=datetime.now(timezone.utc).isoformat(),
        )

        self.total_cost += total
        self.total_queries += 1
        self.query_log.append(cost)
        del self.query_log[:-MAX_LOG_ENTRIES]  # keep only the most recent entries

        logger.info(
            f"Query cost: ${total:.6f} "
            f"(embed: {embedding_tokens} tokens, "
            f"chat: {chat_input_tokens}+{chat_output_tokens} tokens)"
        )
        return cost

    def get_summary(self) -> dict:
        """Get cost summary."""
        return {
            "total_cost_usd": round(self.total_cost, 4),
            "total_queries": self.total_queries,
            "avg_cost_per_query": round(
                self.total_cost / max(self.total_queries, 1), 6
            ),
            "last_10_queries": [
                {
                    "total_cost": q.total_cost,
                    "timestamp": q.timestamp,
                }
                for q in self.query_log[-10:]
            ],
        }


# Global tracker instance
cost_tracker = CostTracker()
```
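As a sanity check on the pricing table, here is the arithmetic for one hypothetical query; the token counts are assumptions for a short question answered from 5 retrieved chunks, not measurements:

```python
# Assumed per-query token counts (illustrative)
embedding_tokens = 20
chat_input_tokens = 600
chat_output_tokens = 100

# Prices per 1M tokens, matching the PRICING table above
cost = (
    embedding_tokens / 1_000_000 * 0.02      # text-embedding-3-small input
    + chat_input_tokens / 1_000_000 * 0.15   # gpt-4o-mini input
    + chat_output_tokens / 1_000_000 * 0.60  # gpt-4o-mini output
)
print(f"${cost:.6f} per query")  # $0.000150 -> roughly $0.45/month at 100 queries/day
```

The chat completion dominates: the embedding call contributes well under 1% of the total.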
Add a cost tracking endpoint:
```python
# Add to app/main.py
from app.cost_tracker import cost_tracker


@app.get("/api/costs")
async def get_costs():
    """Get API cost summary."""
    return cost_tracker.get_summary()
```
## Environment Variable Security
Create a `.env.production` template with notes on secure handling (copy it to `.env` on the server, since the Compose file loads `.env`):
```bash
# .env.production - NEVER commit this file
# Use a secrets manager (AWS Secrets Manager, Vault, etc.) in production

# OpenAI - use a dedicated API key with spending limits
OPENAI_API_KEY=sk-prod-your-production-key
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
OPENAI_CHAT_MODEL=gpt-4o-mini

# Qdrant - internal network only
QDRANT_HOST=qdrant
QDRANT_PORT=6333
QDRANT_COLLECTION=rag_chatbot_docs

# Retrieval
CHUNK_SIZE=512
CHUNK_OVERLAP=50
TOP_K=5

# Logging
LOG_LEVEL=WARNING
```
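A cheap safeguard to pair with the template: fail fast at startup if a required variable is missing, instead of failing mid-request. A sketch using the variable names above (`check_env` is a hypothetical helper, not part of the app yet):

```python
REQUIRED_VARS = ["OPENAI_API_KEY", "QDRANT_HOST", "QDRANT_PORT", "QDRANT_COLLECTION"]


def check_env(env: dict) -> list[str]:
    """Return the required variable names that are missing or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# In app startup you would pass dict(os.environ); shown here with a partial config:
missing = check_env({"OPENAI_API_KEY": "sk-test", "QDRANT_HOST": "qdrant"})
print(missing)  # ['QDRANT_PORT', 'QDRANT_COLLECTION']
```

Calling this in FastAPI's startup hook and raising on a non-empty result turns a vague runtime error into a clear one-line failure.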
## Deploy with One Command
```bash
# Build and start all services
docker-compose up -d --build

# Verify all services are healthy
docker-compose ps
# NAME         STATUS          PORTS
# rag-qdrant   Up (healthy)    0.0.0.0:6333->6333/tcp
# rag-api      Up (healthy)    0.0.0.0:8000->8000/tcp
# rag-nginx    Up              0.0.0.0:80->80/tcp

# Check logs
docker-compose logs -f api

# Ingest documents (quote the URL so the shell does not glob-expand `?`)
curl -X POST "http://localhost/api/ingest?directory=data/sample"

# Open the chatbot at http://localhost (or your domain)
```
## Production Checklist
| Item | Status | Notes |
|---|---|---|
| Docker containers with health checks | Done | 30s interval, 3 retries |
| Non-root user in container | Done | appuser in Dockerfile |
| Resource limits (CPU, memory) | Done | 1 CPU, 1-2GB RAM per service |
| Reverse proxy with rate limiting | Done | 10 req/s per IP |
| SSE proxy configuration | Done | proxy_buffering off |
| Request logging with timing | Done | Middleware logs every request |
| Cost tracking per query | Done | /api/costs endpoint |
| Environment variable security | Done | .env never committed |
| Persistent Qdrant storage | Done | Docker volume |
| Automatic restart on failure | Done | restart: unless-stopped |
## Monthly Cost Estimate
| Component | 100 queries/day | 1,000 queries/day |
|---|---|---|
| OpenAI Embeddings | $0.06/mo | $0.60/mo |
| OpenAI Chat (gpt-4o-mini) | $0.45/mo | $4.50/mo |
| Qdrant (self-hosted) | $0 (Docker) | $0 (Docker) |
| Server (VPS) | $5-10/mo | $10-20/mo |
| Total | $5-11/mo | $15-25/mo |
## Key Takeaways
- Docker Compose deploys the entire stack (Qdrant, API, Nginx) with a single `docker-compose up -d` command.
- Health checks ensure services restart automatically if they crash or become unresponsive.
- Nginx handles rate limiting, SSE proxying, and security headers in front of FastAPI.
- Cost tracking at the query level prevents surprise OpenAI bills and helps optimize the pipeline.
- The total cost for a low-traffic RAG chatbot is roughly $5-11/month, including server hosting.
## What's Next
Your RAG chatbot is deployed and running in production. In the final lesson, you will learn about enhancements and next steps: multi-tenant support, authentication, analytics, and advanced patterns to take your chatbot further.