Advanced

Best Practices & Checklist

This final lesson compiles everything from the course into a production-ready checklist, documents the most common failure modes, and provides debugging techniques for when things go wrong. Bookmark this page — you will reference it every time you ship an LLM feature.

Production LLM Checklist

Use this checklist before launching any LLM-powered feature. Items are ordered by priority — address the top items first.

Security & Safety (Must-Have)

  • Prompt injection protection: Pattern-based + LLM-based detection on all user inputs
  • PII redaction: Redact emails, SSNs, credit cards, phone numbers before sending to LLM API
  • Output content policy: Validate responses against your content policy before showing to users
  • System prompt protection: Never put secrets, API keys, or sensitive logic in system prompts
  • Rate limiting: Per-user and per-API-key rate limits to prevent abuse
  • Input length limits: Cap input size to prevent token abuse and cost spikes
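The PII-redaction and input-length items above can be sketched together in a few lines. This is a minimal illustration, not a complete solution: the regex patterns, the `sanitize_input` name, and the 8,000-character cap are assumptions for this sketch; production systems typically combine a dedicated PII library with locale-specific rules.

```python
import re

# Illustrative patterns only -- real PII detection needs broader coverage
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[ -.]?\d{3}[ -.]?\d{4}\b"),
}

MAX_INPUT_CHARS = 8000  # assumed cap; tune to your model's context window

def sanitize_input(text: str) -> str:
    """Redact common PII patterns, then cap input length."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text[:MAX_INPUT_CHARS]
```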

Reliability (Must-Have)

  • Fallback providers: Configure at least one fallback LLM provider (e.g., Anthropic if primary is OpenAI)
  • Timeout handling: Set timeouts on all LLM calls (typically 30-60 seconds)
  • Retry with backoff: Retry transient failures (429, 500, 503) with exponential backoff
  • Graceful degradation: Define fallback behavior when all LLM providers are down
  • Error handling: Never expose raw LLM errors to users. Map to user-friendly messages.
  • Output format validation: Parse and validate structured outputs (JSON) before using them
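The timeout-and-retry items above can be sketched as a small wrapper. `TransientError` and the injectable `sleep` are stand-ins for your provider SDK's error type and real waiting; the attempt count and base delay are assumptions to tune.

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 503}

class TransientError(Exception):
    """Stand-in for a provider error carrying an HTTP status code."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retries(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry `call()` on transient statuses with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError as err:
            if err.status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus up to 1s of jitter to avoid thundering herds
            sleep(base_delay * (2 ** attempt) + random.random())
```

Injecting `sleep` keeps the wrapper testable; in production you would pass the default.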

Quality (Should-Have)

  • Evaluation suite: At least 10-20 test cases covering critical flows
  • Regression testing: Run evaluation suite on every prompt change
  • Prompt versioning: Version control all prompts with rollback capability
  • A/B testing: Test prompt changes on a small traffic percentage before full rollout
  • Human evaluation: Periodic human review of sampled responses (weekly or bi-weekly)
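The evaluation-suite and regression-testing items above can be wired into CI with something as small as this sketch. The test cases, the `run_model` callable, and the 90% pass-rate gate are all placeholder assumptions; real suites use richer assertions than substring matching.

```python
# Hypothetical test cases -- replace with flows from your own product
EVAL_CASES = [
    {"input": "2+2", "must_contain": "4"},
    {"input": "capital of France", "must_contain": "Paris"},
]

def run_eval(run_model, cases=EVAL_CASES, min_pass_rate=0.9):
    """Run each case through `run_model` and gate on the pass rate."""
    passed = sum(
        1 for case in cases
        if case["must_contain"].lower() in run_model(case["input"]).lower()
    )
    rate = passed / len(cases)
    return {"pass_rate": rate, "ok": rate >= min_pass_rate}
```

A CI job would call `run_eval` with the candidate prompt version and fail the build when `ok` is false.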

Cost & Observability (Should-Have)

  • Cost tracking: Per-request cost logging with team, feature, and model breakdowns
  • Budget alerts: Alerts when daily or weekly spend exceeds thresholds
  • Model routing: Route simple tasks to cheap models
  • Semantic caching: Cache responses for semantically similar queries
  • Logging: Log prompts, responses, latency, token counts, and costs (with PII redacted)
  • Monitoring dashboard: Track latency, error rates, cost, and quality metrics
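The model-routing item above can be as simple as a length-and-keyword heuristic. The model names and complexity signals below are placeholder assumptions; tune both against your own evaluation data rather than taking these values as given.

```python
# Placeholder model names -- substitute whatever your gateway exposes
CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4o"

# Assumed keywords that hint at multi-step reasoning
COMPLEX_HINTS = ("analyze", "reason", "plan", "multi-step", "compare")

def route_model(prompt: str, max_cheap_chars: int = 500) -> str:
    """Send short, simple prompts to the cheap model; everything else up."""
    looks_complex = any(hint in prompt.lower() for hint in COMPLEX_HINTS)
    if len(prompt) <= max_cheap_chars and not looks_complex:
        return CHEAP_MODEL
    return STRONG_MODEL
```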

User Experience (Nice-to-Have)

  • Streaming responses: Stream long responses to reduce perceived latency
  • Conversation memory: Maintain context across messages
  • Feedback collection: Thumbs up/down on responses for quality monitoring
  • Transparency: Indicate when responses are AI-generated

Common Failure Modes

These are the failures you will encounter in production. Learn to recognize and fix them before your users report them.

| Failure mode | Symptoms | Root cause | Fix |
| --- | --- | --- | --- |
| Prompt injection | Model ignores the system prompt, reveals instructions, changes persona | No input validation | Add a guardrails layer (Lesson 4) |
| Hallucination | Model states false facts confidently | No grounding data or a weak prompt | Add RAG, explicit "I don't know" instructions, fact-checking |
| Context overflow | Model loses early conversation context, gives inconsistent answers | Conversation exceeds the context window | Summary or vector memory (Lesson 5) |
| Cost spike | Monthly bill jumps 3-10x unexpectedly | No rate limiting, no model routing, default max_tokens | Add a gateway with cost tracking (Lessons 3, 7) |
| Format errors | App crashes parsing LLM output as JSON | No output validation; model returns an invalid format | Add format validation, retry on parse failure, use response_format |
| Latency spikes | Response times go from 2s to 30s+ | Provider degradation, no caching, large prompts | Semantic cache, streaming, fallback providers, prompt compression |
| Prompt regression | Quality drops after a prompt update | No regression testing | Evaluation suite in CI/CD (Lesson 6) |
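The "Format errors" failure mode above is worth a concrete sketch. This parser tolerates the markdown code fences models sometimes wrap JSON in; the `required_keys` check is an illustrative stand-in for a real schema validator such as pydantic or jsonschema.

```python
import json

def parse_json_response(raw: str, required_keys=("answer",)):
    """Parse an LLM response as JSON, tolerating markdown code fences.

    Raises ValueError on invalid JSON or missing keys, so callers can
    retry the LLM call instead of crashing.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Strip a ```json ... ``` fence the model may have added
        text = text.strip("`")
        text = text.split("\n", 1)[1] if "\n" in text else text
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"invalid JSON: {err}") from err
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

A retry loop around this function (re-prompting with the error message) handles most transient format failures.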

Debugging LLM Issues

When your LLM application behaves unexpectedly, follow this debugging flowchart:

# LLM Debugging Flowchart
#
# 1. Is the issue with INPUT or OUTPUT?
#    |
#    |- INPUT: Check the full prompt being sent
#    |  |- Is the system prompt correct? (check prompt version)
#    |  |- Is user input being properly sanitized?
#    |  |- Is context/memory being injected correctly?
#    |  |- Is the prompt too long? (check token count)
#    |
#    |- OUTPUT: Check the raw LLM response
#       |- Is the model returning valid format?
#       |- Is the content accurate?
#       |- Is the response being post-processed incorrectly?
#       |- Did the guardrails modify the response?
#
# 2. Is the issue CONSISTENT or INTERMITTENT?
#    |
#    |- CONSISTENT: Likely a prompt or configuration issue
#    |  |- Reproduce with temperature=0
#    |  |- Compare prompt with last known good version
#    |  |- Check if model version changed
#    |
#    |- INTERMITTENT: Likely a non-determinism or context issue
#       |- Log full request/response for failing cases
#       |- Check if context/memory varies between requests
#       |- Check if rate limiting is causing fallback to different model
#
# 3. Quick fixes to try:
#    |- Set temperature=0 for debugging
#    |- Print the FULL prompt (system + user + context)
#    |- Test the same prompt in the provider's playground
#    |- Revert to the last known good prompt version
#    |- Check provider status page for outages
💡
Apply at work: The most common debugging mistake is not logging the full prompt. When someone says "the AI gave a wrong answer," the first thing to check is the exact prompt that was sent — including the system prompt, injected context, memory, and few-shot examples. 90% of LLM bugs are prompt bugs.
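Logging the full prompt can be sketched as a single assembly function. The message shape below assumes an OpenAI-style chat message list; the function name and context formatting are illustrative.

```python
import json
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("llm")

def build_and_log_prompt(system: str, context: str, history: list, user: str):
    """Assemble the full message list and log exactly what will be sent."""
    messages = [{"role": "system", "content": system}]
    if context:
        messages.append({"role": "system", "content": f"Context:\n{context}"})
    messages.extend(history)
    messages.append({"role": "user", "content": user})
    # Log the FULL payload -- this is usually the fastest way to find bugs
    log.debug("outgoing prompt: %s", json.dumps(messages, indent=2))
    return messages
```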

Production Architecture Checklist Code

Here is a validation function that checks your LLM application against the production checklist:

def validate_production_readiness(config: dict) -> dict:
    """Check if an LLM application meets production standards."""
    checks = []

    # Security checks
    checks.append({
        "category": "Security",
        "check": "Prompt injection detection",
        "status": "pass" if config.get("guardrails_enabled") else "FAIL",
        "priority": "critical"
    })
    checks.append({
        "category": "Security",
        "check": "PII redaction",
        "status": "pass" if config.get("pii_redaction") else "FAIL",
        "priority": "critical"
    })
    checks.append({
        "category": "Security",
        "check": "Rate limiting",
        "status": "pass" if config.get("rate_limit_per_user") else "FAIL",
        "priority": "critical"
    })

    # Reliability checks
    checks.append({
        "category": "Reliability",
        "check": "Fallback providers configured",
        "status": "pass" if len(config.get("fallback_models", [])) >= 1 else "FAIL",
        "priority": "high"
    })
    checks.append({
        "category": "Reliability",
        "check": "Request timeout set",
        "status": "pass" if config.get("timeout_seconds", 0) > 0 else "FAIL",
        "priority": "high"
    })
    checks.append({
        "category": "Reliability",
        "check": "Output format validation",
        "status": "pass" if config.get("output_validation") else "warn",
        "priority": "high"
    })

    # Quality checks
    checks.append({
        "category": "Quality",
        "check": "Evaluation test suite",
        "status": "pass" if config.get("eval_test_count", 0) >= 10 else "warn",
        "priority": "medium"
    })
    checks.append({
        "category": "Quality",
        "check": "Prompt versioning",
        "status": "pass" if config.get("prompt_versioning") else "warn",
        "priority": "medium"
    })

    # Cost checks
    checks.append({
        "category": "Cost",
        "check": "Cost tracking enabled",
        "status": "pass" if config.get("cost_tracking") else "warn",
        "priority": "medium"
    })
    checks.append({
        "category": "Cost",
        "check": "Model routing (cheap model for simple tasks)",
        "status": "pass" if config.get("model_routing") else "warn",
        "priority": "medium"
    })
    checks.append({
        "category": "Cost",
        "check": "Budget alerts configured",
        "status": "pass" if config.get("budget_alert_usd") else "warn",
        "priority": "low"
    })

    # Summary
    failures = [c for c in checks if c["status"] == "FAIL"]
    warnings = [c for c in checks if c["status"] == "warn"]
    passed = [c for c in checks if c["status"] == "pass"]

    return {
        "ready_for_production": len(failures) == 0,
        "passed": len(passed),
        "warnings": len(warnings),
        "failures": len(failures),
        "total_checks": len(checks),
        "critical_failures": [c for c in failures if c["priority"] == "critical"],
        "details": checks
    }


# Usage
result = validate_production_readiness({
    "guardrails_enabled": True,
    "pii_redaction": True,
    "rate_limit_per_user": 100,
    "fallback_models": ["claude-sonnet", "gpt-4o-mini"],
    "timeout_seconds": 30,
    "output_validation": True,
    "eval_test_count": 15,
    "prompt_versioning": True,
    "cost_tracking": True,
    "model_routing": True,
    "budget_alert_usd": 500
})

print(f"Production ready: {result['ready_for_production']}")
print(f"Score: {result['passed']}/{result['total_checks']} passed, {result['warnings']} warnings")

Frequently Asked Questions

Which LLM provider should I use for production?

Use multiple providers. Configure OpenAI as your primary and Anthropic as your fallback (or vice versa). For simple tasks, use the cheapest option that meets your quality bar (usually GPT-4o-mini or Claude Haiku). For complex reasoning, use GPT-4o or Claude Sonnet. The LLM gateway pattern from Lesson 3 makes it easy to switch between providers without changing application code.
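The multi-provider fallback described above reduces to a short loop inside the gateway. Each entry in `providers` is assumed to be a callable wrapping one provider's SDK; error handling here is deliberately coarse for illustration.

```python
def call_with_fallback(providers, prompt):
    """Try each provider callable in order; raise only if all fail."""
    last_err = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as err:  # in practice, catch provider-specific errors
            last_err = err
    raise RuntimeError("all providers failed") from last_err
```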

How do I handle LLM hallucinations in production?

Three strategies: (1) Use RAG to ground responses in your actual data so the model has source material to reference. (2) Add explicit instructions in your system prompt: "If you are not confident in the answer, say so. Never make up facts." (3) Build a factuality checker in your output guardrails that compares the response against known facts. The combination of all three reduces hallucination rates from 30-40% to under 5% in most use cases.

What is a realistic LLM budget for a startup?

For a startup with 1,000-10,000 daily active users, expect to spend $500-$3,000/month on LLM APIs before optimization. After implementing model routing and semantic caching, this typically drops to $200-$1,200/month. The key is to track cost per user and cost per feature from day one. If your cost per user exceeds $0.10/month, you need to optimize before scaling.

Should I use LangChain, LlamaIndex, or build my own?

For prototyping, frameworks like LangChain and LlamaIndex speed up development significantly. For production, most teams end up replacing framework components with custom code because frameworks add abstraction layers that make debugging harder and do not match your exact requirements. A good middle ground: use frameworks for prototyping, then gradually replace components with your own code as you understand your production needs. The patterns in this course (gateway, guardrails, memory) are the components most teams build custom.

How do I handle model version changes (e.g., GPT-4o updates)?

Pin your model version explicitly (e.g., "gpt-4o-2024-11-20" instead of "gpt-4o"). When a new version is released: (1) Run your regression test suite against the new version. (2) If all tests pass, run a 10% A/B test in production. (3) Monitor quality metrics for 48 hours. (4) If metrics hold, roll out to 100%. Never auto-upgrade model versions in production without testing.

How do I make my LLM application faster?

Four techniques ordered by impact: (1) Semantic caching — instant responses for cached queries (0ms vs 2-5 seconds). (2) Streaming — show partial responses as they are generated, reducing perceived latency by 60-80%. (3) Prompt compression — shorter prompts process faster. Remove filler words and compress system prompts. (4) Smaller models — GPT-4o-mini is 2-3x faster than GPT-4o for simple tasks. Combine all four and you can get median response times under 1 second.
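Technique (1), semantic caching, can be sketched as a toy lookup over embeddings. The `embed` function is pluggable and assumed; real systems use a sentence-embedding model and an approximate-nearest-neighbor index rather than the linear scan shown here, and the 0.9 threshold is an arbitrary starting point.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Toy semantic cache: return a stored response when a new query
    embeds close enough to a previously seen one."""
    def __init__(self, embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (vector, response)

    def get(self, query):
        qv = self.embed(query)
        for vec, response in self.entries:
            if cosine(qv, vec) >= self.threshold:
                return response
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```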

Do I need to fine-tune a model for my use case?

Probably not. 90% of production LLM applications use prompt engineering + RAG, not fine-tuning. Fine-tune only when: (1) You need the model to adopt a very specific output format or style that prompt engineering cannot achieve. (2) You need to reduce latency and cost by using a smaller fine-tuned model instead of a large general model. (3) You have 500+ high-quality training examples. Start with the prompt management system from Lesson 2 and exhaust prompt engineering before considering fine-tuning.

What is the biggest mistake teams make with production LLM apps?

Going to production without guardrails and evaluation. Teams build a demo that works on 10 test cases, ship it, and then spend months firefighting prompt injections, hallucinations, and cost overruns. The second biggest mistake is treating the LLM call as the entire application. The infrastructure around the LLM call — guardrails, prompt management, gateway, memory, evaluation — is what makes the difference between a demo and a product. Budget 70% of your development time for the infrastructure, not the LLM call.

Course Summary

You now have the complete toolkit for building production LLM applications:

Architecture (Lesson 1)

Five-layer production stack: guardrails, prompt management, gateway, output processing, memory. Four architecture patterns: chain, RAG, agent, multi-agent.

Prompt Management (Lesson 2)

Prompt registry with versioning, A/B testing, few-shot management, and dynamic construction. Decouple prompt iteration from code deployment.

LLM Gateway (Lesson 3)

Multi-provider routing, fallback chains, rate limiting, cost tracking, and semantic caching. The single point through which all LLM calls flow.

Guardrails (Lesson 4)

Prompt injection detection, PII redaction, content policy enforcement, and output validation. Defense in depth for LLM applications.

Memory (Lesson 5)

Buffer, summary, and vector memory patterns. Long-term user memory. Session management at scale.

Evaluation (Lesson 6)

LLM-as-judge, human evaluation workflows, regression testing, and CI/CD integration. Measure quality before shipping.

Cost Optimization (Lesson 7)

Model routing (50-70% savings), semantic caching (40-60%), token optimization (20-30%), batch processing (30-50%). Real cost breakdowns.

Best Practices (Lesson 8)

Production checklist, common failure modes, debugging flowchart, and comprehensive FAQ. Your go-to reference for shipping LLM features.

💡
What to do next: Pick one LLM feature you are building at work. Use the production checklist to audit it. Implement the top three missing items. Then share this course with your team so everyone is building on the same foundation.