LLM Application Architecture
Most LLM tutorials show you how to call an API. This lesson shows you how to build a production system around that API call. You will learn the core components every LLM application needs, how to choose between building and buying, and the architecture patterns that separate demo code from production code.
The Production LLM Stack
Every production LLM application has the same five layers, whether it is a customer support chatbot or a code generation tool. The difference between a weekend project and a production system is how many of these layers you actually implement.
# Production LLM Application Stack
#
# User Request
# |
# [1. Guardrails Layer] -- Input validation, PII filtering, injection detection
# |
# [2. Prompt Management] -- Template selection, variable injection, version control
# |
# [3. LLM Gateway] -- Provider routing, rate limiting, fallbacks, caching
# |
# [4. Output Processing] -- Format validation, factuality checks, content policy
# |
# [5. Memory & State] -- Conversation history, user context, session management
# |
# Response to User
We will cover each of these layers in detail in subsequent lessons. For now, let us understand what each layer does and why you need it.
Layer 1: Guardrails
The guardrails layer sits between your users and your LLM. It validates inputs before they reach the model and validates outputs before they reach the user. Without it, you are one creative prompt away from your chatbot leaking your system prompt, generating harmful content, or returning PII from your training data.
- Input guardrails: Prompt injection detection, PII redaction, topic restriction, input length limits
- Output guardrails: Content policy enforcement, format validation, factuality checks, PII detection in responses
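To make the input side concrete, here is a minimal sketch of what an input guardrail might look like. The patterns, limits, and the `check_input` helper are illustrative assumptions, not a complete defense; production systems typically combine checks like these with a dedicated moderation model.

```python
import re

# Illustrative limits and patterns -- tune these for your domain.
MAX_INPUT_CHARS = 4000
INJECTION_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"reveal your system prompt",
]
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def check_input(text: str) -> str:
    """Validate and sanitize user input before it reaches the model."""
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            raise ValueError("possible prompt injection")
    # Redact email addresses so PII never enters the prompt
    return EMAIL_RE.sub("[EMAIL]", text)
```

Keyword lists like this catch only the crudest injections, but they illustrate the shape of the layer: reject or sanitize before the model ever sees the request.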
Layer 2: Prompt Management
Hard-coded prompts in your source code are the equivalent of hard-coded SQL queries. They work for prototypes but become unmaintainable at scale. A prompt management system lets you version, test, and update prompts without deploying code.
- Prompt templates: Parameterized prompts with variable injection
- Version control: Track prompt changes with rollback capability
- A/B testing: Compare prompt variants in production
- Few-shot management: Curate and rotate example sets
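The core of such a system is small: a registry that maps prompt names to versioned templates. This is a minimal sketch (the `PromptRegistry` class and its interface are assumptions for illustration, not the full implementation covered in the next lesson):

```python
class PromptRegistry:
    """Versioned prompt store: register templates, render by name and version."""

    def __init__(self):
        self._prompts = {}  # name -> {version: template string}

    def register(self, name: str, template: str, version: int):
        self._prompts.setdefault(name, {})[version] = template

    def render(self, name: str, version: int = None, **variables) -> str:
        versions = self._prompts[name]
        # Default to the latest version; pass version= explicitly to roll back
        version = version if version is not None else max(versions)
        return versions[version].format(**variables)
```

Because prompts live in data rather than code, updating or rolling back a prompt becomes a registry operation instead of a deployment.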
Layer 3: LLM Gateway
The LLM gateway is the central point through which all LLM calls flow. It handles provider routing, rate limiting, cost tracking, and caching. Without it, you are locked into a single provider with no fallbacks and no visibility into costs.
- Multi-provider routing: Route requests to OpenAI, Anthropic, or local models based on task requirements
- Fallback chains: Automatically switch providers when one is down
- Rate limiting: Protect against quota exhaustion and cost spikes
- Semantic caching: Cache responses for semantically similar queries
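The fallback behavior is the heart of the gateway. As a sketch, assuming each provider is wrapped in a callable with a uniform signature (the `call_with_fallback` helper and provider wrappers are illustrative):

```python
def call_with_fallback(prompt: str, providers: list):
    """Try each (name, callable) provider in order; return the first success.

    `providers` is a list of (name, call) pairs where call(prompt) -> str.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Record the failure and move on to the next provider
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```

A real gateway would add timeouts, retry budgets, and cost accounting per provider, but the routing skeleton stays this simple.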
Layer 4: Output Processing
Raw LLM outputs need post-processing before they are safe to show users. This layer validates response format, checks for policy violations, and transforms outputs into the structure your application expects.
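A common case is validating JSON output. Models sometimes wrap JSON in markdown fences or omit required fields, so the processing layer normalizes and checks before the application consumes it. A minimal sketch (the `parse_json_output` helper is an assumption for illustration):

```python
import json
import re

def parse_json_output(raw: str, required_keys: list) -> dict:
    """Parse a model's JSON response and verify the expected keys exist."""
    # Strip markdown code fences the model sometimes wraps around JSON
    raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    data = json.loads(raw)
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"model output missing keys: {missing}")
    return data
```

On failure, a production pipeline would typically retry the LLM call with the error message appended rather than surface the broken output to the user.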
Layer 5: Memory & State
LLMs are stateless. Every API call starts from scratch. Memory systems maintain conversation context, user preferences, and session state across interactions.
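The simplest useful memory is a sliding window over recent turns, so the prompt stays within the context budget. A sketch, assuming a message format like the chat APIs use (the `ConversationMemory` class is illustrative; real systems add summarization and long-term stores on top):

```python
class ConversationMemory:
    """Keep only the last N turns of a conversation."""

    def __init__(self, max_turns: int = 10):
        self.max_turns = max_turns
        self.turns = []

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})
        # Drop the oldest turns once we exceed the window
        self.turns = self.turns[-self.max_turns:]

    def as_messages(self) -> list:
        """Return the window in the shape chat APIs expect."""
        return list(self.turns)
```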
Build vs API: The Decision Framework
The first architectural decision you face is what to build yourself versus what to use off-the-shelf. Here is a practical framework:
| Component | Build When | Buy/Use API When |
|---|---|---|
| LLM itself | You need data privacy, offline inference, or extreme customization | Almost always — OpenAI, Anthropic, Google APIs are far cheaper than hosting |
| Prompt management | You have 10+ prompts in production or need A/B testing | You have fewer than 5 prompts and they rarely change |
| Guardrails | You have domain-specific safety requirements | Generic content moderation is sufficient (use OpenAI moderation API) |
| Gateway | You use multiple providers or need fine-grained cost tracking | You only use one provider and cost is not a concern |
| Memory | Always build this — no off-the-shelf solution fits every use case | — |
Model Selection Framework
Choosing the right model is not about picking "the best one." It is about matching model capabilities to task requirements while staying within your cost budget.
# Model Selection Decision Tree (Python)
def select_model(task):
    """Route tasks to the right model based on requirements."""
    # Step 1: Hard constraints first. If data cannot leave your infra,
    # no hosted model qualifies, whatever the other requirements are.
    if task.requires_data_privacy:
        return "llama-3-70b"  # Self-hosted, no data leaves your infra
    # Step 2: Does it need complex reasoning?
    if task.requires_complex_reasoning:
        if task.budget_per_call > 0.05:
            return "claude-sonnet-4-20250514"  # Best reasoning
        return "gpt-4o"  # Good reasoning, lower cost
    # Step 3: Is it a simple task?
    if task.is_classification or task.is_extraction:
        if task.latency_requirement_ms < 500:
            return "gpt-4o-mini"  # Fast, cheap, good enough
        return "claude-haiku"  # Even cheaper
    # Step 4: Does it need large context?
    if task.input_tokens > 100_000:
        return "gemini-2.0-pro"  # 1M+ context window
    # Default: best quality-to-cost ratio
    return "gpt-4o"
Architecture Patterns
There are four main architecture patterns for LLM applications, each suited to different complexity levels:
Pattern 1: Simple Chain
A linear sequence of LLM calls where the output of one step feeds into the next. Use this for straightforward tasks like summarize-then-translate or extract-then-classify.
# Simple Chain: Summarize then Extract
import json

def process_document(document: str) -> dict:
    # Step 1: Summarize
    summary = llm.complete(
        prompt=f"Summarize this document in 3 sentences:\n{document}",
        model="gpt-4o-mini"
    )
    # Step 2: Extract key entities from the summary
    entities = llm.complete(
        prompt=f"Extract company names, dates, and amounts from:\n{summary}",
        model="gpt-4o-mini",
        response_format={"type": "json_object"}
    )
    return {"summary": summary, "entities": json.loads(entities)}
Pattern 2: RAG (Retrieval-Augmented Generation)
Augment the LLM's knowledge by retrieving relevant documents from a vector database before generating a response. Use this when the LLM needs access to your private data.
# RAG Pattern
def answer_question(question: str) -> str:
    # Retrieve relevant context
    docs = vector_db.search(embed(question), top_k=5)
    context = "\n".join(doc.text for doc in docs)
    # Generate with context
    return llm.complete(
        prompt=f"Context:\n{context}\n\nQuestion: {question}\nAnswer:",
        model="gpt-4o"
    )
Pattern 3: Agent with Tools
Give the LLM access to tools (APIs, databases, calculators) and let it decide which tools to use to accomplish a task. Use this when the task requires real-world actions or dynamic decision-making.
# Agent Pattern with Tool Use
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "search_database",
        "description": "Search the product database by query",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "limit": {"type": "integer", "default": 5}
            },
            "required": ["query"]
        }
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                "description": {"type": "string"}
            },
            "required": ["title", "priority", "description"]
        }
    }
]

def run_agent(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )
        # If the model wants to use a tool, execute it and feed back the result
        if response.stop_reason == "tool_use":
            tool_call = next(b for b in response.content if b.type == "tool_use")
            result = execute_tool(tool_call.name, tool_call.input)
            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": [{"type": "tool_result", "tool_use_id": tool_call.id, "content": result}]
            })
        else:
            # Model is done, return the text response
            return next(b for b in response.content if b.type == "text").text
Pattern 4: Multi-Agent System
Multiple specialized agents collaborate to solve complex tasks. Each agent has its own system prompt, tools, and model. Use this for complex workflows like research analysis, code generation with review, or customer support with escalation.
# Multi-Agent Pattern
class Agent:
    def __init__(self, name, system_prompt, model, tools=None):
        self.name = name
        self.system_prompt = system_prompt
        self.model = model
        self.tools = tools or []

    def run(self, message: str) -> str:
        return llm.complete(
            system=self.system_prompt,
            prompt=message,
            model=self.model,
            tools=self.tools
        )

# Define specialized agents
researcher = Agent(
    name="researcher",
    system_prompt="You research topics thoroughly using search tools.",
    model="gpt-4o",
    tools=[web_search, doc_search]
)

writer = Agent(
    name="writer",
    system_prompt="You write clear, concise reports from research notes.",
    model="claude-sonnet-4-20250514"
)

reviewer = Agent(
    name="reviewer",
    system_prompt="You review reports for accuracy, clarity, and completeness.",
    model="gpt-4o-mini"
)

# Orchestrate the agents
def generate_report(topic: str) -> str:
    research = researcher.run(f"Research this topic: {topic}")
    draft = writer.run(f"Write a report based on this research:\n{research}")
    feedback = reviewer.run(f"Review this report:\n{draft}")
    final = writer.run(f"Revise the report based on feedback:\n{draft}\n\nFeedback:\n{feedback}")
    return final
Key Takeaways
- Production LLM apps have five layers: guardrails, prompt management, LLM gateway, output processing, and memory. Most failures come from skipping layers.
- Use the build-vs-buy framework to decide what to build yourself. Build memory always; build guardrails and prompt management when you have domain-specific needs.
- Use 2-3 models in production: cheap models for simple tasks, expensive models for reasoning. Tiered routing like this typically cuts inference costs by half or more compared to sending every request to your most capable model.
- Start with simple chains. Only move to agents or multi-agent patterns when the task requires dynamic decision-making.
- The difference between demo code and production code is the infrastructure around the LLM call, not the LLM call itself.
What Is Next
In the next lesson, we will build the prompt management system — the layer that versions, templates, and A/B tests your prompts. You will get a complete Python implementation of a prompt registry with version control and dynamic prompt construction.
Lilly Tech Systems