LLM Application Architecture
Most LLM tutorials show you how to call an API. This lesson shows you how to build a production system around that API call. You will learn the core components every LLM application needs, how to choose between building and buying, and the architecture patterns that separate demo code from production code.
The Production LLM Stack
Every production LLM application has the same five layers, whether it is a customer support chatbot or a code generation tool. The difference between a weekend project and a production system is how many of these layers you actually implement.
# Production LLM Application Stack
#
# User Request
# |
# [1. Guardrails Layer] -- Input validation, PII filtering, injection detection
# |
# [2. Prompt Management] -- Template selection, variable injection, version control
# |
# [3. LLM Gateway] -- Provider routing, rate limiting, fallbacks, caching
# |
# [4. Output Processing] -- Format validation, factuality checks, content policy
# |
# [5. Memory & State] -- Conversation history, user context, session management
# |
# Response to User
We will cover each of these layers in detail in subsequent lessons. For now, let us understand what each layer does and why you need it.
Layer 1: Guardrails
The guardrails layer sits between your users and your LLM. It validates inputs before they reach the model and validates outputs before they reach the user. Without it, you are one creative prompt away from your chatbot leaking your system prompt, generating harmful content, or returning PII from your training data.
- Input guardrails: Prompt injection detection, PII redaction, topic restriction, input length limits
- Output guardrails: Content policy enforcement, format validation, factuality checks, PII detection in responses
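To make the input side concrete, here is a minimal sketch of what an input guardrail might look like. The patterns, limits, and the `check_input` helper are illustrative assumptions, not a complete defense; production systems typically combine checks like these with a dedicated moderation model.

```python
import re

# Illustrative limits and patterns -- tune these for your domain.
MAX_INPUT_CHARS = 4000
INJECTION_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"reveal your system prompt",
]
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def check_input(text: str) -> str:
    """Validate and sanitize user input before it reaches the model."""
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            raise ValueError("possible prompt injection")
    # Redact email addresses so PII never enters the prompt
    return EMAIL_RE.sub("[EMAIL]", text)
```

Keyword lists like this catch only the crudest injections, but they illustrate the shape of the layer: reject or sanitize before the model ever sees the request.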
Layer 2: Prompt Management
Hard-coded prompts in your source code are the equivalent of hard-coded SQL queries. They work for prototypes but become unmaintainable at scale. A prompt management system lets you version, test, and update prompts without deploying code.
- Prompt templates: Parameterized prompts with variable injection
- Version control: Track prompt changes with rollback capability
- A/B testing: Compare prompt variants in production
- Few-shot management: Curate and rotate example sets
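The core of such a system is small: a registry that maps prompt names to versioned templates. This is a minimal sketch (the `PromptRegistry` class and its interface are assumptions for illustration, not the full implementation covered in the next lesson):

```python
class PromptRegistry:
    """Versioned prompt store: register templates, render by name and version."""

    def __init__(self):
        self._prompts = {}  # name -> {version: template string}

    def register(self, name: str, template: str, version: int):
        self._prompts.setdefault(name, {})[version] = template

    def render(self, name: str, version: int = None, **variables) -> str:
        versions = self._prompts[name]
        # Default to the latest version; pass version= explicitly to roll back
        version = version if version is not None else max(versions)
        return versions[version].format(**variables)
```

Because prompts live in data rather than code, updating or rolling back a prompt becomes a registry operation instead of a deployment.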
Layer 3: LLM Gateway
The LLM gateway is the central point through which all LLM calls flow. It handles provider routing, rate limiting, cost tracking, and caching. Without it, you are locked into a single provider with no fallbacks and no visibility into costs.
- Multi-provider routing: Route requests to OpenAI, Anthropic, or local models based on task requirements
- Fallback chains: Automatically switch providers when one is down
- Rate limiting: Protect against quota exhaustion and cost spikes
- Semantic caching: Cache responses for semantically similar queries
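The fallback behavior is the heart of the gateway. As a sketch, assuming each provider is wrapped in a callable with a uniform signature (the `call_with_fallback` helper and provider wrappers are illustrative):

```python
def call_with_fallback(prompt: str, providers: list):
    """Try each (name, callable) provider in order; return the first success.

    `providers` is a list of (name, call) pairs where call(prompt) -> str.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Record the failure and move on to the next provider
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```

A real gateway would add timeouts, retry budgets, and cost accounting per provider, but the routing skeleton stays this simple.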
Layer 4: Output Processing
Raw LLM outputs need post-processing before they are safe to show users. This layer validates response format, checks for policy violations, and transforms outputs into the structure your application expects.
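A common case is validating JSON output. Models sometimes wrap JSON in markdown fences or omit required fields, so the processing layer normalizes and checks before the application consumes it. A minimal sketch (the `parse_json_output` helper is an assumption for illustration):

```python
import json
import re

def parse_json_output(raw: str, required_keys: list) -> dict:
    """Parse a model's JSON response and verify the expected keys exist."""
    # Strip markdown code fences the model sometimes wraps around JSON
    raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    data = json.loads(raw)
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"model output missing keys: {missing}")
    return data
```

On failure, a production pipeline would typically retry the LLM call with the error message appended rather than surface the broken output to the user.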
Layer 5: Memory & State
LLMs are stateless. Every API call starts from scratch. Memory systems maintain conversation context, user preferences, and session state across interactions.
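The simplest useful memory is a sliding window over recent turns, so the prompt stays within the context budget. A sketch, assuming a message format like the chat APIs use (the `ConversationMemory` class is illustrative; real systems add summarization and long-term stores on top):

```python
class ConversationMemory:
    """Keep only the last N turns of a conversation."""

    def __init__(self, max_turns: int = 10):
        self.max_turns = max_turns
        self.turns = []

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})
        # Drop the oldest turns once we exceed the window
        self.turns = self.turns[-self.max_turns:]

    def as_messages(self) -> list:
        """Return the window in the shape chat APIs expect."""
        return list(self.turns)
```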
Build vs API: The Decision Framework
The first architectural decision you face is what to build yourself versus what to use off-the-shelf. Here is a practical framework:
| Component | Build When | Buy/Use API When |
|---|---|---|
| LLM itself | You need data privacy, offline inference, or extreme customization | Almost always — OpenAI, Anthropic, Google APIs are far cheaper than hosting |
| Prompt management | You have 10+ prompts in production or need A/B testing | You have fewer than 5 prompts and they rarely change |
| Guardrails | You have domain-specific safety requirements | Generic content moderation is sufficient (use OpenAI moderation API) |
| Gateway | You use multiple providers or need fine-grained cost tracking | You only use one provider and cost is not a concern |
| Memory | Always build this — no off-the-shelf solution fits every use case | — |
Model Selection Framework
Choosing the right model is not about picking "the best one." It is about matching model capabilities to task requirements while staying within your cost budget.
# Model Selection Decision Tree (Python)
def select_model(task):
    """Route tasks to the right model based on requirements."""
    # Step 1: Hard constraints first. If data cannot leave your infra,
    # no hosted model qualifies, whatever the other requirements are.
    if task.requires_data_privacy:
        return "llama-3-70b"  # Self-hosted, no data leaves your infra
    # Step 2: Does it need complex reasoning?
    if task.requires_complex_reasoning:
        if task.budget_per_call > 0.05:
            return "claude-sonnet-4-20250514"  # Best reasoning
        return "gpt-4o"  # Good reasoning, lower cost
    # Step 3: Is it a simple task?
    if task.is_classification or task.is_extraction:
        if task.latency_requirement_ms < 500:
            return "gpt-4o-mini"  # Fast, cheap, good enough
        return "claude-haiku"  # Even cheaper
    # Step 4: Does it need large context?
    if task.input_tokens > 100_000:
        return "gemini-2.0-pro"  # 1M+ context window
    # Default: best quality-to-cost ratio
    return "gpt-4o"
Architecture Patterns
There are four main architecture patterns for LLM applications, each suited to different complexity levels:
Pattern 1: Simple Chain
A linear sequence of LLM calls where the output of one step feeds into the next. Use this for straightforward tasks like summarize-then-translate or extract-then-classify.
# Simple Chain: Summarize then Extract
import json

def process_document(document: str) -> dict:
    # Step 1: Summarize
    summary = llm.complete(
        prompt=f"Summarize this document in 3 sentences:\n{document}",
        model="gpt-4o-mini"
    )
    # Step 2: Extract key entities from the summary
    entities = llm.complete(
        prompt=f"Extract company names, dates, and amounts from:\n{summary}",
        model="gpt-4o-mini",
        response_format={"type": "json_object"}
    )
    return {"summary": summary, "entities": json.loads(entities)}
Pattern 2: RAG (Retrieval-Augmented Generation)
Augment the LLM's knowledge by retrieving relevant documents from a vector database before generating a response. Use this when the LLM needs access to your private data.
# RAG Pattern
def answer_question(question: str) -> str:
    # Retrieve relevant context
    docs = vector_db.search(embed(question), top_k=5)
    context = "\n".join(doc.text for doc in docs)
    # Generate with context
    return llm.complete(
        prompt=f"Context:\n{context}\n\nQuestion: {question}\nAnswer:",
        model="gpt-4o"
    )
Pattern 3: Agent with Tools
Give the LLM access to tools (APIs, databases, calculators) and let it decide which tools to use to accomplish a task. Use this when the task requires real-world actions or dynamic decision-making.
# Agent Pattern with Tool Use
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "search_database",
        "description": "Search the product database by query",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "limit": {"type": "integer", "default": 5}
            },
            "required": ["query"]
        }
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                "description": {"type": "string"}
            },
            "required": ["title", "priority", "description"]
        }
    }
]

def run_agent(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )
        # If the model wants to use a tool, execute it and feed back the result
        if response.stop_reason == "tool_use":
            tool_call = next(b for b in response.content if b.type == "tool_use")
            result = execute_tool(tool_call.name, tool_call.input)
            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": [{"type": "tool_result", "tool_use_id": tool_call.id, "content": result}]
            })
        else:
            # Model is done, return the text response
            return next(b for b in response.content if b.type == "text").text
Pattern 4: Multi-Agent System
Multiple specialized agents collaborate to solve complex tasks. Each agent has its own system prompt, tools, and model. Use this for complex workflows like research analysis, code generation with review, or customer support with escalation.
# Multi-Agent Pattern
class Agent:
    def __init__(self, name, system_prompt, model, tools=None):
        self.name = name
        self.system_prompt = system_prompt
        self.model = model
        self.tools = tools or []

    def run(self, message: str) -> str:
        return llm.complete(
            system=self.system_prompt,
            prompt=message,
            model=self.model,
            tools=self.tools
        )

# Define specialized agents
researcher = Agent(
    name="researcher",
    system_prompt="You research topics thoroughly using search tools.",
    model="gpt-4o",
    tools=[web_search, doc_search]
)

writer = Agent(
    name="writer",
    system_prompt="You write clear, concise reports from research notes.",
    model="claude-sonnet-4-20250514"
)

reviewer = Agent(
    name="reviewer",
    system_prompt="You review reports for accuracy, clarity, and completeness.",
    model="gpt-4o-mini"
)

# Orchestrate the agents
def generate_report(topic: str) -> str:
    research = researcher.run(f"Research this topic: {topic}")
    draft = writer.run(f"Write a report based on this research:\n{research}")
    feedback = reviewer.run(f"Review this report:\n{draft}")
    final = writer.run(f"Revise the report based on feedback:\n{draft}\n\nFeedback:\n{feedback}")
    return final
Key Takeaways
- Production LLM apps have five layers: guardrails, prompt management, LLM gateway, output processing, and memory. Most failures come from skipping layers.
- Use the build-vs-buy framework to decide what to build yourself. Build memory always; build guardrails and prompt management when you have domain-specific needs.
- Use 2-3 models in production: cheap models for simple tasks, expensive models for reasoning. Tiered routing like this typically cuts inference costs by half or more compared to sending every request to your most capable model.
- Start with simple chains. Only move to agents or multi-agent patterns when the task requires dynamic decision-making.
- The difference between demo code and production code is the infrastructure around the LLM call, not the LLM call itself.
What Is Next
In the next lesson, we will build the prompt management system — the layer that versions, templates, and A/B tests your prompts. You will get a complete Python implementation of a prompt registry with version control and dynamic prompt construction.
Lilly Tech Systems