Generation Pipeline Design
The generation layer is where retrieval results become user-facing answers. A well-designed generation pipeline handles prompt construction, source citation, hallucination prevention, streaming, and context window management — all of which directly impact user trust and system reliability.
Prompt Engineering for RAG
RAG prompts are different from general-purpose prompts. The key principle is: constrain the model to answer only from the provided context. Without this constraint, the model will happily blend retrieved facts with its own parametric knowledge, producing confident-sounding answers that are partially wrong.
Production RAG System Prompt
SYSTEM_PROMPT = """You are a helpful assistant that answers questions based on
the provided context documents. Follow these rules strictly:
1. ONLY use information from the provided context to answer.
2. If the context does not contain enough information to answer,
say "I don't have enough information to answer this question."
3. NEVER make up information that is not in the context.
4. Cite your sources using [Source: filename, page X] format.
5. If multiple sources contain relevant information, synthesize
them and cite each source.
6. Keep answers concise and actionable.
Context documents:
{context}
"""
USER_PROMPT = """{question}"""
Context Formatting Strategies
How you format the retrieved chunks in the prompt affects generation quality:
def format_context_numbered(chunks: list[dict]) -> str:
    """Number each source for easy citation."""
    formatted = []
    for i, chunk in enumerate(chunks, 1):
        source = chunk["metadata"].get("source", "Unknown")
        page = chunk["metadata"].get("page", "N/A")
        formatted.append(
            f"[Source {i}: {source}, Page {page}]\n{chunk['text']}\n"
        )
    return "\n---\n".join(formatted)
# Output:
# [Source 1: user-guide.pdf, Page 12]
# To reset your password, navigate to Settings > Security > Change Password...
# ---
# [Source 2: faq.md, Page 3]
# Common password issues: If you cannot reset your password, contact support...
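Numbered labels are one strategy; another common one is wrapping each chunk in XML-style tags, which some models follow more reliably over long contexts. A possible variant, assuming the same chunk dict shape as above (the `format_context_tagged` name is ours):

```python
def format_context_tagged(chunks: list[dict]) -> str:
    """Wrap each chunk in XML-style <document> tags instead of numbered labels."""
    parts = []
    for i, chunk in enumerate(chunks, 1):
        source = chunk["metadata"].get("source", "Unknown")
        parts.append(
            f'<document index="{i}" source="{source}">\n{chunk["text"]}\n</document>'
        )
    return "\n".join(parts)
```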
Citation and Source Attribution
Users need to verify AI-generated answers. Source attribution builds trust and enables fact-checking. There are two complementary approaches:
Inline Citations
# Prompt the model to cite inline
SYSTEM_PROMPT = """...
When answering, cite sources inline using [1], [2], etc.
At the end of your answer, list the full source references.
..."""
# Model output:
# "The refund window is 30 days from purchase [1]. After that,
# a prorated refund is available for annual plans [2].
#
# Sources:
# [1] refund-policy.pdf, Page 2
# [2] enterprise-terms.pdf, Page 8"
Post-Processing Citation Extraction
import re

def extract_and_verify_citations(answer: str, sources: list[dict]) -> dict:
    """Extract citations and verify they reference real sources."""
    citation_pattern = r'\[(\d+)\]'
    cited_indices = set(int(m) for m in re.findall(citation_pattern, answer))
    verified = []
    for idx in cited_indices:
        if 1 <= idx <= len(sources):
            verified.append({
                "index": idx,
                "source": sources[idx - 1]["metadata"]["source"],
                "page": sources[idx - 1]["metadata"].get("page"),
                "verified": True
            })
        else:
            verified.append({"index": idx, "verified": False})
    return {
        "answer": answer,
        "citations": verified,
        "has_unverified": any(not c["verified"] for c in verified)
    }
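As a condensed, standalone illustration of the same check (the helper name here is hypothetical), a citation index outside the retrieved-source range gets flagged:

```python
import re

def verify_citation_indices(answer: str, num_sources: int) -> dict:
    """Flag any [n] citation whose index falls outside 1..num_sources."""
    cited = sorted(set(int(m) for m in re.findall(r"\[(\d+)\]", answer)))
    bad = [i for i in cited if not 1 <= i <= num_sources]
    return {"cited": cited, "unverified": bad, "has_unverified": bool(bad)}

# A model citing [3] when only two sources were retrieved:
result = verify_citation_indices("Refunds take 30 days [1]; see also [3].", num_sources=2)
```

An unverified citation is a strong hallucination signal: the model invented a source that was never retrieved.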
Hallucination Detection and Prevention
Even with good prompts, LLMs occasionally generate claims not supported by the retrieved context. Here are practical detection strategies:
Strategy 1: Self-Consistency Check
def check_faithfulness(answer: str, context: str, llm) -> dict:
    """Ask the LLM to verify its own answer against the context."""
    verification_prompt = f"""Given the following context and answer,
identify any claims in the answer that are NOT supported by the context.

Context: {context}

Answer: {answer}

For each claim in the answer, state whether it is:
- SUPPORTED: directly stated or clearly implied by the context
- NOT SUPPORTED: not found in the context
- CONTRADICTED: conflicts with the context

Output as JSON array."""
    result = llm.invoke(verification_prompt)
    return parse_verification(result)
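The `parse_verification` helper called above is not defined in this section. A minimal sketch, assuming the verifier returns a JSON array of objects with `claim` and `verdict` keys (possibly wrapped in a markdown code fence):

```python
import json
import re

def parse_verification(raw: str) -> dict:
    """Parse the verifier's JSON array into a faithfulness summary.

    Assumes items shaped like {"claim": ..., "verdict": "SUPPORTED" | ...};
    strips a markdown code fence if the model added one.
    """
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        claims = json.loads(cleaned)
    except json.JSONDecodeError:
        return {"claims": [], "faithful": False, "parse_error": True}
    unsupported = [c for c in claims if c.get("verdict") != "SUPPORTED"]
    return {
        "claims": claims,
        "faithful": not unsupported,
        "unsupported": unsupported,
        "parse_error": False,
    }
```

Treating a parse failure as "not faithful" is a deliberately conservative default: an unparseable verdict should never pass the check silently.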
Strategy 2: Confidence Scoring
import json

def generate_with_confidence(question: str, context: str, llm) -> dict:
    """Generate answer with a confidence score."""
    prompt = f"""Based on the provided context, answer the question.
Also rate your confidence that the answer is fully supported by the context.

Context: {context}

Question: {question}

Respond in JSON format:
{{"answer": "your answer", "confidence": 0.0-1.0, "reasoning": "why this confidence level"}}
"""
    result = llm.invoke(prompt)
    parsed = json.loads(result)
    # Flag low-confidence answers for human review
    if parsed["confidence"] < 0.7:
        parsed["needs_review"] = True
        parsed["answer"] += "\n\n⚠ This answer may not be fully supported by available documents."
    return parsed
Streaming Responses
Users expect fast responses. Streaming the LLM output token-by-token gives a perceived latency of ~200ms (time to first token) instead of 2–5 seconds (time to full response).
from openai import OpenAI

client = OpenAI()

def stream_rag_response(question: str, context: str):
    """Stream the RAG response token by token."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n{context}"},
            {"role": "user", "content": question}
        ],
        stream=True
    )
    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            full_response += token
            yield token  # Send to client immediately
    # After streaming completes, run citation verification
    # on the full response (async, non-blocking)

# FastAPI streaming endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/ask")
async def ask(question: str):
    context = retrieve_context(question)  # Your retrieval pipeline
    return StreamingResponse(
        stream_rag_response(question, context),
        media_type="text/plain"
    )
Context Window Management
Every LLM has a finite context window. When your retrieved chunks exceed this limit, you need a strategy to fit the most relevant content into the available space.
Token Budget Allocation
import tiktoken

def allocate_context(
    question: str,
    chunks: list[dict],
    model: str = "gpt-4o",
    max_context_tokens: int = 12000,  # Reserve space for system prompt + answer
) -> list[dict]:
    """Select chunks that fit within the token budget."""
    enc = tiktoken.encoding_for_model(model)
    selected = []
    used_tokens = len(enc.encode(question)) + 500  # 500 for system prompt overhead
    for chunk in chunks:  # Assumes chunks are already ranked by relevance
        chunk_tokens = len(enc.encode(chunk["text"]))
        if used_tokens + chunk_tokens <= max_context_tokens:
            selected.append(chunk)
            used_tokens += chunk_tokens
        else:
            break  # Stop adding chunks when budget is exhausted
    return selected
# Example token budget for GPT-4o (128K context):
# System prompt: ~500 tokens
# Retrieved context: ~12,000 tokens (5-10 chunks)
# User question: ~100 tokens
# Reserved for answer: ~4,000 tokens
# Total used: ~16,600 tokens (well within 128K)
Key Takeaways
- RAG system prompts must explicitly constrain the model to answer only from provided context.
- Format retrieved chunks with numbered source labels to enable reliable inline citations.
- Layer multiple hallucination defenses: prompt constraints, self-consistency checks, and confidence scoring.
- Stream responses to cut perceived latency from seconds to a few hundred milliseconds (time to first token).
- Keep context to 10–15K tokens even with large context windows — more relevant content beats more content.
Lilly Tech Systems