Generation Pipeline Design

The generation layer is where retrieval results become user-facing answers. A well-designed generation pipeline handles prompt construction, source citation, hallucination prevention, streaming, and context window management — all of which directly impact user trust and system reliability.

Prompt Engineering for RAG

RAG prompts are different from general-purpose prompts. The key principle is: constrain the model to answer only from the provided context. Without this constraint, the model will happily blend retrieved facts with its own parametric knowledge, producing confident-sounding answers that are partially wrong.

Production RAG System Prompt

SYSTEM_PROMPT = """You are a helpful assistant that answers questions based on
the provided context documents. Follow these rules strictly:

1. ONLY use information from the provided context to answer.
2. If the context does not contain enough information to answer,
   say "I don't have enough information to answer this question."
3. NEVER make up information that is not in the context.
4. Cite your sources using [Source: filename, page X] format.
5. If multiple sources contain relevant information, synthesize
   them and cite each source.
6. Keep answers concise and actionable.

Context documents:
{context}
"""

USER_PROMPT = """{question}"""
💡
Apply at work: The phrase "I don't have enough information" is critical. Without it, the model will answer every question regardless of whether the retrieved context is relevant. Test your system with questions that are deliberately outside your knowledge base to verify this behavior works.
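
One way to run that test systematically: wrap the refusal check in a small helper and feed deliberately out-of-scope probes through the full pipeline. A minimal sketch (the probe questions are invented examples; `REFUSAL` mirrors the exact phrase from the system prompt above):

```python
REFUSAL = "I don't have enough information to answer this question."

def is_refusal(answer: str) -> bool:
    """True if the model declined to answer, per the system prompt's phrase."""
    return REFUSAL.lower() in answer.lower()

# Deliberately out-of-scope probes: each should produce a refusal
# if the context-only constraint is working.
OUT_OF_SCOPE_PROBES = [
    "What is the weather in Paris today?",
    "Who won the 1998 World Cup?",
]
```

Run each probe through your RAG pipeline and assert `is_refusal(...)` on the output; a non-refusal answer to any of them means the constraint is leaking.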

Context Formatting Strategies

How you format the retrieved chunks in the prompt affects generation quality:

def format_context_numbered(chunks: list[dict]) -> str:
    """Number each source for easy citation."""
    formatted = []
    for i, chunk in enumerate(chunks, 1):
        source = chunk["metadata"].get("source", "Unknown")
        page = chunk["metadata"].get("page", "N/A")
        formatted.append(
            f"[Source {i}: {source}, Page {page}]\n{chunk['text']}\n"
        )
    return "\n---\n".join(formatted)

# Output:
# [Source 1: user-guide.pdf, Page 12]
# To reset your password, navigate to Settings > Security > Change Password...
# ---
# [Source 2: faq.md, Page 3]
# Common password issues: If you cannot reset your password, contact support...
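
Putting the pieces together, the formatted context fills the `{context}` slot of the system prompt. A self-contained sketch (with a trimmed stand-in template and the same formatting logic as `format_context_numbered`; names here are illustrative):

```python
SYSTEM_TEMPLATE = """Answer ONLY from the context documents below.

Context documents:
{context}
"""

def build_messages(question: str, chunks: list[dict]) -> list[dict]:
    """Assemble the chat messages sent to the LLM."""
    parts = []
    for i, chunk in enumerate(chunks, 1):
        source = chunk["metadata"].get("source", "Unknown")
        page = chunk["metadata"].get("page", "N/A")
        parts.append(f"[Source {i}: {source}, Page {page}]\n{chunk['text']}\n")
    context = "\n---\n".join(parts)
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE.format(context=context)},
        {"role": "user", "content": question},
    ]
```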

Citation and Source Attribution

Users need to verify AI-generated answers. Source attribution builds trust and enables fact-checking. There are two approaches:

Inline Citations

# Prompt the model to cite inline
SYSTEM_PROMPT = """...
When answering, cite sources inline using [1], [2], etc.
At the end of your answer, list the full source references.
..."""

# Model output:
# "The refund window is 30 days from purchase [1]. After that,
#  a prorated refund is available for annual plans [2].
#
#  Sources:
#  [1] refund-policy.pdf, Page 2
#  [2] enterprise-terms.pdf, Page 8"

Post-Processing Citation Extraction

import re

def extract_and_verify_citations(answer: str, sources: list[dict]) -> dict:
    """Extract citations and verify they reference real sources."""
    citation_pattern = r'\[(\d+)\]'
    cited_indices = set(int(m) for m in re.findall(citation_pattern, answer))

    verified = []
    for idx in cited_indices:
        if 1 <= idx <= len(sources):
            verified.append({
                "index": idx,
                "source": sources[idx - 1]["metadata"]["source"],
                "page": sources[idx - 1]["metadata"].get("page"),
                "verified": True
            })
        else:
            verified.append({"index": idx, "verified": False})

    return {
        "answer": answer,
        "citations": verified,
        "has_unverified": any(not c["verified"] for c in verified)
    }
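
The regex at the heart of the verifier is easy to sanity-check in isolation. A quick demonstration (the answer text is invented):

```python
import re

def cited_indices(answer: str) -> set[int]:
    """The same [n] extraction used in extract_and_verify_citations."""
    return {int(m) for m in re.findall(r"\[(\d+)\]", answer)}

answer = "The refund window is 30 days [1]. Prorated refunds exist [2], see also [5]."
indices = cited_indices(answer)
# Index 5 has no matching retrieved source, so extract_and_verify_citations
# would flag it as unverified.
```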

Hallucination Detection and Prevention

Even with good prompts, LLMs occasionally generate claims not supported by the retrieved context. Here are practical detection strategies:

Strategy 1: Self-Consistency Check

def check_faithfulness(answer: str, context: str, llm) -> dict:
    """Ask the LLM to verify its own answer against the context."""
    verification_prompt = f"""Given the following context and answer,
identify any claims in the answer that are NOT supported by the context.

Context: {context}

Answer: {answer}

For each claim in the answer, state whether it is:
- SUPPORTED: directly stated or clearly implied by the context
- NOT SUPPORTED: not found in the context
- CONTRADICTED: conflicts with the context

Output as JSON array."""

    result = llm.invoke(verification_prompt)
    return parse_verification(result)
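
`parse_verification` is left undefined above. A minimal sketch, assuming the verifier model returns a JSON array of `{"claim": ..., "verdict": ...}` objects, possibly wrapped in a markdown code fence:

```python
import json

def parse_verification(result: str) -> dict:
    """Parse the verifier's JSON array of per-claim verdicts."""
    cleaned = result.strip()
    if cleaned.startswith("```"):
        # Models often wrap JSON output in a markdown fence; strip it.
        cleaned = cleaned.strip("`").strip()
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):].strip()
    claims = json.loads(cleaned)
    unsupported = [c for c in claims
                   if c.get("verdict") in ("NOT SUPPORTED", "CONTRADICTED")]
    return {
        "claims": claims,
        "unsupported": unsupported,
        "faithful": not unsupported,
    }
```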

Strategy 2: Confidence Scoring

import json

def generate_with_confidence(question: str, context: str, llm) -> dict:
    """Generate answer with a confidence score."""
    prompt = f"""Based on the provided context, answer the question.
Also rate your confidence that the answer is fully supported by the context.

Context: {context}
Question: {question}

Respond in JSON format:
{{"answer": "your answer", "confidence": 0.0-1.0, "reasoning": "why this confidence level"}}
"""
    result = llm.invoke(prompt)
    parsed = json.loads(result)

    # Flag low-confidence answers for human review
    if parsed["confidence"] < 0.7:
        parsed["needs_review"] = True
        parsed["answer"] += "\n\n⚠ This answer may not be fully supported by available documents."

    return parsed
📝
Production reality: No hallucination detection method is 100% accurate. Use multiple strategies: prompt constraints + self-consistency + confidence scoring + automated evaluation metrics (covered in Lesson 6). The goal is to catch 90%+ of hallucinations before they reach users.
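
One way to layer these defenses is a release gate that only lets an answer through when every check passes. A sketch (the threshold and input names are illustrative; wire in the outputs of the functions above):

```python
def release_answer(confidence: float,
                   has_unverified_citation: bool,
                   unsupported_claims: int,
                   threshold: float = 0.7) -> bool:
    """Release the answer to the user only if all hallucination checks pass."""
    return (confidence >= threshold
            and not has_unverified_citation
            and unsupported_claims == 0)
```

Anything that fails the gate goes to the "needs review" path instead of straight to the user.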

Streaming Responses

Users expect fast responses. Streaming the LLM output token-by-token gives a perceived latency of ~200ms (time to first token) instead of 2–5 seconds (time to full response).

from openai import OpenAI

client = OpenAI()

def stream_rag_response(question: str, context: str):
    """Stream the RAG response token by token."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n{context}"},
            {"role": "user", "content": question}
        ],
        stream=True
    )

    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            full_response += token
            yield token  # Send to client immediately

    # After streaming completes, run citation verification here on
    # full_response. Note: StreamingResponse discards a generator's
    # return value, so any post-processing must happen inside this function.
    return full_response

# FastAPI streaming endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/ask")
async def ask(question: str):
    context = retrieve_context(question)  # Your retrieval pipeline
    return StreamingResponse(
        stream_rag_response(question, context),
        media_type="text/plain"
    )

Context Window Management

Every LLM has a finite context window. When your retrieved chunks exceed this limit, you need a strategy to fit the most relevant content into the available space.

Token Budget Allocation

import tiktoken

def allocate_context(
    question: str,
    chunks: list[dict],
    model: str = "gpt-4o",
    max_context_tokens: int = 12000,  # Reserve space for system prompt + answer
) -> list[dict]:
    """Select chunks that fit within the token budget."""
    enc = tiktoken.encoding_for_model(model)

    selected = []
    used_tokens = len(enc.encode(question)) + 500  # 500 for system prompt overhead

    for chunk in chunks:  # Assumes chunks are already ranked by relevance
        chunk_tokens = len(enc.encode(chunk["text"]))
        if used_tokens + chunk_tokens <= max_context_tokens:
            selected.append(chunk)
            used_tokens += chunk_tokens
        else:
            break  # Stop adding chunks when budget is exhausted

    return selected

# Example token budget for GPT-4o (128K context):
# System prompt:     ~500 tokens
# Retrieved context: ~12,000 tokens (5-10 chunks)
# User question:     ~100 tokens
# Reserved for answer: ~4,000 tokens
# Total used:        ~16,600 tokens (well within 128K)
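
If you want to prototype the same budgeting logic without a tokenizer dependency, a rough four-characters-per-token heuristic (a common rule of thumb for English text) preserves the shape of the algorithm. A self-contained sketch:

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def allocate_context_rough(question: str, chunks: list[dict],
                           max_context_tokens: int = 12000) -> list[dict]:
    """Same greedy budget allocation as above, with an approximate tokenizer."""
    selected = []
    used = rough_token_count(question) + 500  # 500 for system prompt overhead
    for chunk in chunks:  # assumes chunks are relevance-ranked
        cost = rough_token_count(chunk["text"])
        if used + cost > max_context_tokens:
            break  # budget exhausted
        selected.append(chunk)
        used += cost
    return selected

chunks = [{"text": "a" * 4000}, {"text": "b" * 4000}, {"text": "c" * 60000}]
kept = allocate_context_rough("How do refunds work?", chunks, max_context_tokens=3000)
# The first two ~1,000-token chunks fit; the ~15,000-token chunk blows the budget.
```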
💡
Apply at work: Even though GPT-4o supports 128K tokens, keep your context to 10–15K tokens. Research shows that LLM accuracy degrades with very long contexts (the "lost in the middle" problem). More context is not always better — more relevant context is.

Key Takeaways

  • RAG system prompts must explicitly constrain the model to answer only from provided context.
  • Format retrieved chunks with numbered source labels to enable reliable inline citations.
  • Layer multiple hallucination defenses: prompt constraints, self-consistency checks, and confidence scoring.
  • Stream responses to reduce perceived latency from seconds to milliseconds.
  • Keep context to 10–15K tokens even with large context windows — more relevant content beats more content.