
Generation

The final step: construct prompts with retrieved context and generate accurate, cited answers using LLMs.

The RAG Prompt Template

The key to good RAG generation is a well-structured prompt that clearly separates the context from the question:

Python - RAG Prompt Template
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant. Answer the question based ONLY
on the following context. If the context does not contain
enough information to answer, say "I don't have enough
information to answer this question."

Context:
{context}

Question: {question}

Instructions:
- Answer based only on the provided context
- Cite your sources using [Source: filename] format
- If multiple sources agree, mention that
- Be concise but thorough
""")
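To see what the model actually receives, here is a plain-Python sketch of rendering an equivalent template. The context and question values are made up for illustration; in the real chain, LangChain fills the placeholders the same way.

```python
# Plain-Python sketch of the rendered prompt. The template mirrors the
# structure above; the sample context and question are invented.
template = """Answer the question based ONLY on the following context.

Context:
{context}

Question: {question}"""

rendered = template.format(
    context="[Source: auth.md]\nAuthentication uses OAuth 2.0 tokens.",
    question="How do I configure authentication?",
)
print(rendered)
```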

Complete RAG Chain

Python - Full RAG Chain
from langchain_anthropic import ChatAnthropic
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-20250514")
# `vectorstore` is the vector store built during ingestion (previous lessons)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Format retrieved docs into a single context string
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )

# Build the RAG chain
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)

# Ask a question
answer = rag_chain.invoke("How do I configure authentication?")
print(answer)
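The chain pipes the retriever's documents through format_docs before they reach the prompt. A self-contained sketch of that formatting, using a minimal stand-in for LangChain's Document class and made-up document contents:

```python
from dataclasses import dataclass, field

# Minimal stand-in for langchain_core.documents.Document,
# just to illustrate format_docs in isolation
@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )

docs = [
    Doc("Authentication uses OAuth 2.0 tokens.", {"source": "auth.md"}),
    Doc("Tokens expire after 24 hours."),  # no metadata -> 'unknown'
]
context = format_docs(docs)
print(context)
```

Each document is labeled with its source so the model can cite it, and the `---` separator keeps chunk boundaries visible in the prompt.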

Context Window Management

LLMs have limited context windows. You need to manage how much retrieved context you include:

Python - Token Management
import tiktoken

def trim_context_to_tokens(docs, max_tokens=4000):
    """Greedily include docs, in retrieval order, until the budget is hit."""
    # tiktoken implements OpenAI tokenization; for Claude and other
    # non-OpenAI models, treat the counts as a close approximation
    encoder = tiktoken.encoding_for_model("gpt-4")
    selected = []
    total_tokens = 0

    for doc in docs:
        doc_tokens = len(encoder.encode(doc.page_content))
        if total_tokens + doc_tokens > max_tokens:
            break
        selected.append(doc)
        total_tokens += doc_tokens

    return selected
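The same greedy budgeting works with any token counter. A dependency-free sketch using a whitespace word count as a rough stand-in for a real tokenizer (the budget numbers are illustrative only):

```python
# Dependency-free sketch of the same greedy budget. A whitespace word
# count stands in for a real tokenizer here.
def trim_context(texts, max_units=10, count=lambda t: len(t.split())):
    selected, total = [], 0
    for text in texts:
        n = count(text)
        if total + n > max_units:
            break  # stop at the first doc that would overflow the budget
        selected.append(text)
        total += n
    return selected

chunks = ["one two three", "four five six seven", "eight nine ten eleven twelve"]
print(trim_context(chunks, max_units=7))  # keeps the first two chunks
```

Because retrievers return documents in relevance order, stopping at the first overflow keeps the most relevant chunks and drops the tail.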

Citation and Source Attribution

Python - Citations
citation_prompt = ChatPromptTemplate.from_template("""
Answer the question using the provided sources. For every
claim in your answer, cite the source using [1], [2], etc.
After your answer, list all sources used.

Sources:
{numbered_sources}

Question: {question}

Answer (with citations):
""")

def format_numbered_sources(docs):
    return "\n\n".join(
        f"[{i+1}] {d.metadata.get('source', 'unknown')}:\n{d.page_content}"
        for i, d in enumerate(docs)
    )
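Once the model answers with [1]-style markers, you can map them back to the numbered source list. A minimal post-processing sketch, assuming the model follows the bracket format; the answer text and source names are made up:

```python
import re

def extract_citations(answer, sources):
    """Map [n] markers in an answer back to the numbered source list."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return {n: sources[n - 1] for n in sorted(cited) if 1 <= n <= len(sources)}

# Made-up answer and source list for illustration
sources = ["auth.md", "config.md", "sla.md"]
answer = "OAuth tokens are required [1] and requests are capped per the SLA [3]."
print(extract_citations(answer, sources))  # {1: 'auth.md', 3: 'sla.md'}
```

This lets the application render clickable source links or verify that every cited index actually exists.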

Streaming Responses

Python - Streaming
# Stream the response as chunks arrive
for chunk in rag_chain.stream("What is our SLA?"):
    print(chunk, end="", flush=True)

# Or asynchronously — yield is only valid inside a function,
# so wrap the loop in an async generator
async def stream_answer(question):
    async for chunk in rag_chain.astream(question):
        yield chunk

Multi-Turn RAG Conversations

Handle follow-up questions by incorporating conversation history:

Python - Conversational RAG
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder

# Contextualize question based on chat history
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", "Given the chat history and latest question, "
     "reformulate the question to be standalone."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)

# User: "What is the refund policy?"
# AI: "The refund policy allows returns within 30 days..."
# User: "What about digital products?"
# Reformulated: "What is the refund policy for digital products?"
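To make the reformulation step concrete without an LLM call, here is a self-contained sketch of the surrounding history bookkeeping. The `reformulate` stub below is a trivial stand-in for the rewrite the history-aware retriever asks the LLM to perform:

```python
# Toy stand-in for the LLM-powered reformulation step, showing only
# the history bookkeeping around it.
def reformulate(history, question):
    if not history:
        return question  # first turn: nothing to contextualize
    last_question = history[-1][0]
    return f"{question} (in the context of: {last_question})"

history = []
q1 = "What is the refund policy?"
a1 = "Returns are allowed within 30 days."
history.append((q1, a1))

q2 = "What about digital products?"
print(reformulate(history, q2))
```

The real chain replaces the stub with an LLM call, but the shape is the same: each turn's question and answer are appended to the history, and the next question is rewritten against it before retrieval.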

Using Different LLMs

Model          Best For              Context Window  Notes
Claude Sonnet  General RAG, coding   200K tokens     Excellent at following instructions
Claude Opus    Complex reasoning     200K tokens     Best for nuanced analysis
GPT-4o         General purpose       128K tokens     Good balance of quality and speed
Gemini 2.0     Multi-modal RAG       1M tokens       Largest context window
Llama 3        Self-hosted, privacy  128K tokens     Free, runs locally

Model selection tip: Use a fast, cheap model (Claude Haiku, GPT-4o mini) for simple factual Q&A. Use a powerful model (Claude Sonnet/Opus, GPT-4o) for complex analysis, synthesis, and multi-step reasoning.
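That tip can be mechanized with a simple router. A toy sketch: the keyword list, length threshold, and short model names below are arbitrary illustrations, not a production policy.

```python
# Toy heuristic router: cheap model for short factual lookups,
# stronger model for analysis-style questions.
def pick_model(question: str) -> str:
    analysis_words = {"compare", "analyze", "why", "synthesize", "explain"}
    words = question.lower().split()
    if len(words) > 30 or any(w.strip("?.,") in analysis_words for w in words):
        return "claude-sonnet"  # stronger model for complex queries
    return "claude-haiku"       # fast, cheap model for simple Q&A

print(pick_model("What is our SLA?"))                             # claude-haiku
print(pick_model("Compare the refund policies across regions."))  # claude-sonnet
```

In practice you might route on embedding similarity or a classifier instead of keywords, but the pattern — cheap path by default, escalate on complexity — stays the same.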

What's Next?

The next lesson covers evaluation — how to measure and improve the quality of your RAG system.