
Generation

The final step: construct prompts with retrieved context and generate accurate, cited answers using LLMs.

The RAG Prompt Template

The key to good RAG generation is a well-structured prompt that clearly separates the context from the question:

Python - RAG Prompt Template
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant. Answer the question based ONLY
on the following context. If the context does not contain
enough information to answer, say "I don't have enough
information to answer this question."

Context:
{context}

Question: {question}

Instructions:
- Answer based only on the provided context
- Cite your sources using [Source: filename] format
- If multiple sources agree, mention that
- Be concise but thorough
""")
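To see what the model actually receives, here is a plain-Python sketch of rendering an equivalent template. The context and question values are made up for illustration; in the real chain, LangChain fills the placeholders the same way.

```python
# Plain-Python sketch of the rendered prompt. The template mirrors the
# structure above; the sample context and question are invented.
template = """Answer the question based ONLY on the following context.

Context:
{context}

Question: {question}"""

rendered = template.format(
    context="[Source: auth.md]\nAuthentication uses OAuth 2.0 tokens.",
    question="How do I configure authentication?",
)
print(rendered)
```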

Complete RAG Chain

Python - Full RAG Chain
from langchain_anthropic import ChatAnthropic
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-20250514")
# `vectorstore` is the vector store built during ingestion (previous lessons)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Format retrieved docs into a single context string
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )

# Build the RAG chain
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)

# Ask a question
answer = rag_chain.invoke("How do I configure authentication?")
print(answer)
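The chain pipes the retriever's documents through format_docs before they reach the prompt. A self-contained sketch of that formatting, using a minimal stand-in for LangChain's Document class and made-up document contents:

```python
from dataclasses import dataclass, field

# Minimal stand-in for langchain_core.documents.Document,
# just to illustrate format_docs in isolation
@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )

docs = [
    Doc("Authentication uses OAuth 2.0 tokens.", {"source": "auth.md"}),
    Doc("Tokens expire after 24 hours."),  # no metadata -> 'unknown'
]
context = format_docs(docs)
print(context)
```

Each document is labeled with its source so the model can cite it, and the `---` separator keeps chunk boundaries visible in the prompt.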

Context Window Management

LLMs have limited context windows. You need to manage how much retrieved context you include:

Python - Token Management
import tiktoken

def trim_context_to_tokens(docs, max_tokens=4000):
    """Greedily include docs, in retrieval order, until the budget is hit."""
    # tiktoken implements OpenAI tokenization; for Claude and other
    # non-OpenAI models, treat the counts as a close approximation
    encoder = tiktoken.encoding_for_model("gpt-4")
    selected = []
    total_tokens = 0

    for doc in docs:
        doc_tokens = len(encoder.encode(doc.page_content))
        if total_tokens + doc_tokens > max_tokens:
            break
        selected.append(doc)
        total_tokens += doc_tokens

    return selected
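The same greedy budgeting works with any token counter. A dependency-free sketch using a whitespace word count as a rough stand-in for a real tokenizer (the budget numbers are illustrative only):

```python
# Dependency-free sketch of the same greedy budget. A whitespace word
# count stands in for a real tokenizer here.
def trim_context(texts, max_units=10, count=lambda t: len(t.split())):
    selected, total = [], 0
    for text in texts:
        n = count(text)
        if total + n > max_units:
            break  # stop at the first doc that would overflow the budget
        selected.append(text)
        total += n
    return selected

chunks = ["one two three", "four five six seven", "eight nine ten eleven twelve"]
print(trim_context(chunks, max_units=7))  # keeps the first two chunks
```

Because retrievers return documents in relevance order, stopping at the first overflow keeps the most relevant chunks and drops the tail.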

Citation and Source Attribution

Python - Citations
citation_prompt = ChatPromptTemplate.from_template("""
Answer the question using the provided sources. For every
claim in your answer, cite the source using [1], [2], etc.
After your answer, list all sources used.

Sources:
{numbered_sources}

Question: {question}

Answer (with citations):
""")

def format_numbered_sources(docs):
    return "\n\n".join(
        f"[{i+1}] {d.metadata.get('source', 'unknown')}:\n{d.page_content}"
        for i, d in enumerate(docs)
    )
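Once the model answers with [1]-style markers, you can map them back to the numbered source list. A minimal post-processing sketch, assuming the model follows the bracket format; the answer text and source names are made up:

```python
import re

def extract_citations(answer, sources):
    """Map [n] markers in an answer back to the numbered source list."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return {n: sources[n - 1] for n in sorted(cited) if 1 <= n <= len(sources)}

# Made-up answer and source list for illustration
sources = ["auth.md", "config.md", "sla.md"]
answer = "OAuth tokens are required [1] and requests are capped per the SLA [3]."
print(extract_citations(answer, sources))  # {1: 'auth.md', 3: 'sla.md'}
```

This lets the application render clickable source links or verify that every cited index actually exists.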

Streaming Responses

Python - Streaming
# Stream the response as chunks arrive
for chunk in rag_chain.stream("What is our SLA?"):
    print(chunk, end="", flush=True)

# Or asynchronously — yield is only valid inside a function,
# so wrap the loop in an async generator
async def stream_answer(question):
    async for chunk in rag_chain.astream(question):
        yield chunk

Multi-Turn RAG Conversations

Handle follow-up questions by incorporating conversation history:

Python - Conversational RAG
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder

# Contextualize question based on chat history
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", "Given the chat history and latest question, "
     "reformulate the question to be standalone."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)

# User: "What is the refund policy?"
# AI: "The refund policy allows returns within 30 days..."
# User: "What about digital products?"
# Reformulated: "What is the refund policy for digital products?"
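To make the reformulation step concrete without an LLM call, here is a self-contained sketch of the surrounding history bookkeeping. The `reformulate` stub below is a trivial stand-in for the rewrite the history-aware retriever asks the LLM to perform:

```python
# Toy stand-in for the LLM-powered reformulation step, showing only
# the history bookkeeping around it.
def reformulate(history, question):
    if not history:
        return question  # first turn: nothing to contextualize
    last_question = history[-1][0]
    return f"{question} (in the context of: {last_question})"

history = []
q1 = "What is the refund policy?"
a1 = "Returns are allowed within 30 days."
history.append((q1, a1))

q2 = "What about digital products?"
print(reformulate(history, q2))
```

The real chain replaces the stub with an LLM call, but the shape is the same: each turn's question and answer are appended to the history, and the next question is rewritten against it before retrieval.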

Using Different LLMs

Model          Best For              Context Window  Notes
Claude Sonnet  General RAG, coding   200K tokens     Excellent at following instructions
Claude Opus    Complex reasoning     200K tokens     Best for nuanced analysis
GPT-4o         General purpose       128K tokens     Good balance of quality and speed
Gemini 2.0     Multi-modal RAG       1M tokens       Largest context window
Llama 3        Self-hosted, privacy  128K tokens     Free, runs locally

Model selection tip: Use a fast, cheap model (Claude Haiku, GPT-4o mini) for simple factual Q&A. Use a powerful model (Claude Sonnet/Opus, GPT-4o) for complex analysis, synthesis, and multi-step reasoning.
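That tip can be mechanized with a simple router. A toy sketch: the keyword list, length threshold, and short model names below are arbitrary illustrations, not a production policy.

```python
# Toy heuristic router: cheap model for short factual lookups,
# stronger model for analysis-style questions.
def pick_model(question: str) -> str:
    analysis_words = {"compare", "analyze", "why", "synthesize", "explain"}
    words = question.lower().split()
    if len(words) > 30 or any(w.strip("?.,") in analysis_words for w in words):
        return "claude-sonnet"  # stronger model for complex queries
    return "claude-haiku"       # fast, cheap model for simple Q&A

print(pick_model("What is our SLA?"))                             # claude-haiku
print(pick_model("Compare the refund policies across regions."))  # claude-sonnet
```

In practice you might route on embedding similarity or a classifier instead of keywords, but the pattern — cheap path by default, escalate on complexity — stays the same.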

What's Next?

The next lesson covers evaluation — how to measure and improve the quality of your RAG system.