Generation
The final step: construct prompts with retrieved context and generate accurate, cited answers using LLMs.
The RAG Prompt Template
The key to good RAG generation is a well-structured prompt that clearly separates the context from the question:
Python - RAG Prompt Template
```python
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant. Answer the question based ONLY on the following context.
If the context does not contain enough information to answer, say
"I don't have enough information to answer this question."

Context:
{context}

Question: {question}

Instructions:
- Answer based only on the provided context
- Cite your sources using [Source: filename] format
- If multiple sources agree, mention that
- Be concise but thorough
""")
```
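Conceptually, the template step just interpolates two fields into one string. A plain-Python sketch of that fill step (the `TEMPLATE` string and sample values below are illustrative, not the full prompt above):

```python
# Minimal stand-in for what .from_template() does at render time:
# substitute {context} and {question} into the template text.
TEMPLATE = (
    "Answer the question based ONLY on the following context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

prompt = TEMPLATE.format(
    context="[Source: auth.md]\nSet AUTH_MODE=oauth in the config file.",
    question="How do I configure authentication?",
)
print(prompt)
```

The clear `Context:` / `Question:` labels are what let the model distinguish retrieved evidence from the user's request.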
Complete RAG Chain
Python - Full RAG Chain
```python
from langchain_anthropic import ChatAnthropic
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-20250514")
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Format retrieved docs into a single context string
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )

# Build the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

# Ask a question
answer = rag_chain.invoke("How do I configure authentication?")
print(answer)
```
Context Window Management
LLMs have limited context windows. You need to manage how much retrieved context you include:
Python - Token Management
```python
import tiktoken

def trim_context_to_tokens(docs, max_tokens=4000):
    """Include as many docs as fit in the token budget."""
    encoder = tiktoken.encoding_for_model("gpt-4")
    selected = []
    total_tokens = 0
    for doc in docs:
        doc_tokens = len(encoder.encode(doc.page_content))
        if total_tokens + doc_tokens > max_tokens:
            break
        selected.append(doc)
        total_tokens += doc_tokens
    return selected
```
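To see the budget logic without the tiktoken dependency, here is a self-contained sketch that combines trimming with `format_docs`-style formatting. The `Doc` class stands in for LangChain's `Document`, and the roughly-four-characters-per-token heuristic is an illustrative assumption, not an exact count:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:  # hypothetical stand-in for langchain_core.documents.Document
    page_content: str
    metadata: dict = field(default_factory=dict)

def approx_tokens(text):
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def trim_and_format(docs, max_tokens=4000):
    """Greedily keep docs until the budget is hit, then format them."""
    selected, total = [], 0
    for d in docs:
        t = approx_tokens(d.page_content)
        if total + t > max_tokens:
            break
        selected.append(d)
        total += t
    return "\n\n---\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in selected
    )

docs = [Doc("A" * 400, {"source": "a.md"}), Doc("B" * 400, {"source": "b.md"})]
context = trim_and_format(docs, max_tokens=150)  # only the first doc fits
```

Because retrievers return docs ranked by relevance, greedy trimming keeps the most relevant documents and drops the tail.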
Citation and Source Attribution
Python - Citations
```python
citation_prompt = ChatPromptTemplate.from_template("""
Answer the question using the provided sources. For every claim in your
answer, cite the source using [1], [2], etc. After your answer, list all
sources used.

Sources:
{numbered_sources}

Question: {question}

Answer (with citations):
""")

def format_numbered_sources(docs):
    return "\n\n".join(
        f"[{i+1}] {d.metadata.get('source', 'unknown')}:\n{d.page_content}"
        for i, d in enumerate(docs)
    )
```
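As a quick check of the numbering format, here is a standalone run of `format_numbered_sources` with a minimal stand-in for LangChain's `Document` (the `Doc` class and the sample texts are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Doc:  # hypothetical stand-in for langchain_core.documents.Document
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_numbered_sources(docs):
    return "\n\n".join(
        f"[{i+1}] {d.metadata.get('source', 'unknown')}:\n{d.page_content}"
        for i, d in enumerate(docs)
    )

docs = [
    Doc("Refunds are accepted within 30 days.", {"source": "policy.md"}),
    Doc("Digital goods are final sale.", {"source": "digital.md"}),
]
out = format_numbered_sources(docs)
print(out)
```

The bracketed numbers in the sources are what the model echoes back as `[1]`, `[2]` citations, so keeping them stable across the prompt is important.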
Streaming Responses
Python - Streaming
```python
# Stream the response token by token
for chunk in rag_chain.stream("What is our SLA?"):
    print(chunk, end="", flush=True)

# Or asynchronously (this must run inside an async generator)
async for chunk in rag_chain.astream("What is our SLA?"):
    yield chunk
```
Multi-Turn RAG Conversations
Handle follow-up questions by incorporating conversation history:
Python - Conversational RAG
```python
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder

# Contextualize the question based on chat history
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Given the chat history and latest question, "
     "reformulate the question to be standalone."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)

# User: "What is the refund policy?"
# AI:   "The refund policy allows returns within 30 days..."
# User: "What about digital products?"
# Reformulated: "What is the refund policy for digital products?"
```
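Under the hood, the reformulation step simply sees the prior turns plus the new question. A pure-Python sketch of that bookkeeping, with no LangChain involved (`build_contextualize_input` is a hypothetical helper, not a library function):

```python
def build_contextualize_input(chat_history, question):
    """Render (role, text) turns plus the latest question as one transcript,
    the raw material the reformulation prompt works from."""
    lines = [f"{role}: {text}" for role, text in chat_history]
    lines.append(f"human: {question}")
    return "\n".join(lines)

history = [
    ("human", "What is the refund policy?"),
    ("ai", "The refund policy allows returns within 30 days..."),
]
transcript = build_contextualize_input(history, "What about digital products?")
print(transcript)
```

Only because the transcript contains "refund policy" can the model expand the vague follow-up into a standalone query worth retrieving on.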
Using Different LLMs
| Model | Best For | Context Window | Notes |
|---|---|---|---|
| Claude Sonnet | General RAG, coding | 200K tokens | Excellent at following instructions |
| Claude Opus | Complex reasoning | 200K tokens | Best for nuanced analysis |
| GPT-4o | General purpose | 128K tokens | Good balance of quality and speed |
| Gemini 2.0 | Multi-modal RAG | 1M tokens | Largest context window |
| Llama 3.1 | Self-hosted, privacy | 128K tokens | Open weights, runs locally |
Model selection tip: Use a fast, cheap model (Claude Haiku, GPT-4o mini) for simple factual Q&A. Use a powerful model (Claude Sonnet/Opus, GPT-4o) for complex analysis, synthesis, and multi-step reasoning.
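One way to apply this tip is a small router that inspects each question before choosing a model tier. The heuristic below (word count plus a few marker words) is a rough illustrative assumption, not a fixed rule; in production you might route with a cheap classifier model instead:

```python
def pick_model_tier(question: str) -> str:
    """Return 'fast' for short factual lookups, 'powerful' for questions
    that need analysis or synthesis (hypothetical heuristic sketch)."""
    markers = ("compare", "why", "analyze", "synthesize", "explain", "trade-off")
    q = question.lower()
    if len(question.split()) > 20 or any(m in q for m in markers):
        return "powerful"
    return "fast"

print(pick_model_tier("What is our SLA?"))  # → fast
print(pick_model_tier("Compare the refund policies and explain why they differ"))  # → powerful
```

Each tier then maps to a concrete model ID (e.g. a Haiku-class model for "fast", a Sonnet/Opus-class model for "powerful") in your configuration.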
What's Next?
The next lesson covers evaluation — how to measure and improve the quality of your RAG system.
Lilly Tech Systems