Intermediate

RAG & Retrieval

Retrieval-Augmented Generation (RAG) is the most common LLM application pattern. This lesson covers the complete RAG pipeline: loading documents, splitting text, creating embeddings, storing in vector databases, retrieving relevant chunks, and generating answers. Includes practice questions.

Document Loaders

# Document Loaders - Loading data from various sources
from langchain_community.document_loaders import (
    PyPDFLoader,
    WebBaseLoader,
    CSVLoader,
    TextLoader,
    DirectoryLoader,
    UnstructuredMarkdownLoader
)

# Load PDF
loader = PyPDFLoader("document.pdf")
docs = loader.load()  # Returns list of Document objects
# Each Document has: page_content (str) and metadata (dict)

# Load web page
loader = WebBaseLoader("https://example.com/article")
docs = loader.load()

# Load CSV
loader = CSVLoader("data.csv", csv_args={"delimiter": ","})
docs = loader.load()

# Load entire directory
loader = DirectoryLoader("./docs/", glob="**/*.md",
                         loader_cls=UnstructuredMarkdownLoader)
docs = loader.load()

# Document structure:
# doc.page_content = "The actual text content..."
# doc.metadata = {"source": "file.pdf", "page": 0, ...}

Text Splitters

# Text Splitters - Breaking documents into chunks
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)

# RecursiveCharacterTextSplitter (recommended - most versatile)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Max characters per chunk
    chunk_overlap=200,      # Overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # Split priority
    length_function=len
)

chunks = splitter.split_documents(docs)

# Key concepts for the exam:
splitter_concepts = {
    "chunk_size": "Maximum size of each chunk. Too small = lost context. Too large = noisy retrieval",
    "chunk_overlap": "Overlap ensures context is not lost at chunk boundaries. Typically 10-20% of chunk_size",
    "separators": "RecursiveCharacter tries each separator in order, preferring natural breaks",
    "Typical values": "chunk_size=500-1500, chunk_overlap=50-200 depending on use case"
}

# TokenTextSplitter - split by token count (useful for LLM context limits)
splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

Vector Stores & Embeddings

# Embeddings and Vector Stores
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS, Chroma

# Create embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store from documents
vectorstore = FAISS.from_documents(chunks, embeddings)

# Or with Chroma (persistent)
vectorstore = Chroma.from_documents(
    chunks, embeddings,
    persist_directory="./chroma_db",
    collection_name="my_docs"
)

# Add documents later
vectorstore.add_documents(new_chunks)

# Direct similarity search
results = vectorstore.similarity_search("What is RAG?", k=4)
# Returns top 4 most similar documents

# Similarity search with scores
results = vectorstore.similarity_search_with_score("What is RAG?", k=4)
# Returns list of (Document, float) tuples

# Save and load FAISS
vectorstore.save_local("./faiss_index")
loaded_vs = FAISS.load_local("./faiss_index", embeddings,
                              allow_dangerous_deserialization=True)

# Vector store comparison:
vector_stores = {
    "FAISS": "In-memory, fast, no server needed, good for prototyping",
    "Chroma": "Persistent, embedded database, good for small-medium projects",
    "Pinecone": "Cloud-hosted, scalable, managed service, production-ready",
    "Weaviate": "Cloud or self-hosted, supports hybrid search",
    "Qdrant": "High-performance, supports filtering, cloud or self-hosted"
}
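Under the hood, similarity_search embeds the query and ranks the stored vectors by cosine similarity. A pure-Python sketch of that ranking step, using made-up 3-dimensional "embeddings" (real embedding vectors have hundreds or thousands of dimensions):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "index": pretend these are embedding vectors for three stored chunks
index = {
    "chunk about RAG":       [0.9, 0.1, 0.0],
    "chunk about cooking":   [0.0, 0.2, 0.9],
    "chunk about retrieval": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "What is RAG?"

# Rank stored chunks by similarity to the query (what similarity_search does)
ranked = sorted(index, key=lambda text: cosine(query_vec, index[text]), reverse=True)
# ranked[0] is the RAG chunk; the cooking chunk comes last
```

A real vector store does the same comparison with an approximate nearest-neighbor index instead of a linear scan, which is what makes it fast at scale.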

Retrievers & RAG Chain

# Retrievers and RAG chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")  # any chat model works here

# Create retriever from vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",      # "similarity" or "mmr"
    search_kwargs={"k": 4}         # Number of documents to retrieve
)

# MMR retriever (Maximum Marginal Relevance - more diverse results)
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20}  # Fetch 20, select 4 diverse
)

# Build RAG chain with LCEL
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based on the context below. If you cannot find "
               "the answer in the context, say 'I don't know.'\n\n"
               "Context: {context}"),
    ("human", "{question}")
])

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

answer = rag_chain.invoke("What is retrieval augmented generation?")

# Retriever search types:
search_types = {
    "similarity": "Returns k most similar documents by cosine similarity",
    "mmr": "Maximum Marginal Relevance - balances relevance and diversity",
    "similarity_score_threshold": "Only return docs above a score threshold"
}

# With score threshold
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.8}
)

Practice Questions

💡
Test your knowledge of RAG and retrieval:
Q1: Why is RecursiveCharacterTextSplitter preferred over CharacterTextSplitter?

Answer: RecursiveCharacterTextSplitter tries multiple separators in priority order (paragraphs, newlines, sentences, words) to find the most natural split points. CharacterTextSplitter uses only a single separator. The recursive approach preserves semantic meaning better because it prefers splitting at paragraph and sentence boundaries rather than mid-sentence.
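The recursive strategy can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation (for example, it drops the separator at chunk boundaries, which the real splitter preserves):

```python
def simple_recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Simplified sketch of recursive splitting: try separators in priority
    order, and only fall back to finer-grained splits for oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            parts, chunks, current = text.split(sep), [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate          # keep merging small parts
                else:
                    if current:
                        chunks.append(current)   # flush the full chunk
                    current = part
            if current:
                chunks.append(current)
            # Recurse with the remaining (finer) separators on oversized chunks
            result = []
            for chunk in chunks:
                if len(chunk) > chunk_size:
                    result.extend(simple_recursive_split(chunk, chunk_size, separators[1:]))
                else:
                    result.append(chunk)
            return result
    # No separator left: hard character slicing as a last resort
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "First paragraph, sentence one. Sentence two.\n\nSecond paragraph here."
chunks = simple_recursive_split(text, chunk_size=50)
# Splits at the paragraph boundary first, keeping each paragraph intact
```

A single-separator splitter given only " " would chop the same text word by word, with no preference for paragraph or sentence boundaries.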

Q2: What is chunk_overlap and why is it important?

Answer: chunk_overlap specifies how many characters are shared between adjacent chunks. It prevents information loss at chunk boundaries. Without overlap, a relevant sentence split across two chunks might not be fully retrieved. Typical values are 10-20% of chunk_size (e.g., chunk_size=1000, chunk_overlap=200).
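The overlap mechanic is easy to see with a naive fixed-size splitter (a toy sketch, not how LangChain splits text):

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Naive sliding-window chunking: each chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij"  # 10 characters
chunks = chunk_with_overlap(text, chunk_size=4, chunk_overlap=2)
# -> ["abcd", "cdef", "efgh", "ghij", "ij"]
# Each chunk repeats the last 2 characters of the previous chunk,
# so content near a boundary always appears whole in at least one chunk.
```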

Q3: What is the difference between similarity search and MMR retrieval?

Answer: Similarity search returns the k most similar documents, which may be redundant if they cover the same content. MMR (Maximum Marginal Relevance) balances relevance and diversity by first fetching more documents (fetch_k) and then selecting k that are both relevant and dissimilar to each other. Use MMR when you want broader coverage of the topic.
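The greedy selection behind MMR can be sketched in pure Python. This is a simplified version of the standard MMR formula (score = λ·relevance − (1−λ)·redundancy), not LangChain's internal code:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mmr_select(query, docs, k, lambda_mult=0.5):
    """Greedy MMR: at each step pick the candidate that best trades off
    relevance to the query against similarity to already-selected docs."""
    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
docs = [[0.9, 0.4], [0.9, 0.41], [0.8, -0.5]]  # docs 0 and 1 are near-duplicates

top_by_similarity = sorted(range(3), key=lambda i: cosine(query, docs[i]), reverse=True)[:2]
top_by_mmr = mmr_select(query, docs, k=2)
# similarity picks the two near-duplicates [0, 1];
# MMR keeps doc 0 but swaps the duplicate for the diverse doc 2 -> [0, 2]
```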

Q4: In a RAG chain, what does RunnablePassthrough() do for the question?

Answer: In the pattern {"context": retriever | format_docs, "question": RunnablePassthrough()}, RunnablePassthrough() passes the original user query through unchanged as the "question" key. Meanwhile, the same query is also sent to the retriever to fetch context. This allows both the retrieved context and the original question to reach the prompt template.
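Conceptually, the dict at the head of an LCEL chain fans the same input out to every value, and RunnablePassthrough is just the identity function. A plain-Python analogue, where fake_retrieve is a hypothetical stand-in for retriever | format_docs:

```python
def passthrough(x):
    # What RunnablePassthrough() does: return the input unchanged
    return x

def fake_retrieve(query):
    # Hypothetical stand-in for `retriever | format_docs`
    return f"<context retrieved for: {query}>"

def fan_out(query):
    # Mirrors {"context": retriever | format_docs, "question": RunnablePassthrough()}
    # Both values receive the SAME input query
    return {"context": fake_retrieve(query), "question": passthrough(query)}

inputs = fan_out("What is RAG?")
# inputs["question"] is the unchanged query; inputs["context"] is the retrieved text
```

The resulting dict is exactly the shape the prompt template expects, since it has both a {context} and a {question} variable.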

Q5: When should you use FAISS vs Chroma vs Pinecone?

Answer: Use FAISS for prototyping and testing (in-memory, no server needed, fastest setup). Use Chroma for small-to-medium projects that need persistence (embedded database, easy to set up). Use Pinecone for production at scale (cloud-hosted, managed infrastructure, scales to millions of vectors, filtering support).

Key Takeaways

💡
  • RAG pipeline: Load documents → Split text → Embed → Store in vector DB → Retrieve → Generate
  • Use RecursiveCharacterTextSplitter with appropriate chunk_size (500-1500) and overlap (10-20%)
  • FAISS for prototyping, Chroma for persistence, Pinecone for production
  • MMR retrieval provides more diverse results than simple similarity search
  • The LCEL RAG pattern uses RunnablePassthrough to pass the question alongside retrieved context