RAG Applications (LLM + Embedding)

Retrieval-Augmented Generation is the most common and impactful multi-model pattern. By combining embedding models for search with LLMs for generation, RAG lets you build AI applications grounded in your own data, sharply reducing hallucination and working around knowledge-cutoff limits.

What is RAG?

Retrieval-Augmented Generation (RAG) is an architecture that enhances LLM responses by first retrieving relevant information from an external knowledge base, then providing that information as context for the LLM to generate its answer.

Instead of relying solely on what the LLM learned during training, RAG gives the model access to your specific documents, databases, and knowledge — producing responses that are accurate, up-to-date, and verifiable.

💡
Why "Retrieval-Augmented"? The generation (LLM response) is augmented with retrieved context. The LLM doesn't guess — it reads the relevant documents and synthesizes an answer based on what it found. Think of it as giving the LLM an open-book exam instead of a closed-book one.

Why RAG: Solving LLM Limitations

RAG directly addresses the three biggest limitations of standalone LLMs:

LLM Limitation | Problem | How RAG Solves It
Hallucination | LLMs confidently generate plausible but incorrect information | RAG grounds responses in retrieved documents. The LLM cites specific sources, and you can verify claims against the original text.
Knowledge Cutoff | LLMs only know information up to their training date | RAG retrieves from a continuously updated knowledge base. New documents are indexed immediately and available for queries.
No Domain Expertise | LLMs lack knowledge of your company's internal docs, policies, and data | RAG indexes your private documents (Confluence pages, Slack threads, code repos, internal wikis), making them searchable and referenceable.
RAG doesn't eliminate hallucination entirely. An LLM can still hallucinate even with context. The key is that RAG dramatically reduces hallucination and makes it detectable — you can compare the response against the retrieved sources to verify accuracy.

RAG Architecture

A RAG system has two phases: indexing (offline, run at setup and whenever documents change) and querying (online, per-request). Here's the complete data flow:

RAG Architecture - Complete Data Flow
═══ INDEXING PHASE (offline) ═══

Documents ──→ [Document Loader] ──→ Raw Text
                                         │
                                    [Text Splitter]
                                         │
                                      Chunks (e.g., 512 tokens each)
                                         │
                                   [Embedding Model]
                                         │
                                    Dense Vectors (e.g., 1536-dim)
                                         │
                                   [Vector Database]  ──→  Stored & Indexed
                                   (with metadata)


═══ QUERY PHASE (online, per request) ═══

User Query ──→ [Embedding Model] ──→ Query Vector
                                            │
                                      [Vector Search]  (top-k similar)
                                            │
                                      Retrieved Chunks (k=5-20)
                                            │
                                       [Reranker]  (optional but recommended)
                                            │
                                      Top Chunks (k=3-5, reranked)
                                            │
                               ┌────────────┴────────────┐
                               │     PROMPT TEMPLATE      │
                               │  System: You are a ...   │
                               │  Context: {chunks}       │
                               │  Question: {query}       │
                               └────────────┬────────────┘
                                            │
                                        [LLM]
                                            │
                                     Generated Answer
                                    (with source citations)

RAG Components Deep Dive

1. Document Loaders

Document loaders extract text from various source formats. The quality of your RAG system starts here — garbage in, garbage out.

Python - Common Document Loaders
from langchain_community.document_loaders import (
    PyPDFLoader,           # PDF files
    UnstructuredMarkdownLoader,  # Markdown files
    CSVLoader,             # CSV/spreadsheet data
    WebBaseLoader,         # Web pages
    GitLoader,             # Git repositories
    NotionDirectoryLoader, # Notion exports
    ConfluenceLoader,      # Confluence pages (via API)
    SlackDirectoryLoader,  # Slack message exports
)

# Example: Load a PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")

2. Text Splitters (Chunking)

Documents must be split into smaller chunks that fit within embedding model context windows and provide focused, retrievable units of information.

Strategy | How It Works | Best For | Typical Size
Fixed-Size | Split every N characters/tokens with overlap | Simple docs, unstructured text | 500-1000 tokens, 100-200 overlap
Recursive | Split on paragraphs, then sentences, then words | General-purpose, most common choice | 500-1000 tokens
Semantic | Use embeddings to detect topic shifts and split at boundaries | Long documents with topic changes | Variable (200-1500 tokens)
Document-Aware | Split on markdown headers, HTML tags, or code blocks | Structured docs (markdown, HTML, code) | Variable per section
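The fixed-size strategy is simple enough to sketch directly. A character-based toy version (production splitters typically count tokens, not characters):

```python
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks, each overlapping the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

text = "A" * 1200
chunks = chunk_fixed(text, chunk_size=500, overlap=100)
print([len(c) for c in chunks])  # [500, 500, 400]
```

The overlap ensures that a sentence straddling a chunk boundary appears intact in at least one chunk.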
Python - Chunking Strategies
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    TokenTextSplitter,
)

# Recursive splitting (recommended default)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Max characters per chunk
    chunk_overlap=200,      # Overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""]  # Priority order
)
chunks = splitter.split_documents(documents)

# Markdown-aware splitting (for structured docs)
headers_to_split = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split)
md_chunks = md_splitter.split_text(markdown_text)

print(f"Split into {len(chunks)} chunks")
Chunking best practice: Start with RecursiveCharacterTextSplitter at 1000 chars with 200 overlap. This works well for 80% of use cases. Only move to semantic or document-aware chunking when you've validated that simple chunking isn't giving good retrieval results.
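For intuition on how semantic chunking works, here is a toy sketch that uses a bag-of-words "embedding" as a stand-in for a real embedding model, starting a new chunk wherever similarity between adjacent sentences drops below a threshold. A real implementation would call an actual embedding model and tune the threshold on your corpus:

```python
import math
import re
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_split(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk wherever similarity to the previous sentence drops."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append([cur])       # topic shift: begin a new chunk
        else:
            chunks[-1].append(cur)     # same topic: extend current chunk
    return chunks

sents = [
    "Our vacation policy grants 20 days of paid vacation.",
    "Unused vacation days roll over to the next year.",
    "The server deploys run on Kubernetes every Friday.",
]
print(len(semantic_split(sents)))  # 2 (policy sentences together, deploy sentence alone)
```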

3. Embedding Models

Embedding models convert text into dense vector representations that capture semantic meaning. Similar texts produce vectors that are close together in the vector space.
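"Close together" is typically measured with cosine similarity. A toy sketch with hand-made 4-dimensional vectors standing in for real embeddings (which have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: related concepts point in similar directions
v_cat     = [0.90, 0.10, 0.00, 0.20]
v_kitten  = [0.85, 0.15, 0.05, 0.25]
v_invoice = [0.00, 0.10, 0.95, 0.10]

print(round(cosine_similarity(v_cat, v_kitten), 3))   # high: related concepts
print(round(cosine_similarity(v_cat, v_invoice), 3))  # low: unrelated concepts
```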

Model | Provider | Dimensions | Max Tokens | Cost (per 1M tokens) | Best For
text-embedding-3-large | OpenAI | 3072 | 8191 | $0.13 | Highest accuracy, enterprise
text-embedding-3-small | OpenAI | 1536 | 8191 | $0.02 | Good balance of cost/quality
embed-v3 | Cohere | 1024 | 512 | $0.10 | Multilingual, search-optimized
BGE-large-en-v1.5 | BAAI (open) | 1024 | 512 | Free (self-hosted) | Open-source, self-hosted
E5-large-v2 | Microsoft (open) | 1024 | 512 | Free (self-hosted) | Open-source, multilingual
voyage-3 | Voyage AI | 1024 | 16000 | $0.06 | Code search, long context

4. Vector Stores

Vector databases store embeddings and enable fast similarity search. Choosing the right one depends on your scale and infrastructure preferences.

Database | Type | Max Vectors | Hybrid Search | Pricing | Best For
Pinecone | Managed cloud | Billions | Yes | Free tier + pay-per-use | Production, zero-ops, fast scaling
Weaviate | Open / managed | Billions | Yes | Open-source + cloud option | Hybrid search, rich filtering
ChromaDB | Open-source | Millions | No | Free | Prototyping, local dev, simple API
Qdrant | Open / managed | Billions | Yes | Open-source + cloud option | Performance, Rust engine, filtering
pgvector | Postgres ext. | Millions | Yes (with FTS) | Free (existing Postgres) | Teams already using PostgreSQL
Milvus | Open-source | Billions | Yes | Open-source + Zilliz Cloud | Large-scale, GPU-accelerated

Full RAG Pipeline: LangChain + ChromaDB + OpenAI

Here's a complete, working RAG pipeline you can run locally:

Python - Complete RAG Pipeline
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# ── Step 1: Load Documents ──
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")

# ── Step 2: Chunk Documents ──
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# ── Step 3: Create Embeddings & Store in ChromaDB ──
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="company_docs",
)
print("Vector store created and persisted")

# ── Step 4: Create Retriever ──
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},  # Return top 5 chunks
)

# ── Step 5: Create RAG Prompt ──
rag_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant that answers questions based on the provided context.
Use ONLY the context below to answer. If the context doesn't contain the answer,
say "I don't have enough information to answer this question."

Always cite which document/section your answer comes from.

Context:
{context}

Question: {question}

Answer:""")

# ── Step 6: Create LLM ──
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# ── Step 7: Build RAG Chain ──
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}, "
        f"Page {doc.metadata.get('page', '?')}]\n{doc.page_content}"
        for doc in docs
    )

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

# ── Step 8: Query! ──
answer = rag_chain.invoke("What is our company's remote work policy?")
print(answer)
Install dependencies: pip install langchain langchain-openai langchain-chroma langchain-community pypdf chromadb

RAG with Claude API

Here's the same concept using the Anthropic Claude API directly, without LangChain:

Python - RAG with Claude (Direct API)
import anthropic
import chromadb
from openai import OpenAI

# ── Initialize clients ──
claude = anthropic.Anthropic()
openai_client = OpenAI()
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("company_docs")

# ── Embed and store documents (one-time) ──
def index_documents(texts, metadatas, ids):
    """Embed texts and store in ChromaDB."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    embeddings = [item.embedding for item in response.data]
    collection.add(
        documents=texts,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids,
    )

# ── Retrieve relevant chunks ──
def retrieve(query, top_k=5):
    """Embed query and search for similar chunks."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[query],
    )
    query_embedding = response.data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
    )
    return results

# ── Generate answer with Claude ──
def rag_query(question):
    """Full RAG pipeline: retrieve context, then generate with Claude."""
    # Retrieve relevant chunks
    results = retrieve(question)
    context_chunks = results["documents"][0]
    metadatas = results["metadatas"][0]

    # Format context with source info
    context = "\n\n---\n\n".join(
        f"[Source: {m.get('source', 'unknown')}]\n{chunk}"
        for chunk, m in zip(context_chunks, metadatas)
    )

    # Generate with Claude
    message = claude.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="""You are a helpful assistant. Answer questions using ONLY
the provided context. Cite your sources. If the context doesn't
contain enough information, say so clearly.""",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return message.content[0].text

# ── Use it ──
answer = rag_query("What are the vacation day policies?")
print(answer)

Advanced RAG Techniques

Basic RAG gets you 70-80% of the way. These advanced techniques push quality significantly higher:

Hybrid Search

Combine dense vector search (semantic) with sparse keyword search (BM25) for better recall. Semantic search finds conceptually similar content, while keyword search catches exact term matches that embeddings might miss.

Python - Hybrid Search with Weaviate
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("Documents")

# Hybrid search: combines BM25 (keyword) + vector (semantic)
results = collection.query.hybrid(
    query="remote work policy for engineering team",
    alpha=0.7,   # 0.7 = 70% semantic, 30% keyword
    limit=10,
    return_metadata=weaviate.classes.query.MetadataQuery(score=True),
)

for obj in results.objects:
    print(f"Score: {obj.metadata.score:.3f} | {obj.properties['text'][:100]}")
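Under the hood, an alpha blend like this can be approximated by min-max normalizing each score list and taking a weighted sum. A minimal sketch in plain Python with toy scores (Weaviate's actual fusion algorithm may differ in details):

```python
def fuse_hybrid(semantic: dict[str, float], keyword: dict[str, float],
                alpha: float = 0.7) -> list[tuple[str, float]]:
    """Blend normalized semantic and keyword scores; alpha weights the semantic side."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    sem, kw = normalize(semantic), normalize(keyword)
    docs = set(sem) | set(kw)  # a doc may appear in only one result list
    fused = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

semantic_scores = {"doc_a": 0.91, "doc_b": 0.72, "doc_c": 0.55}  # cosine similarities
keyword_scores  = {"doc_b": 12.4, "doc_a": 3.1}                  # raw BM25 scores
for doc, score in fuse_hybrid(semantic_scores, keyword_scores, alpha=0.7):
    print(doc, round(score, 3))  # doc_a ranks first
```

Normalization matters because BM25 scores and cosine similarities live on entirely different scales.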

Reranking

After initial retrieval, a reranking model re-scores the results for more precise relevance. This is one of the highest-impact improvements you can make to a RAG system.

Python - Reranking with Cohere
import cohere

co = cohere.Client()

# Initial retrieval returns 20 candidates
initial_results = retriever.invoke(query)  # retriever built with search_kwargs={"k": 20}

# Rerank to find the best 5
rerank_response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[doc.page_content for doc in initial_results],
    top_n=5,
)

# Use reranked results for LLM context
reranked_docs = [initial_results[r.index] for r in rerank_response.results]
for r in rerank_response.results:
    print(f"Score: {r.relevance_score:.3f} | Index: {r.index}")
💡
Why reranking works so well: Embedding models optimize for fast approximate matching across millions of vectors. Reranking models do expensive pairwise comparison on a small set (10-20 candidates), giving much more accurate relevance scores. Retrieve wide (top-20), rerank narrow (top-5).

HyDE (Hypothetical Document Embeddings)

Instead of embedding the raw query, ask an LLM to generate a hypothetical answer, then embed that. The hypothetical answer is closer in embedding space to the actual documents than a short query.

Python - HyDE Implementation
def hyde_retrieve(query, retriever, llm):
    """Generate hypothetical answer, embed it, search with that."""
    # Step 1: Generate hypothetical answer
    hypothetical = llm.invoke(
        f"Write a detailed paragraph that would answer this question: {query}"
    )

    # Step 2: Use hypothetical answer as the search query
    # This embeds the hypothetical answer instead of the short query
    results = retriever.invoke(hypothetical.content)
    return results

Query Decomposition

For complex questions, break them into sub-questions, retrieve for each independently, then synthesize:

Python - Query Decomposition
def decompose_and_retrieve(complex_query, llm, retriever):
    """Break complex query into sub-queries, retrieve for each."""
    # Decompose
    decomposition = llm.invoke(
        f"""Break this complex question into 2-4 simpler sub-questions
that can be answered independently:

Question: {complex_query}

Return each sub-question on a new line, numbered."""
    )
    sub_queries = decomposition.content.strip().split("\n")

    # Retrieve for each sub-query
    all_docs = []
    for sub_q in sub_queries:
        docs = retriever.invoke(sub_q)
        all_docs.extend(docs)

    # Deduplicate
    seen = set()
    unique_docs = []
    for doc in all_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique_docs.append(doc)

    return unique_docs

Evaluating RAG: The RAGAS Framework

You can't improve what you can't measure. RAGAS (Retrieval Augmented Generation Assessment) provides standardized metrics for evaluating RAG systems:

Metric | What It Measures | Score Range | How to Improve
Faithfulness | Is the answer factually consistent with the retrieved context? | 0-1 (higher = better) | Better prompts, stricter context-only instructions
Answer Relevance | Does the answer actually address the question asked? | 0-1 (higher = better) | Better LLM, improved prompt engineering
Context Precision | Are the retrieved chunks actually relevant to the question? | 0-1 (higher = better) | Better chunking, reranking, hybrid search
Context Recall | Does the retrieved context cover all aspects of the ground truth answer? | 0-1 (higher = better) | Retrieve more chunks, better embeddings
Python - RAGAS Evaluation
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What is the remote work policy?", ...],
    "answer": [rag_answer_1, ...],          # RAG system outputs
    "contexts": [[chunk_1, chunk_2], ...],  # Retrieved chunks
    "ground_truth": ["Employees can work...", ...],  # Expected answers
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.79}

Production Considerations

Moving RAG from a notebook to production requires addressing several additional concerns:

Caching

Cache at multiple levels to reduce cost and latency:

  • Embedding cache: Don't re-embed the same query twice. Use Redis or an in-memory LRU cache.
  • Retrieval cache: Cache vector search results for repeated queries (with a TTL for freshness).
  • LLM response cache: Cache complete responses for identical query + context combinations. Semantic caching (using embeddings to match similar queries) extends cache hit rates.
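A minimal exact-match embedding cache might look like this, with an in-memory dict standing in for Redis and a lambda standing in for a real embeddings API call:

```python
import hashlib

class EmbeddingCache:
    """Exact-match embedding cache; swap the dict for Redis in production."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # e.g. a call to an embeddings API
        self._store = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]   # cached: no API call, no cost
        self.misses += 1
        vector = self.embed_fn(text)  # cache miss: pay for one embedding call
        self._store[key] = vector
        return vector

# Toy embed function standing in for a real API call
cache = EmbeddingCache(lambda t: [float(len(t))])
cache.embed("remote work policy")
cache.embed("remote work policy")  # served from cache
print(cache.hits, cache.misses)    # 1 1
```

Semantic caching extends this idea by matching new queries to cached ones via embedding similarity rather than an exact hash.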

Streaming

Stream the LLM response as it generates. Users see the first tokens in <500ms instead of waiting 3-5 seconds for the full response:

Python - Streaming RAG Response
# Stream with Anthropic Claude
with claude.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": prompt_with_context}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Document Refresh

Knowledge bases change. You need a strategy for keeping your index current:

  • Full re-index: Periodic complete rebuild. Simple but expensive for large corpora.
  • Incremental updates: Track document hashes, only re-embed changed or new documents.
  • Real-time sync: Webhook-driven updates when source documents change (Confluence, Notion, etc.).
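The incremental-update strategy above can be sketched as a hash comparison between the current corpus and what was last indexed (illustrative helper, not a library API):

```python
import hashlib

def plan_refresh(current_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Return (doc ids to re-embed, stale ids to delete, fresh hash map)."""
    to_embed, fresh_hashes = [], {}
    for doc_id, text in current_docs.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        fresh_hashes[doc_id] = h
        if indexed_hashes.get(doc_id) != h:
            to_embed.append(doc_id)  # new or changed since last index run
    # Documents removed from the corpus should be purged from the index
    to_delete = [d for d in indexed_hashes if d not in current_docs]
    return to_embed, to_delete, fresh_hashes

docs = {"handbook.pdf": "v2 text", "faq.md": "unchanged"}
old = {"handbook.pdf": hashlib.sha256(b"v1 text").hexdigest(),
       "faq.md": hashlib.sha256(b"unchanged").hexdigest(),
       "retired.md": "stale-hash"}
to_embed, to_delete, _ = plan_refresh(docs, old)
print(to_embed, to_delete)  # ['handbook.pdf'] ['retired.md']
```

Only `handbook.pdf` gets re-embedded; `faq.md` is skipped, saving embedding cost on every refresh cycle.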

Access Control

In enterprise RAG, not every user should see every document. Implement access control at the retrieval layer:

  • Tag each chunk with permission metadata (team, role, classification level)
  • Filter vector search results by the querying user's permissions
  • Audit which documents were used to generate each response
Security note: Never rely on the LLM to enforce access control. An LLM can be prompt-injected into ignoring access rules. Filtering must happen at the retrieval/database layer before context reaches the LLM.
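A minimal illustration of retrieval-layer filtering in plain Python. In practice you would push the same predicate into the vector database's metadata filter (e.g. a `where` clause) so unauthorized chunks are never returned at all; the chunk shape and `team` metadata key here are hypothetical:

```python
def filter_by_permissions(chunks: list[dict], user_teams: set[str]) -> list[dict]:
    """Drop retrieved chunks the user may not see, BEFORE building the LLM prompt."""
    return [c for c in chunks if c["metadata"]["team"] in user_teams]

retrieved = [
    {"text": "Q3 revenue numbers...", "metadata": {"team": "finance"}},
    {"text": "Deploy runbook...",     "metadata": {"team": "engineering"}},
    {"text": "Salary bands...",       "metadata": {"team": "hr"}},
]
allowed = filter_by_permissions(retrieved, user_teams={"engineering"})
print([c["metadata"]["team"] for c in allowed])  # ['engineering']
```

Because filtering happens before the context is assembled, a prompt-injected LLM never even sees the restricted chunks.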

Summary

RAG is the foundational multi-model pattern. Here's what to remember:

  • RAG = Embedding Model + Vector Store + LLM: three components working together
  • Start simple: Recursive chunking + OpenAI embeddings + ChromaDB + Claude/GPT-4
  • Measure first: Use RAGAS to establish baseline metrics before optimizing
  • Add reranking: The single highest-impact improvement for retrieval quality
  • Optimize chunking: Your chunk strategy matters more than your embedding model choice
  • Cache aggressively: Embedding and retrieval caching dramatically reduce costs
  • Stream always: Users perceive streamed responses as faster even when total time is the same