RAG Applications (LLM + Embedding)

Retrieval-Augmented Generation is the most common and impactful multi-model pattern. By combining embedding models for search with LLMs for generation, RAG lets you build AI applications grounded in your own data, sharply reducing hallucination and working around knowledge-cutoff limits.

What is RAG?

Retrieval-Augmented Generation (RAG) is an architecture that enhances LLM responses by first retrieving relevant information from an external knowledge base, then providing that information as context for the LLM to generate its answer.

Instead of relying solely on what the LLM learned during training, RAG gives the model access to your specific documents, databases, and knowledge — producing responses that are accurate, up-to-date, and verifiable.

💡
Why "Retrieval-Augmented"? The generation (LLM response) is augmented with retrieved context. The LLM doesn't guess — it reads the relevant documents and synthesizes an answer based on what it found. Think of it as giving the LLM an open-book exam instead of a closed-book one.

Why RAG: Solving LLM Limitations

RAG directly addresses the three biggest limitations of standalone LLMs:

LLM Limitation | Problem | How RAG Solves It
Hallucination | LLMs confidently generate plausible but incorrect information | RAG grounds responses in retrieved documents. The LLM cites specific sources, and you can verify claims against the original text.
Knowledge Cutoff | LLMs only know information up to their training date | RAG retrieves from a continuously updated knowledge base. New documents are indexed immediately and available for queries.
No Domain Expertise | LLMs lack knowledge of your company's internal docs, policies, and data | RAG indexes your private documents (Confluence pages, Slack threads, code repos, internal wikis), making them searchable and referenceable.
RAG doesn't eliminate hallucination entirely. An LLM can still hallucinate even with context. The key is that RAG dramatically reduces hallucination and makes it detectable — you can compare the response against the retrieved sources to verify accuracy.

RAG Architecture

A RAG system has two phases: indexing (offline, run at setup and whenever documents change) and querying (online, per-request). Here's the complete data flow:

RAG Architecture - Complete Data Flow
═══ INDEXING PHASE (offline) ═══

Documents ──→ [Document Loader] ──→ Raw Text
                                         │
                                    [Text Splitter]
                                         │
                                      Chunks (e.g., 512 tokens each)
                                         │
                                   [Embedding Model]
                                         │
                                    Dense Vectors (e.g., 1536-dim)
                                         │
                                   [Vector Database]  ──→  Stored & Indexed
                                   (with metadata)


═══ QUERY PHASE (online, per request) ═══

User Query ──→ [Embedding Model] ──→ Query Vector
                                            │
                                      [Vector Search]  (top-k similar)
                                            │
                                      Retrieved Chunks (k=5-20)
                                            │
                                       [Reranker]  (optional but recommended)
                                            │
                                      Top Chunks (k=3-5, reranked)
                                            │
                               ┌────────────┴────────────┐
                               │     PROMPT TEMPLATE      │
                               │  System: You are a ...   │
                               │  Context: {chunks}       │
                               │  Question: {query}       │
                               └────────────┬────────────┘
                                            │
                                        [LLM]
                                            │
                                     Generated Answer
                                    (with source citations)

RAG Components Deep Dive

1. Document Loaders

Document loaders extract text from various source formats. The quality of your RAG system starts here — garbage in, garbage out.

Python - Common Document Loaders
from langchain_community.document_loaders import (
    PyPDFLoader,           # PDF files
    UnstructuredMarkdownLoader,  # Markdown files
    CSVLoader,             # CSV/spreadsheet data
    WebBaseLoader,         # Web pages
    GitLoader,             # Git repositories
    NotionDirectoryLoader, # Notion exports
    ConfluenceLoader,      # Confluence pages (via API)
    SlackDirectoryLoader,  # Slack message exports
)

# Example: Load a PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")

2. Text Splitters (Chunking)

Documents must be split into smaller chunks that fit within embedding model context windows and provide focused, retrievable units of information.

Strategy | How It Works | Best For | Typical Size
Fixed-Size | Split every N characters/tokens with overlap | Simple docs, unstructured text | 500-1000 tokens, 100-200 overlap
Recursive | Split on paragraphs, then sentences, then words | General-purpose, most common choice | 500-1000 tokens
Semantic | Use embeddings to detect topic shifts and split at boundaries | Long documents with topic changes | Variable (200-1500 tokens)
Document-Aware | Split on markdown headers, HTML tags, or code blocks | Structured docs (markdown, HTML, code) | Variable per section
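The fixed-size strategy is simple enough to sketch directly. A character-based toy version (production splitters typically count tokens, not characters):

```python
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks, each overlapping the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

text = "A" * 1200
chunks = chunk_fixed(text, chunk_size=500, overlap=100)
print([len(c) for c in chunks])  # [500, 500, 400]
```

The overlap ensures that a sentence straddling a chunk boundary appears intact in at least one chunk.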
Python - Chunking Strategies
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    TokenTextSplitter,
)

# Recursive splitting (recommended default)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Max characters per chunk
    chunk_overlap=200,      # Overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""]  # Priority order
)
chunks = splitter.split_documents(documents)

# Markdown-aware splitting (for structured docs)
headers_to_split = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split)
md_chunks = md_splitter.split_text(markdown_text)

print(f"Split into {len(chunks)} chunks")
Chunking best practice: Start with RecursiveCharacterTextSplitter at 1000 chars with 200 overlap. This works well for 80% of use cases. Only move to semantic or document-aware chunking when you've validated that simple chunking isn't giving good retrieval results.
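For intuition on how semantic chunking works, here is a toy sketch that uses a bag-of-words "embedding" as a stand-in for a real embedding model, starting a new chunk wherever similarity between adjacent sentences drops below a threshold. A real implementation would call an actual embedding model and tune the threshold on your corpus:

```python
import math
import re
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_split(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk wherever similarity to the previous sentence drops."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append([cur])       # topic shift: begin a new chunk
        else:
            chunks[-1].append(cur)     # same topic: extend current chunk
    return chunks

sents = [
    "Our vacation policy grants 20 days of paid vacation.",
    "Unused vacation days roll over to the next year.",
    "The server deploys run on Kubernetes every Friday.",
]
print(len(semantic_split(sents)))  # 2 (policy sentences together, deploy sentence alone)
```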

3. Embedding Models

Embedding models convert text into dense vector representations that capture semantic meaning. Similar texts produce vectors that are close together in the vector space.
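"Close together" is typically measured with cosine similarity. A toy sketch with hand-made 4-dimensional vectors standing in for real embeddings (which have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: related concepts point in similar directions
v_cat     = [0.90, 0.10, 0.00, 0.20]
v_kitten  = [0.85, 0.15, 0.05, 0.25]
v_invoice = [0.00, 0.10, 0.95, 0.10]

print(round(cosine_similarity(v_cat, v_kitten), 3))   # high: related concepts
print(round(cosine_similarity(v_cat, v_invoice), 3))  # low: unrelated concepts
```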

Model | Provider | Dimensions | Max Tokens | Cost (per 1M tokens) | Best For
text-embedding-3-large | OpenAI | 3072 | 8191 | $0.13 | Highest accuracy, enterprise
text-embedding-3-small | OpenAI | 1536 | 8191 | $0.02 | Good balance of cost/quality
embed-v3 | Cohere | 1024 | 512 | $0.10 | Multilingual, search-optimized
BGE-large-en-v1.5 | BAAI (open) | 1024 | 512 | Free (self-hosted) | Open-source, self-hosted
E5-large-v2 | Microsoft (open) | 1024 | 512 | Free (self-hosted) | Open-source, multilingual
voyage-3 | Voyage AI | 1024 | 16000 | $0.06 | Code search, long context

4. Vector Stores

Vector databases store embeddings and enable fast similarity search. Choosing the right one depends on your scale and infrastructure preferences.

Database | Type | Max Vectors | Hybrid Search | Pricing | Best For
Pinecone | Managed cloud | Billions | Yes | Free tier + pay-per-use | Production, zero-ops, fast scaling
Weaviate | Open / managed | Billions | Yes | Open-source + cloud option | Hybrid search, rich filtering
ChromaDB | Open-source | Millions | No | Free | Prototyping, local dev, simple API
Qdrant | Open / managed | Billions | Yes | Open-source + cloud option | Performance, Rust engine, filtering
pgvector | Postgres ext. | Millions | Yes (with FTS) | Free (existing Postgres) | Teams already using PostgreSQL
Milvus | Open-source | Billions | Yes | Open-source + Zilliz Cloud | Large-scale, GPU-accelerated

Full RAG Pipeline: LangChain + ChromaDB + OpenAI

Here's a complete, working RAG pipeline you can run locally:

Python - Complete RAG Pipeline
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# ── Step 1: Load Documents ──
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")

# ── Step 2: Chunk Documents ──
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# ── Step 3: Create Embeddings & Store in ChromaDB ──
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="company_docs",
)
print("Vector store created and persisted")

# ── Step 4: Create Retriever ──
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},  # Return top 5 chunks
)

# ── Step 5: Create RAG Prompt ──
rag_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant that answers questions based on the provided context.
Use ONLY the context below to answer. If the context doesn't contain the answer,
say "I don't have enough information to answer this question."

Always cite which document/section your answer comes from.

Context:
{context}

Question: {question}

Answer:""")

# ── Step 6: Create LLM ──
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# ── Step 7: Build RAG Chain ──
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}, "
        f"Page {doc.metadata.get('page', '?')}]\n{doc.page_content}"
        for doc in docs
    )

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

# ── Step 8: Query! ──
answer = rag_chain.invoke("What is our company's remote work policy?")
print(answer)
Install dependencies: pip install langchain langchain-openai langchain-chroma langchain-community pypdf chromadb

RAG with Claude API

Here's the same concept using the Anthropic Claude API directly, without LangChain:

Python - RAG with Claude (Direct API)
import anthropic
import chromadb
from openai import OpenAI

# ── Initialize clients ──
claude = anthropic.Anthropic()
openai_client = OpenAI()
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("company_docs")

# ── Embed and store documents (one-time) ──
def index_documents(texts, metadatas, ids):
    """Embed texts and store in ChromaDB."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    embeddings = [item.embedding for item in response.data]
    collection.add(
        documents=texts,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids,
    )

# ── Retrieve relevant chunks ──
def retrieve(query, top_k=5):
    """Embed query and search for similar chunks."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[query],
    )
    query_embedding = response.data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
    )
    return results

# ── Generate answer with Claude ──
def rag_query(question):
    """Full RAG pipeline: retrieve context, then generate with Claude."""
    # Retrieve relevant chunks
    results = retrieve(question)
    context_chunks = results["documents"][0]
    metadatas = results["metadatas"][0]

    # Format context with source info
    context = "\n\n---\n\n".join(
        f"[Source: {m.get('source', 'unknown')}]\n{chunk}"
        for chunk, m in zip(context_chunks, metadatas)
    )

    # Generate with Claude
    message = claude.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="""You are a helpful assistant. Answer questions using ONLY
the provided context. Cite your sources. If the context doesn't
contain enough information, say so clearly.""",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return message.content[0].text

# ── Use it ──
answer = rag_query("What are the vacation day policies?")
print(answer)

Advanced RAG Techniques

Basic RAG gets you 70-80% of the way. These advanced techniques push quality significantly higher:

Hybrid Search

Combine dense vector search (semantic) with sparse keyword search (BM25) for better recall. Semantic search finds conceptually similar content, while keyword search catches exact term matches that embeddings might miss.

Python - Hybrid Search with Weaviate
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("Documents")

# Hybrid search: combines BM25 (keyword) + vector (semantic)
results = collection.query.hybrid(
    query="remote work policy for engineering team",
    alpha=0.7,   # 0.7 = 70% semantic, 30% keyword
    limit=10,
    return_metadata=weaviate.classes.query.MetadataQuery(score=True),
)

for obj in results.objects:
    print(f"Score: {obj.metadata.score:.3f} | {obj.properties['text'][:100]}")
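Under the hood, an alpha blend like this can be approximated by min-max normalizing each score list and taking a weighted sum. A minimal sketch in plain Python with toy scores (Weaviate's actual fusion algorithm may differ in details):

```python
def fuse_hybrid(semantic: dict[str, float], keyword: dict[str, float],
                alpha: float = 0.7) -> list[tuple[str, float]]:
    """Blend normalized semantic and keyword scores; alpha weights the semantic side."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    sem, kw = normalize(semantic), normalize(keyword)
    docs = set(sem) | set(kw)  # a doc may appear in only one result list
    fused = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

semantic_scores = {"doc_a": 0.91, "doc_b": 0.72, "doc_c": 0.55}  # cosine similarities
keyword_scores  = {"doc_b": 12.4, "doc_a": 3.1}                  # raw BM25 scores
for doc, score in fuse_hybrid(semantic_scores, keyword_scores, alpha=0.7):
    print(doc, round(score, 3))  # doc_a ranks first
```

Normalization matters because BM25 scores and cosine similarities live on entirely different scales.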

Reranking

After initial retrieval, a reranking model re-scores the results for more precise relevance. This is one of the highest-impact improvements you can make to a RAG system.

Python - Reranking with Cohere
import cohere

co = cohere.Client()

# Initial retrieval returns 20 candidates
initial_results = retriever.invoke(query)  # retriever built with search_kwargs={"k": 20}

# Rerank to find the best 5
rerank_response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[doc.page_content for doc in initial_results],
    top_n=5,
)

# Use reranked results for LLM context
reranked_docs = [initial_results[r.index] for r in rerank_response.results]
for r in rerank_response.results:
    print(f"Score: {r.relevance_score:.3f} | Index: {r.index}")
💡
Why reranking works so well: Embedding models optimize for fast approximate matching across millions of vectors. Reranking models do expensive pairwise comparison on a small set (10-20 candidates), giving much more accurate relevance scores. Retrieve wide (top-20), rerank narrow (top-5).

HyDE (Hypothetical Document Embeddings)

Instead of embedding the raw query, ask an LLM to generate a hypothetical answer, then embed that. The hypothetical answer is closer in embedding space to the actual documents than a short query.

Python - HyDE Implementation
def hyde_retrieve(query, retriever, llm):
    """Generate hypothetical answer, embed it, search with that."""
    # Step 1: Generate hypothetical answer
    hypothetical = llm.invoke(
        f"Write a detailed paragraph that would answer this question: {query}"
    )

    # Step 2: Use hypothetical answer as the search query
    # This embeds the hypothetical answer instead of the short query
    results = retriever.invoke(hypothetical.content)
    return results

Query Decomposition

For complex questions, break them into sub-questions, retrieve for each independently, then synthesize:

Python - Query Decomposition
def decompose_and_retrieve(complex_query, llm, retriever):
    """Break complex query into sub-queries, retrieve for each."""
    # Decompose
    decomposition = llm.invoke(
        f"""Break this complex question into 2-4 simpler sub-questions
that can be answered independently:

Question: {complex_query}

Return each sub-question on a new line, numbered."""
    )
    sub_queries = decomposition.content.strip().split("\n")

    # Retrieve for each sub-query
    all_docs = []
    for sub_q in sub_queries:
        docs = retriever.invoke(sub_q)
        all_docs.extend(docs)

    # Deduplicate
    seen = set()
    unique_docs = []
    for doc in all_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique_docs.append(doc)

    return unique_docs

Evaluating RAG: The RAGAS Framework

You can't improve what you can't measure. RAGAS (Retrieval Augmented Generation Assessment) provides standardized metrics for evaluating RAG systems:

Metric | What It Measures | Score Range | How to Improve
Faithfulness | Is the answer factually consistent with the retrieved context? | 0-1 (higher = better) | Better prompts, stricter context-only instructions
Answer Relevance | Does the answer actually address the question asked? | 0-1 (higher = better) | Better LLM, improved prompt engineering
Context Precision | Are the retrieved chunks actually relevant to the question? | 0-1 (higher = better) | Better chunking, reranking, hybrid search
Context Recall | Does the retrieved context cover all aspects of the ground truth answer? | 0-1 (higher = better) | Retrieve more chunks, better embeddings
Python - RAGAS Evaluation
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What is the remote work policy?", ...],
    "answer": [rag_answer_1, ...],          # RAG system outputs
    "contexts": [[chunk_1, chunk_2], ...],  # Retrieved chunks
    "ground_truth": ["Employees can work...", ...],  # Expected answers
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.79}

Production Considerations

Moving RAG from a notebook to production requires addressing several additional concerns:

Caching

Cache at multiple levels to reduce cost and latency:

  • Embedding cache: Don't re-embed the same query twice. Use Redis or an in-memory LRU cache.
  • Retrieval cache: Cache vector search results for repeated queries (with a TTL for freshness).
  • LLM response cache: Cache complete responses for identical query + context combinations. Semantic caching (using embeddings to match similar queries) extends cache hit rates.
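A minimal exact-match embedding cache might look like this, with an in-memory dict standing in for Redis and a lambda standing in for a real embeddings API call:

```python
import hashlib

class EmbeddingCache:
    """Exact-match embedding cache; swap the dict for Redis in production."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # e.g. a call to an embeddings API
        self._store = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]   # cached: no API call, no cost
        self.misses += 1
        vector = self.embed_fn(text)  # cache miss: pay for one embedding call
        self._store[key] = vector
        return vector

# Toy embed function standing in for a real API call
cache = EmbeddingCache(lambda t: [float(len(t))])
cache.embed("remote work policy")
cache.embed("remote work policy")  # served from cache
print(cache.hits, cache.misses)    # 1 1
```

Semantic caching extends this idea by matching new queries to cached ones via embedding similarity rather than an exact hash.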

Streaming

Stream the LLM response as it generates. Users see the first tokens in <500ms instead of waiting 3-5 seconds for the full response:

Python - Streaming RAG Response
# Stream with Anthropic Claude
with claude.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": prompt_with_context}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Document Refresh

Knowledge bases change. You need a strategy for keeping your index current:

  • Full re-index: Periodic complete rebuild. Simple but expensive for large corpora.
  • Incremental updates: Track document hashes, only re-embed changed or new documents.
  • Real-time sync: Webhook-driven updates when source documents change (Confluence, Notion, etc.).
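The incremental-update strategy above can be sketched as a hash comparison between the current corpus and what was last indexed (illustrative helper, not a library API):

```python
import hashlib

def plan_refresh(current_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Return (doc ids to re-embed, stale ids to delete, fresh hash map)."""
    to_embed, fresh_hashes = [], {}
    for doc_id, text in current_docs.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        fresh_hashes[doc_id] = h
        if indexed_hashes.get(doc_id) != h:
            to_embed.append(doc_id)  # new or changed since last index run
    # Documents removed from the corpus should be purged from the index
    to_delete = [d for d in indexed_hashes if d not in current_docs]
    return to_embed, to_delete, fresh_hashes

docs = {"handbook.pdf": "v2 text", "faq.md": "unchanged"}
old = {"handbook.pdf": hashlib.sha256(b"v1 text").hexdigest(),
       "faq.md": hashlib.sha256(b"unchanged").hexdigest(),
       "retired.md": "stale-hash"}
to_embed, to_delete, _ = plan_refresh(docs, old)
print(to_embed, to_delete)  # ['handbook.pdf'] ['retired.md']
```

Only `handbook.pdf` gets re-embedded; `faq.md` is skipped, saving embedding cost on every refresh cycle.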

Access Control

In enterprise RAG, not every user should see every document. Implement access control at the retrieval layer:

  • Tag each chunk with permission metadata (team, role, classification level)
  • Filter vector search results by the querying user's permissions
  • Audit which documents were used to generate each response
Security note: Never rely on the LLM to enforce access control. An LLM can be prompt-injected into ignoring access rules. Filtering must happen at the retrieval/database layer before context reaches the LLM.
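A minimal illustration of retrieval-layer filtering in plain Python. In practice you would push the same predicate into the vector database's metadata filter (e.g. a `where` clause) so unauthorized chunks are never returned at all; the chunk shape and `team` metadata key here are hypothetical:

```python
def filter_by_permissions(chunks: list[dict], user_teams: set[str]) -> list[dict]:
    """Drop retrieved chunks the user may not see, BEFORE building the LLM prompt."""
    return [c for c in chunks if c["metadata"]["team"] in user_teams]

retrieved = [
    {"text": "Q3 revenue numbers...", "metadata": {"team": "finance"}},
    {"text": "Deploy runbook...",     "metadata": {"team": "engineering"}},
    {"text": "Salary bands...",       "metadata": {"team": "hr"}},
]
allowed = filter_by_permissions(retrieved, user_teams={"engineering"})
print([c["metadata"]["team"] for c in allowed])  # ['engineering']
```

Because filtering happens before the context is assembled, a prompt-injected LLM never even sees the restricted chunks.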

Summary

RAG is the foundational multi-model pattern. Here's what to remember:

  • RAG = Embedding Model + Vector Store + LLM: three components working together
  • Start simple: Recursive chunking + OpenAI embeddings + ChromaDB + Claude/GPT-4
  • Measure first: Use RAGAS to establish baseline metrics before optimizing
  • Add reranking: The single highest-impact improvement for retrieval quality
  • Optimize chunking: Your chunk strategy matters more than your embedding model choice
  • Cache aggressively: Embedding and retrieval caching dramatically reduce costs
  • Stream always: Users perceive streamed responses as faster even when total time is the same