RAG Applications (LLM + Embedding)
Retrieval-Augmented Generation is the most common and impactful multi-model pattern. By combining embedding models for search with LLMs for generation, RAG lets you build AI applications grounded in your own data, dramatically reducing hallucinations and working around knowledge cutoffs.
What is RAG?
Retrieval-Augmented Generation (RAG) is an architecture that enhances LLM responses by first retrieving relevant information from an external knowledge base, then providing that information as context for the LLM to generate its answer.
Instead of relying solely on what the LLM learned during training, RAG gives the model access to your specific documents, databases, and knowledge — producing responses that are accurate, up-to-date, and verifiable.
Why RAG: Solving LLM Limitations
RAG directly addresses the three biggest limitations of standalone LLMs:
| LLM Limitation | Problem | How RAG Solves It |
|---|---|---|
| Hallucination | LLMs confidently generate plausible but incorrect information | RAG grounds responses in retrieved documents. The LLM cites specific sources, and you can verify claims against the original text. |
| Knowledge Cutoff | LLMs only know information up to their training date | RAG retrieves from a continuously updated knowledge base. New documents are indexed immediately and available for queries. |
| No Domain Expertise | LLMs lack knowledge of your company's internal docs, policies, and data | RAG indexes your private documents — Confluence pages, Slack threads, code repos, internal wikis — making them searchable and referenceable. |
RAG Architecture
A RAG system has two phases: indexing (offline, rerun whenever documents change) and querying (online, per-request). Here's the complete data flow:
```
═══ INDEXING PHASE (offline) ═══

Documents ──→ [Document Loader] ──→ Raw Text
                                       │
                               [Text Splitter]
                                       │
                     Chunks (e.g., 512 tokens each)
                                       │
                              [Embedding Model]
                                       │
                    Dense Vectors (e.g., 1536-dim)
                                       │
              [Vector Database] ──→ Stored & Indexed (with metadata)

═══ QUERY PHASE (online, per request) ═══

User Query ──→ [Embedding Model] ──→ Query Vector
                                       │
                     [Vector Search] (top-k similar)
                                       │
                       Retrieved Chunks (k=5-20)
                                       │
                [Reranker] (optional but recommended)
                                       │
                     Top Chunks (k=3-5, reranked)
                                       │
                ┌────────────┴────────────┐
                │ PROMPT TEMPLATE         │
                │ System: You are a ...   │
                │ Context: {chunks}       │
                │ Question: {query}       │
                └────────────┬────────────┘
                             │
                           [LLM]
                             │
            Generated Answer (with source citations)
```
RAG Components Deep Dive
1. Document Loaders
Document loaders extract text from various source formats. The quality of your RAG system starts here — garbage in, garbage out.
```python
from langchain_community.document_loaders import (
    PyPDFLoader,                 # PDF files
    UnstructuredMarkdownLoader,  # Markdown files
    CSVLoader,                   # CSV/spreadsheet data
    WebBaseLoader,               # Web pages
    GitLoader,                   # Git repositories
    NotionDirectoryLoader,       # Notion exports
    ConfluenceLoader,            # Confluence pages (via API)
    SlackDirectoryLoader,        # Slack message exports
)

# Example: Load a PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
```
2. Text Splitters (Chunking)
Documents must be split into smaller chunks that fit within embedding model context windows and provide focused, retrievable units of information.
| Strategy | How It Works | Best For | Typical Size |
|---|---|---|---|
| Fixed-Size | Split every N characters/tokens with overlap | Simple docs, unstructured text | 500-1000 tokens, 100-200 overlap |
| Recursive | Split on paragraphs, then sentences, then words | General-purpose, most common choice | 500-1000 tokens |
| Semantic | Use embeddings to detect topic shifts and split at boundaries | Long documents with topic changes | Variable (200-1500 tokens) |
| Document-Aware | Split on markdown headers, HTML tags, or code blocks | Structured docs (markdown, HTML, code) | Variable per section |
```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    TokenTextSplitter,
)

# Recursive splitting (recommended default)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Max characters per chunk
    chunk_overlap=200,  # Overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # Priority order
)
chunks = splitter.split_documents(documents)

# Markdown-aware splitting (for structured docs)
headers_to_split = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split)
md_chunks = md_splitter.split_text(markdown_text)

print(f"Split into {len(chunks)} chunks")
```
Recommended default: RecursiveCharacterTextSplitter at 1000 characters with 200 overlap. This works well for roughly 80% of use cases. Only move to semantic or document-aware chunking when you've validated that simple chunking isn't giving good retrieval results.

3. Embedding Models
Embedding models convert text into dense vector representations that capture semantic meaning. Similar texts produce vectors that are close together in the vector space.
| Model | Provider | Dimensions | Max Tokens | Cost (per 1M tokens) | Best For |
|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | 8191 | $0.13 | Highest accuracy, enterprise |
| text-embedding-3-small | OpenAI | 1536 | 8191 | $0.02 | Good balance of cost/quality |
| embed-v3 | Cohere | 1024 | 512 | $0.10 | Multilingual, search-optimized |
| BGE-large-en-v1.5 | BAAI (open) | 1024 | 512 | Free (self-hosted) | Open-source, self-hosted |
| E5-large-v2 | Microsoft (open) | 1024 | 512 | Free (self-hosted) | Open-source, multilingual |
| voyage-3 | Voyage AI | 1024 | 16000 | $0.06 | Code search, long context |
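To make "close together in the vector space" concrete, here is a minimal sketch of cosine similarity, the metric vector stores commonly use to compare embeddings. The three-dimensional vectors and their names are toy stand-ins for the real 1024- to 3072-dimensional outputs of the models above:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim "embeddings"; real models output e.g. 1536 dims
cat = [0.9, 0.3, 0.1]
kitten = [0.8, 0.4, 0.1]
invoice = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, kitten))   # high: related concepts
print(cosine_similarity(cat, invoice))  # low: unrelated concepts
```

Vector search is essentially this computation done efficiently over millions of stored vectors via approximate nearest-neighbor indexes.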
4. Vector Stores
Vector databases store embeddings and enable fast similarity search. Choosing the right one depends on your scale and infrastructure preferences.
| Database | Type | Max Vectors | Hybrid Search | Pricing | Best For |
|---|---|---|---|---|---|
| Pinecone | Managed cloud | Billions | Yes | Free tier + pay-per-use | Production, zero-ops, fast scaling |
| Weaviate | Open / managed | Billions | Yes | Open-source + cloud option | Hybrid search, rich filtering |
| ChromaDB | Open-source | Millions | No | Free | Prototyping, local dev, simple API |
| Qdrant | Open / managed | Billions | Yes | Open-source + cloud option | Performance, Rust engine, filtering |
| pgvector | Postgres ext. | Millions | Yes (with FTS) | Free (existing Postgres) | Teams already using PostgreSQL |
| Milvus | Open-source | Billions | Yes | Open-source + Zilliz Cloud | Large-scale, GPU-accelerated |
Full RAG Pipeline: LangChain + ChromaDB + OpenAI
Here's a complete, working RAG pipeline you can run locally:
```python
import os

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# ── Step 1: Load Documents ──
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")

# ── Step 2: Chunk Documents ──
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# ── Step 3: Create Embeddings & Store in ChromaDB ──
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="company_docs",
)
print("Vector store created and persisted")

# ── Step 4: Create Retriever ──
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},  # Return top 5 chunks
)

# ── Step 5: Create RAG Prompt ──
rag_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant that answers questions based on the provided context.

Use ONLY the context below to answer. If the context doesn't contain the answer,
say "I don't have enough information to answer this question."

Always cite which document/section your answer comes from.

Context:
{context}

Question: {question}

Answer:""")

# ── Step 6: Create LLM ──
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# ── Step 7: Build RAG Chain ──
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}, "
        f"Page {doc.metadata.get('page', '?')}]\n{doc.page_content}"
        for doc in docs
    )

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

# ── Step 8: Query! ──
answer = rag_chain.invoke("What is our company's remote work policy?")
print(answer)
```
Install the dependencies:

```bash
pip install langchain langchain-openai langchain-chroma langchain-community pypdf chromadb
```

RAG with Claude API
Here's the same concept using the Anthropic Claude API directly, without LangChain:
```python
import anthropic
import chromadb
from openai import OpenAI

# ── Initialize clients ──
claude = anthropic.Anthropic()
openai_client = OpenAI()
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("company_docs")

# ── Embed and store documents (one-time) ──
def index_documents(texts, metadatas, ids):
    """Embed texts and store in ChromaDB."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    embeddings = [item.embedding for item in response.data]
    collection.add(
        documents=texts,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids,
    )

# ── Retrieve relevant chunks ──
def retrieve(query, top_k=5):
    """Embed query and search for similar chunks."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[query],
    )
    query_embedding = response.data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
    )
    return results

# ── Generate answer with Claude ──
def rag_query(question):
    """Full RAG pipeline: retrieve context, then generate with Claude."""
    # Retrieve relevant chunks
    results = retrieve(question)
    context_chunks = results["documents"][0]
    metadatas = results["metadatas"][0]

    # Format context with source info
    context = "\n\n---\n\n".join(
        f"[Source: {m.get('source', 'unknown')}]\n{chunk}"
        for chunk, m in zip(context_chunks, metadatas)
    )

    # Generate with Claude
    message = claude.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="""You are a helpful assistant. Answer questions using ONLY the
provided context. Cite your sources. If the context doesn't contain enough
information, say so clearly.""",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return message.content[0].text

# ── Use it ──
answer = rag_query("What are the vacation day policies?")
print(answer)
```
Advanced RAG Techniques
Basic RAG gets you 70-80% of the way. These advanced techniques push quality significantly higher:
Hybrid Search
Combine dense vector search (semantic) with sparse keyword search (BM25) for better recall. Semantic search finds conceptually similar content, while keyword search catches exact term matches that embeddings might miss.
```python
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("Documents")

# Hybrid search: combines BM25 (keyword) + vector (semantic)
results = collection.query.hybrid(
    query="remote work policy for engineering team",
    alpha=0.7,  # 0.7 = 70% semantic, 30% keyword
    limit=10,
    return_metadata=weaviate.classes.query.MetadataQuery(score=True),
)

for obj in results.objects:
    print(f"Score: {obj.metadata.score:.3f} | {obj.properties['text'][:100]}")
```
Reranking
After initial retrieval, a reranking model re-scores the results for more precise relevance. This is one of the highest-impact improvements you can make to a RAG system.
```python
import cohere

co = cohere.Client()

# Initial retrieval returns 20 candidates
# (assumes the retriever was created with search_kwargs={"k": 20})
initial_results = retriever.invoke(query)

# Rerank to find the best 5
rerank_response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[doc.page_content for doc in initial_results],
    top_n=5,
)

# Use reranked results for LLM context
reranked_docs = [initial_results[r.index] for r in rerank_response.results]

for r in rerank_response.results:
    print(f"Score: {r.relevance_score:.3f} | Index: {r.index}")
```
HyDE (Hypothetical Document Embeddings)
Instead of embedding the raw query, ask an LLM to generate a hypothetical answer, then embed that. The hypothetical answer is closer in embedding space to the actual documents than a short query.
```python
def hyde_retrieve(query, retriever, llm):
    """Generate hypothetical answer, embed it, search with that."""
    # Step 1: Generate hypothetical answer
    hypothetical = llm.invoke(
        f"Write a detailed paragraph that would answer this question: {query}"
    )

    # Step 2: Use hypothetical answer as the search query
    # This embeds the hypothetical answer instead of the short query
    results = retriever.invoke(hypothetical.content)
    return results
```
Query Decomposition
For complex questions, break them into sub-questions, retrieve for each independently, then synthesize:
```python
def decompose_and_retrieve(complex_query, llm, retriever):
    """Break complex query into sub-queries, retrieve for each."""
    # Decompose
    decomposition = llm.invoke(
        f"""Break this complex question into 2-4 simpler sub-questions
that can be answered independently:

Question: {complex_query}

Return each sub-question on a new line, numbered."""
    )
    sub_queries = decomposition.content.strip().split("\n")

    # Retrieve for each sub-query
    all_docs = []
    for sub_q in sub_queries:
        docs = retriever.invoke(sub_q)
        all_docs.extend(docs)

    # Deduplicate
    seen = set()
    unique_docs = []
    for doc in all_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique_docs.append(doc)

    return unique_docs
```
Evaluating RAG: The RAGAS Framework
You can't improve what you can't measure. RAGAS (Retrieval Augmented Generation Assessment) provides standardized metrics for evaluating RAG systems:
| Metric | What It Measures | Score Range | How to Improve |
|---|---|---|---|
| Faithfulness | Is the answer factually consistent with the retrieved context? | 0-1 (higher = better) | Better prompts, stricter context-only instructions |
| Answer Relevance | Does the answer actually address the question asked? | 0-1 (higher = better) | Better LLM, improved prompt engineering |
| Context Precision | Are the retrieved chunks actually relevant to the question? | 0-1 (higher = better) | Better chunking, reranking, hybrid search |
| Context Recall | Does the retrieved context cover all aspects of the ground truth answer? | 0-1 (higher = better) | Retrieve more chunks, better embeddings |
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What is the remote work policy?", ...],
    "answer": [rag_answer_1, ...],          # RAG system outputs
    "contexts": [[chunk_1, chunk_2], ...],  # Retrieved chunks
    "ground_truth": ["Employees can work...", ...],  # Expected answers
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.79}
```
Production Considerations
Moving RAG from a notebook to production requires addressing several additional concerns:
Caching
Cache at multiple levels to reduce cost and latency:
- Embedding cache: Don't re-embed the same query twice. Use Redis or an in-memory LRU cache.
- Retrieval cache: Cache vector search results for repeated queries (with a TTL for freshness).
- LLM response cache: Cache complete responses for identical query + context combinations. Semantic caching (using embeddings to match similar queries) extends cache hit rates.
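The embedding cache from the first bullet can be sketched in a few lines. This is a minimal in-process version, assuming an `embed_fn` that wraps your embedding API call; in production you would back the dict with Redis and add eviction:

```python
import hashlib

class EmbeddingCache:
    """In-process embedding cache keyed by a hash of the input text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # e.g. wraps an embeddings API call
        self._store = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self.embed_fn(text)  # only pay for the API call on a miss
        self._store[key] = vector
        return vector

# Usage with a stand-in embed function
cache = EmbeddingCache(embed_fn=lambda t: [float(len(t))])
cache.embed("remote work policy")  # miss: calls embed_fn
cache.embed("remote work policy")  # hit: served from cache
print(cache.hits, cache.misses)    # 1 1
```

Hashing the text (rather than keying on the raw string) keeps keys fixed-size, which matters when keys go over the wire to Redis.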
Streaming
Stream the LLM response as it generates. Users see the first tokens in <500ms instead of waiting 3-5 seconds for the full response:
```python
# Stream with Anthropic Claude
with claude.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": prompt_with_context}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
Document Refresh
Knowledge bases change. You need a strategy for keeping your index current:
- Full re-index: Periodic complete rebuild. Simple but expensive for large corpora.
- Incremental updates: Track document hashes, only re-embed changed or new documents.
- Real-time sync: Webhook-driven updates when source documents change (Confluence, Notion, etc.).
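The incremental-update strategy reduces to comparing content hashes. Here is a minimal sketch of the bookkeeping; the function names and the dict-based document store are illustrative, not from any particular library:

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode()).hexdigest()

def plan_incremental_update(documents, indexed_hashes):
    """Decide which documents need (re-)embedding and which to remove.

    documents:      dict of doc_id -> current text
    indexed_hashes: dict of doc_id -> hash recorded at last index time
    Returns (to_embed, to_delete).
    """
    to_embed = [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # changed or new
    ]
    to_delete = [
        doc_id for doc_id in indexed_hashes if doc_id not in documents
    ]
    return to_embed, to_delete

# Example: one changed doc, one new doc, one removed doc
docs = {"handbook": "v2 text", "faq": "new page"}
indexed = {"handbook": content_hash("v1 text"), "old_page": "stale-hash"}
print(plan_incremental_update(docs, indexed))
# (['handbook', 'faq'], ['old_page'])
```

Only the documents in `to_embed` are sent to the embedding model, so a small daily diff costs a fraction of a full re-index.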
Access Control
In enterprise RAG, not every user should see every document. Implement access control at the retrieval layer:
- Tag each chunk with permission metadata (team, role, classification level)
- Filter vector search results by the querying user's permissions
- Audit which documents were used to generate each response
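The second bullet can be sketched as a permission filter over retrieved chunks. This post-filter version is for illustration only; in a real system you would push the filter into the vector store query itself (e.g. a metadata `where` clause) so unauthorized chunks never leave the database:

```python
def filter_by_permission(chunks, user_teams):
    """Drop retrieved chunks the querying user is not allowed to see.

    chunks: list of (text, metadata) tuples, where metadata carries
    a "team" tag attached at indexing time (an assumed convention).
    """
    allowed = set(user_teams)
    return [
        (text, meta) for text, meta in chunks
        if meta.get("team") in allowed
    ]

# Example: a retrieval result mixing documents from two teams
chunks = [
    ("Engineering on-call rota", {"team": "engineering"}),
    ("Executive comp bands", {"team": "leadership"}),
]
print(filter_by_permission(chunks, ["engineering"]))
# [('Engineering on-call rota', {'team': 'engineering'})]
```

Filtering before the prompt is assembled is what matters: a chunk that reaches the LLM context can leak into the answer regardless of UI-level permissions.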
Summary
RAG is the foundational multi-model pattern. Here's what to remember:
- RAG = Embedding Model + Vector Store + LLM — three models working together
- Start simple: Recursive chunking + OpenAI embeddings + ChromaDB + Claude/GPT-4
- Measure first: Use RAGAS to establish baseline metrics before optimizing
- Add reranking: The single highest-impact improvement for retrieval quality
- Optimize chunking: Your chunk strategy matters more than your embedding model choice
- Cache aggressively: Embedding and retrieval caching dramatically reduce costs
- Stream always: Users perceive streamed responses as faster even when total time is the same
Lilly Tech Systems