Document Ingestion Pipeline
The quality of your RAG system is determined by the quality of your ingestion pipeline. Poor chunking leads to irrelevant retrieval, which leads to bad answers. This lesson covers how to build a production ingestion pipeline that handles real-world documents reliably.
Why Ingestion Is the Hardest Part
Most RAG tutorials skip over ingestion with a single line: "split your documents into chunks." In production, ingestion is where most RAG systems fail. Real documents have headers, tables, images, footnotes, multi-column layouts, and nested structures that naive splitters destroy.
A production ingestion pipeline must handle:
- Multiple file formats (PDF, HTML, DOCX, Markdown, CSV, PPTX)
- Structural elements (tables, lists, code blocks, headers)
- Metadata extraction (author, date, section, page number)
- Deduplication (same content appearing in multiple documents)
- Incremental updates (add new docs without re-processing everything)
Chunking Strategies
Chunking is how you split documents into pieces small enough to fit in the retriever's search space but large enough to contain meaningful information. The strategy you choose has a direct impact on retrieval quality.
1. Fixed-Size Chunking
Split text into chunks of a fixed size (e.g., 512 tokens, or characters when measured with `len`) with overlap (e.g., 50) so context is not lost at chunk boundaries. Simple and fast, but it ignores document structure.

```python
from langchain.text_splitter import CharacterTextSplitter

# separator="" forces pure fixed-size cuts; swap in a token counter
# (e.g. via tiktoken) as length_function to measure tokens instead of characters
splitter = CharacterTextSplitter(
    separator="",
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
)
chunks = splitter.split_text(document_text)
# Each chunk is ~512 characters with 50-character overlap
```
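If you want true fixed-size windows without a framework dependency, the same idea can be sketched in plain Python. This is a minimal sketch: whitespace tokens stand in for a real tokenizer (a production system would use something like tiktoken for accurate token counts).

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size windows with overlap; whitespace tokens stand in for real tokens."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the tail
    return chunks
```

Swapping `text.split()` for a tokenizer's `encode`/`decode` pair turns this into genuine token-count chunking.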
2. Recursive Character Splitting
LangChain's RecursiveCharacterTextSplitter tries to split on paragraph breaks first, then sentences, then words. This preserves natural boundaries better than fixed-size splitting while maintaining consistent chunk sizes.
The separator hierarchy matters:
- `\n\n` — split on paragraph breaks first (preserves paragraphs)
- `\n` — split on line breaks next (preserves sentences within a line)
- `. ` — split on sentence boundaries
- ` ` — split on word boundaries (last resort)
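The recursion itself is simple. Here is a minimal dependency-free sketch of the idea; note that LangChain's implementation additionally merges adjacent small pieces back up to the target size and adds overlap, which this sketch omits:

```python
def recursive_split(text: str, max_len: int = 512,
                    separators: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator present, recursing into oversized pieces."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            pieces = []
            for part in text.split(sep):
                pieces.extend(recursive_split(part, max_len, separators))
            return pieces
    # No separator left: hard-cut as a last resort
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```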
3. Semantic Chunking
Group sentences into chunks based on embedding similarity. When adjacent sentences have very different embeddings, insert a chunk boundary. This produces chunks that are semantically coherent.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # split at 95th-percentile dissimilarity
)
chunks = semantic_splitter.split_text(document_text)
```
4. Document-Structure-Aware Chunking
For documents with clear structure (HTML with headers, Markdown with sections, legal documents with numbered clauses), split on structural boundaries. Each section becomes its own chunk, preserving the document's logical organization.
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = md_splitter.split_text(markdown_text)
# Each chunk retains its header hierarchy as metadata, e.g.
# metadata={"h1": "Chapter 1", "h2": "Section 1.1"} with the section text as content
```
Chunking Strategy Comparison
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Fixed-size | Simple, fast, predictable chunk count | Splits mid-sentence, ignores structure | Homogeneous text (articles, transcripts) |
| Recursive | Respects natural boundaries, easy to configure | May still split related content | General-purpose (best default choice) |
| Semantic | Coherent chunks, topic-aware | Requires embedding calls, slower, variable sizes | Diverse content (mixed topics in one doc) |
| Structure-aware | Preserves document organization, great metadata | Requires structured input, sections may be too large | Technical docs, legal, manuals, Markdown/HTML |
Document Parsing: Handling Real File Formats
Production RAG systems ingest documents in many formats. Here is how to handle the most common ones:
PDF Parsing
PDFs are the most challenging format because they store visual layout, not semantic structure. Use specialized parsers:
```python
# Option 1: PyPDF (fast, basic text extraction)
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("report.pdf")
pages = loader.load()  # one Document per page

# Option 2: Unstructured (handles tables, images, headers)
from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader("report.pdf", mode="elements")
elements = loader.load()  # separate elements: Title, NarrativeText, Table, etc.

# Option 3: LlamaParse (best quality, API-based)
from llama_parse import LlamaParse
parser = LlamaParse(result_type="markdown")
documents = parser.load_data("report.pdf")  # converts the PDF to clean Markdown
```
HTML Parsing
```python
from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader("page.html")
docs = loader.load()  # returns a list of Documents
# Strips HTML tags and extracts text content; the page title lands in metadata
```
Table Handling
Tables are particularly tricky because chunking destroys their row-column relationships. Best practices:
- Extract tables as separate chunks with their headers preserved
- Convert tables to Markdown format (LLMs understand Markdown tables well)
- Store the full table as one chunk even if it exceeds your chunk size
- Add table captions and surrounding context as metadata
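These practices can be combined in a small helper that renders an extracted table as a single Markdown chunk. This is an illustrative sketch (the function name and argument shapes are assumptions); parsers like Unstructured, or pandas' `DataFrame.to_markdown`, can produce similar output for you:

```python
def table_to_markdown(headers: list[str], rows: list[list], caption: str = "") -> str:
    """Render an extracted table as one Markdown chunk, caption included for context."""
    lines = [f"**{caption}**"] if caption else []
    lines.append("| " + " | ".join(headers) + " |")
    lines.append("|" + "---|" * len(headers))  # separator row
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)
```

Store the resulting string as one chunk, with the caption duplicated into metadata so filtered retrieval can find the table by topic.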
Metadata Extraction
Metadata is the secret weapon of high-quality RAG. Attaching metadata to each chunk enables filtered retrieval, better re-ranking, and source attribution.
```python
def enrich_chunk(chunk: str, source_doc: dict) -> dict:
    """Add metadata to each chunk for filtered retrieval."""
    return {
        "text": chunk,
        "metadata": {
            # Source tracking
            "source": source_doc["filename"],
            "page": source_doc.get("page_number"),
            "url": source_doc.get("url"),
            # Document structure
            "section": source_doc.get("section_header"),
            "doc_type": source_doc.get("type"),  # "policy", "faq", "manual"
            # Temporal
            "created_at": source_doc.get("created_at"),
            "updated_at": source_doc.get("updated_at"),
            # Access control (for multi-tenant RAG)
            "department": source_doc.get("department"),
            "access_level": source_doc.get("access_level"),
        },
    }
```
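Production vector stores (Chroma, Pinecone, Qdrant, and others) expose metadata filters natively; the effect they provide can be sketched in plain Python. The function below is a hypothetical helper, not any store's API:

```python
def filter_chunks(chunks: list[dict], min_updated: str = "", **criteria) -> list[dict]:
    """Keep only chunks whose metadata matches all criteria and is fresh enough."""
    kept = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if any(meta.get(k) != v for k, v in criteria.items()):
            continue  # fails an exact-match criterion, e.g. doc_type="faq"
        if min_updated and (meta.get("updated_at") or "") < min_updated:
            continue  # stale; ISO-8601 strings compare chronologically
        kept.append(chunk)
    return kept
```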
At minimum, store source, page, and updated_at metadata with every chunk. This enables source citations in answers and lets you filter out stale documents during retrieval.
Production Ingestion Pipeline Architecture
Here is a complete ingestion pipeline suitable for production workloads:
```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

class IngestionPipeline:
    def __init__(self, vector_store, embeddings, splitter):
        self.vector_store = vector_store
        self.embeddings = embeddings
        self.splitter = splitter
        self.processed_hashes = set()  # track processed docs for deduplication

    def ingest_document(self, file_path: str) -> int:
        """Ingest a single document into the vector store."""
        # 1. Compute hash for deduplication
        content_hash = self._hash_file(file_path)
        if content_hash in self.processed_hashes:
            return 0  # skip duplicate

        # 2. Parse document based on file type
        text, doc_metadata = self._parse(file_path)

        # 3. Chunk the document
        chunks = self.splitter.split_text(text)

        # 4. Enrich chunks with metadata
        documents = []
        for i, chunk in enumerate(chunks):
            documents.append({
                "text": chunk,
                "metadata": {
                    **doc_metadata,
                    "chunk_index": i,
                    "total_chunks": len(chunks),
                    "content_hash": content_hash,
                    "ingested_at": datetime.now(timezone.utc).isoformat(),
                },
            })

        # 5. Embed and store
        texts = [d["text"] for d in documents]
        metadatas = [d["metadata"] for d in documents]
        self.vector_store.add_texts(texts, metadatas=metadatas)
        self.processed_hashes.add(content_hash)
        return len(documents)

    def ingest_directory(self, dir_path: str) -> dict:
        """Ingest all supported files in a directory."""
        stats = {"processed": 0, "skipped": 0, "chunks": 0}
        supported = {".pdf", ".md", ".html", ".txt", ".docx", ".csv"}
        for file_path in Path(dir_path).rglob("*"):
            if file_path.suffix.lower() in supported:
                n = self.ingest_document(str(file_path))
                if n > 0:
                    stats["processed"] += 1
                    stats["chunks"] += n
                else:
                    stats["skipped"] += 1
        return stats

    def _hash_file(self, path: str) -> str:
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def _parse(self, path: str) -> tuple:
        """Route to appropriate parser based on file extension."""
        ext = Path(path).suffix.lower()
        # ... parser routing logic
        return text, metadata
```
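One gap worth closing early: `processed_hashes` lives in memory, so deduplication resets on every restart. A minimal persistence sketch (the class and file name are illustrative assumptions; a real deployment might use a database table instead):

```python
import json
from pathlib import Path

class HashRegistry:
    """Persist content hashes so re-runs skip already-ingested documents."""

    def __init__(self, path: str = "ingested_hashes.json"):
        self.path = Path(path)
        self.hashes = set(json.loads(self.path.read_text())) if self.path.exists() else set()

    def seen(self, content_hash: str) -> bool:
        return content_hash in self.hashes

    def add(self, content_hash: str) -> None:
        self.hashes.add(content_hash)
        self.path.write_text(json.dumps(sorted(self.hashes)))
```

Swapping the pipeline's in-memory set for a registry like this — check `seen()` before ingesting, call `add()` after storing — makes incremental updates survive restarts.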
Key Takeaways
- Ingestion quality determines RAG quality — invest here first.
- Start with recursive character splitting at 512 tokens with 50-token overlap as your baseline.
- Use specialized parsers for PDFs — generic text extraction loses table and header structure.
- Always attach metadata (source, page, date, section) to every chunk for filtering and citation.
- Build deduplication and incremental updates into your pipeline from day one.
Lilly Tech Systems