Document Ingestion Pipeline
The quality of your RAG system is determined by the quality of your ingestion pipeline. Poor chunking leads to irrelevant retrieval, which leads to bad answers. This lesson covers how to build a production ingestion pipeline that handles real-world documents reliably.
Why Ingestion Is the Hardest Part
Most RAG tutorials skip over ingestion with a single line: "split your documents into chunks." In production, ingestion is where most RAG systems fail. Real documents have headers, tables, images, footnotes, multi-column layouts, and nested structures that naive splitters destroy.
A production ingestion pipeline must handle:
- Multiple file formats (PDF, HTML, DOCX, Markdown, CSV, PPTX)
- Structural elements (tables, lists, code blocks, headers)
- Metadata extraction (author, date, section, page number)
- Deduplication (same content appearing in multiple documents)
- Incremental updates (add new docs without re-processing everything)
Chunking Strategies
Chunking is how you split documents into pieces small enough to fit in the retriever's search space but large enough to contain meaningful information. The strategy you choose has a direct impact on retrieval quality.
1. Fixed-Size Chunking
Split text into chunks of a fixed size (e.g., 512 tokens, or characters when measured with `len`) with overlap (e.g., 50) so context is not lost at chunk boundaries. Simple and fast, but it ignores document structure.

```python
from langchain.text_splitter import CharacterTextSplitter

# separator="" forces pure fixed-size cuts; swap in a token counter
# (e.g. via tiktoken) as length_function to measure tokens instead of characters
splitter = CharacterTextSplitter(
    separator="",
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
)
chunks = splitter.split_text(document_text)
# Each chunk is ~512 characters with 50-character overlap
```
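If you want true fixed-size windows without a framework dependency, the same idea can be sketched in plain Python. This is a minimal sketch: whitespace tokens stand in for a real tokenizer (a production system would use something like tiktoken for accurate token counts).

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size windows with overlap; whitespace tokens stand in for real tokens."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the tail
    return chunks
```

Swapping `text.split()` for a tokenizer's `encode`/`decode` pair turns this into genuine token-count chunking.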
2. Recursive Character Splitting
LangChain's RecursiveCharacterTextSplitter tries to split on paragraph breaks first, then sentences, then words. This preserves natural boundaries better than fixed-size splitting while maintaining consistent chunk sizes.
The separator hierarchy matters:
- `\n\n` — split on paragraph breaks first (preserves paragraphs)
- `\n` — split on line breaks next (preserves sentences within a line)
- `. ` — split on sentence boundaries
- ` ` — split on word boundaries (last resort)
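The recursion itself is simple. Here is a minimal dependency-free sketch of the idea; note that LangChain's implementation additionally merges adjacent small pieces back up to the target size and adds overlap, which this sketch omits:

```python
def recursive_split(text: str, max_len: int = 512,
                    separators: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator present, recursing into oversized pieces."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            pieces = []
            for part in text.split(sep):
                pieces.extend(recursive_split(part, max_len, separators))
            return pieces
    # No separator left: hard-cut as a last resort
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```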
3. Semantic Chunking
Group sentences into chunks based on embedding similarity. When adjacent sentences have very different embeddings, insert a chunk boundary. This produces chunks that are semantically coherent.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # split at 95th-percentile dissimilarity
)
chunks = semantic_splitter.split_text(document_text)
```
4. Document-Structure-Aware Chunking
For documents with clear structure (HTML with headers, Markdown with sections, legal documents with numbered clauses), split on structural boundaries. Each section becomes its own chunk, preserving the document's logical organization.
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = md_splitter.split_text(markdown_text)
# Each chunk retains its header hierarchy as metadata, e.g.
# metadata={"h1": "Chapter 1", "h2": "Section 1.1"} with the section text as content
```
Chunking Strategy Comparison
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Fixed-size | Simple, fast, predictable chunk count | Splits mid-sentence, ignores structure | Homogeneous text (articles, transcripts) |
| Recursive | Respects natural boundaries, easy to configure | May still split related content | General-purpose (best default choice) |
| Semantic | Coherent chunks, topic-aware | Requires embedding calls, slower, variable sizes | Diverse content (mixed topics in one doc) |
| Structure-aware | Preserves document organization, great metadata | Requires structured input, sections may be too large | Technical docs, legal, manuals, Markdown/HTML |
Document Parsing: Handling Real File Formats
Production RAG systems ingest documents in many formats. Here is how to handle the most common ones:
PDF Parsing
PDFs are the most challenging format because they store visual layout, not semantic structure. Use specialized parsers:
```python
# Option 1: PyPDF (fast, basic text extraction)
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("report.pdf")
pages = loader.load()  # one Document per page

# Option 2: Unstructured (handles tables, images, headers)
from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader("report.pdf", mode="elements")
elements = loader.load()  # separate elements: Title, NarrativeText, Table, etc.

# Option 3: LlamaParse (best quality, API-based)
from llama_parse import LlamaParse
parser = LlamaParse(result_type="markdown")
documents = parser.load_data("report.pdf")  # converts the PDF to clean Markdown
```
HTML Parsing
```python
from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader("page.html")
docs = loader.load()  # returns a list of Documents
# Strips HTML tags and extracts text content; the page title lands in metadata
```
Table Handling
Tables are particularly tricky because chunking destroys their row-column relationships. Best practices:
- Extract tables as separate chunks with their headers preserved
- Convert tables to Markdown format (LLMs understand Markdown tables well)
- Store the full table as one chunk even if it exceeds your chunk size
- Add table captions and surrounding context as metadata
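These practices can be combined in a small helper that renders an extracted table as a single Markdown chunk. This is an illustrative sketch (the function name and argument shapes are assumptions); parsers like Unstructured, or pandas' `DataFrame.to_markdown`, can produce similar output for you:

```python
def table_to_markdown(headers: list[str], rows: list[list], caption: str = "") -> str:
    """Render an extracted table as one Markdown chunk, caption included for context."""
    lines = [f"**{caption}**"] if caption else []
    lines.append("| " + " | ".join(headers) + " |")
    lines.append("|" + "---|" * len(headers))  # separator row
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)
```

Store the resulting string as one chunk, with the caption duplicated into metadata so filtered retrieval can find the table by topic.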
Metadata Extraction
Metadata is the secret weapon of high-quality RAG. Attaching metadata to each chunk enables filtered retrieval, better re-ranking, and source attribution.
```python
def enrich_chunk(chunk: str, source_doc: dict) -> dict:
    """Add metadata to each chunk for filtered retrieval."""
    return {
        "text": chunk,
        "metadata": {
            # Source tracking
            "source": source_doc["filename"],
            "page": source_doc.get("page_number"),
            "url": source_doc.get("url"),
            # Document structure
            "section": source_doc.get("section_header"),
            "doc_type": source_doc.get("type"),  # "policy", "faq", "manual"
            # Temporal
            "created_at": source_doc.get("created_at"),
            "updated_at": source_doc.get("updated_at"),
            # Access control (for multi-tenant RAG)
            "department": source_doc.get("department"),
            "access_level": source_doc.get("access_level"),
        },
    }
```
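Production vector stores (Chroma, Pinecone, Qdrant, and others) expose metadata filters natively; the effect they provide can be sketched in plain Python. The function below is a hypothetical helper, not any store's API:

```python
def filter_chunks(chunks: list[dict], min_updated: str = "", **criteria) -> list[dict]:
    """Keep only chunks whose metadata matches all criteria and is fresh enough."""
    kept = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if any(meta.get(k) != v for k, v in criteria.items()):
            continue  # fails an exact-match criterion, e.g. doc_type="faq"
        if min_updated and (meta.get("updated_at") or "") < min_updated:
            continue  # stale; ISO-8601 strings compare chronologically
        kept.append(chunk)
    return kept
```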
At minimum, store source, page, and updated_at metadata with every chunk. This enables source citations in answers and lets you filter out stale documents during retrieval.
Production Ingestion Pipeline Architecture
Here is a complete ingestion pipeline suitable for production workloads:
```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

class IngestionPipeline:
    def __init__(self, vector_store, embeddings, splitter):
        self.vector_store = vector_store
        self.embeddings = embeddings
        self.splitter = splitter
        self.processed_hashes = set()  # track processed docs for deduplication

    def ingest_document(self, file_path: str) -> int:
        """Ingest a single document into the vector store."""
        # 1. Compute hash for deduplication
        content_hash = self._hash_file(file_path)
        if content_hash in self.processed_hashes:
            return 0  # skip duplicate

        # 2. Parse document based on file type
        text, doc_metadata = self._parse(file_path)

        # 3. Chunk the document
        chunks = self.splitter.split_text(text)

        # 4. Enrich chunks with metadata
        documents = []
        for i, chunk in enumerate(chunks):
            documents.append({
                "text": chunk,
                "metadata": {
                    **doc_metadata,
                    "chunk_index": i,
                    "total_chunks": len(chunks),
                    "content_hash": content_hash,
                    "ingested_at": datetime.now(timezone.utc).isoformat(),
                },
            })

        # 5. Embed and store
        texts = [d["text"] for d in documents]
        metadatas = [d["metadata"] for d in documents]
        self.vector_store.add_texts(texts, metadatas=metadatas)
        self.processed_hashes.add(content_hash)
        return len(documents)

    def ingest_directory(self, dir_path: str) -> dict:
        """Ingest all supported files in a directory."""
        stats = {"processed": 0, "skipped": 0, "chunks": 0}
        supported = {".pdf", ".md", ".html", ".txt", ".docx", ".csv"}
        for file_path in Path(dir_path).rglob("*"):
            if file_path.suffix.lower() in supported:
                n = self.ingest_document(str(file_path))
                if n > 0:
                    stats["processed"] += 1
                    stats["chunks"] += n
                else:
                    stats["skipped"] += 1
        return stats

    def _hash_file(self, path: str) -> str:
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def _parse(self, path: str) -> tuple:
        """Route to appropriate parser based on file extension."""
        ext = Path(path).suffix.lower()
        # ... parser routing logic
        return text, metadata
```
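One gap worth closing early: `processed_hashes` lives in memory, so deduplication resets on every restart. A minimal persistence sketch (the class and file name are illustrative assumptions; a real deployment might use a database table instead):

```python
import json
from pathlib import Path

class HashRegistry:
    """Persist content hashes so re-runs skip already-ingested documents."""

    def __init__(self, path: str = "ingested_hashes.json"):
        self.path = Path(path)
        self.hashes = set(json.loads(self.path.read_text())) if self.path.exists() else set()

    def seen(self, content_hash: str) -> bool:
        return content_hash in self.hashes

    def add(self, content_hash: str) -> None:
        self.hashes.add(content_hash)
        self.path.write_text(json.dumps(sorted(self.hashes)))
```

Swapping the pipeline's in-memory set for a registry like this — check `seen()` before ingesting, call `add()` after storing — makes incremental updates survive restarts.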
Key Takeaways
- Ingestion quality determines RAG quality — invest here first.
- Start with recursive character splitting at 512 tokens with 50-token overlap as your baseline.
- Use specialized parsers for PDFs — generic text extraction loses table and header structure.
- Always attach metadata (source, page, date, section) to every chunk for filtering and citation.
- Build deduplication and incremental updates into your pipeline from day one.
Lilly Tech Systems