Intermediate

Data Ingestion

Load and preprocess data from PDFs, web pages, databases, APIs, and collaboration tools for your RAG pipeline.

Data Sources Overview

| Source | Loader | Considerations |
| --- | --- | --- |
| PDFs | PyPDF, Unstructured, PDFPlumber | Tables, images, scanned docs |
| Web Pages | BeautifulSoup, Playwright, Firecrawl | JavaScript rendering, rate limiting |
| Databases | SQLAlchemy, direct connectors | Schema mapping, incremental sync |
| APIs | REST/GraphQL clients | Pagination, authentication |
| Slack | Slack API, LangChain loader | Threads, attachments, permissions |
| Notion | Notion API, LangChain loader | Blocks, databases, nested pages |
| Confluence | Confluence API, Atlassian SDK | Spaces, permissions, macros |
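
Two of the considerations above, pagination and authentication, recur across API sources. Below is a minimal sketch of cursor-based pagination over a hypothetical client; the `items` and `next_cursor` field names are illustrative, not from any specific API:

```python
from typing import Callable, Iterator, Optional

def paginate(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield items from a cursor-paginated API until no cursor remains.

    `fetch_page` takes a cursor (None for the first page) and returns a
    dict with "items" and an optional "next_cursor".
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break

# Fake two-page client standing in for a real REST endpoint
_pages = {
    None: {"items": [1, 2], "next_cursor": "p2"},
    "p2": {"items": [3], "next_cursor": None},
}

items = list(paginate(lambda cursor: _pages[cursor]))
```

The same loop shape works for offset- or token-based pagination; only the cursor bookkeeping changes.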

Loading PDFs

Python - PDF Loading
from langchain_community.document_loaders import PyPDFLoader

# Load a single PDF
loader = PyPDFLoader("company_handbook.pdf")
pages = loader.load()

# Each page is a Document with content and metadata
for page in pages:
    print(f"Page {page.metadata['page']}: {page.page_content[:100]}...")

# For better table extraction, use Unstructured
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader(
    "report_with_tables.pdf",
    mode="elements",  # Preserves document structure
    strategy="hi_res"  # Better for complex layouts
)
docs = loader.load()

Loading Web Pages

Python - Web Loading
from langchain_community.document_loaders import WebBaseLoader

# Load a single web page
loader = WebBaseLoader("https://docs.example.com/getting-started")
docs = loader.load()

# Load multiple pages
urls = [
    "https://docs.example.com/setup",
    "https://docs.example.com/api",
    "https://docs.example.com/faq",
]
loader = WebBaseLoader(urls)
docs = loader.load()

# For JavaScript-heavy sites, use Playwright
from langchain_community.document_loaders import PlaywrightURLLoader

loader = PlaywrightURLLoader(urls=urls, remove_selectors=["nav", "footer"])
docs = loader.load()
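
Rate limiting, noted in the table above, is worth handling explicitly when crawling many pages. A minimal sketch with an injectable sleep function so the pacing logic is testable; the fetcher here is a stand-in, not a real HTTP client:

```python
import time
from typing import Callable, List

def fetch_all(urls: List[str], fetch: Callable[[str], str],
              delay: float = 1.0, sleep=time.sleep) -> List[str]:
    """Fetch each URL, pausing `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # be polite: space out successive requests
        results.append(fetch(url))
    return results

# Record delays instead of actually sleeping, to verify the pacing
calls = []
pages = fetch_all(
    ["https://docs.example.com/a", "https://docs.example.com/b"],
    fetch=lambda u: f"<html>{u}</html>",
    delay=0.5,
    sleep=calls.append,
)
```

A fixed delay is the simplest policy; production crawlers typically also honor `Retry-After` headers and back off on errors.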

Loading from Databases

Python - Database Loading
from langchain_community.document_loaders import SQLDatabaseLoader
from langchain_community.utilities import SQLDatabase

# Connect to database
db = SQLDatabase.from_uri("postgresql://user:pass@localhost/mydb")

# Load data with a query
loader = SQLDatabaseLoader(
    query="SELECT title, content, updated_at FROM articles WHERE published = true",
    db=db,
    page_content_columns=["title", "content"],
    metadata_columns=["updated_at"]
)
docs = loader.load()
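
The table above lists incremental sync as a database concern: on repeat runs you only want rows changed since the last ingestion. A minimal checkpoint sketch over plain dicts; in practice the `updated_at` comparison would go into the SQL `WHERE` clause, and the checkpoint would be persisted between runs:

```python
from datetime import datetime
from typing import List, Tuple

def incremental_load(rows: List[dict],
                     last_sync: datetime) -> Tuple[List[dict], datetime]:
    """Return rows updated after the checkpoint, plus the new checkpoint."""
    fresh = [r for r in rows if r["updated_at"] > last_sync]
    new_checkpoint = max((r["updated_at"] for r in fresh), default=last_sync)
    return fresh, new_checkpoint

rows = [
    {"title": "Old article", "updated_at": datetime(2025, 1, 1)},
    {"title": "New article", "updated_at": datetime(2025, 6, 1)},
]
fresh, checkpoint = incremental_load(rows, last_sync=datetime(2025, 3, 1))
```

Only "New article" survives the filter, and the checkpoint advances to its timestamp for the next run.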

Loading from Collaboration Tools

Python - Notion & Slack
# Notion
from langchain_community.document_loaders import NotionDBLoader

loader = NotionDBLoader(
    integration_token="secret_...",
    database_id="abc123..."
)
docs = loader.load()

# Slack
from langchain_community.document_loaders import SlackDirectoryLoader

loader = SlackDirectoryLoader(
    zip_path="slack_export.zip",
    workspace_url="https://myteam.slack.com"
)
docs = loader.load()

Metadata Extraction

Rich metadata improves retrieval quality. Always extract and attach metadata to your documents:

Python - Metadata Enrichment
from langchain.schema import Document

# Enrich documents with metadata
enriched_docs = []
for doc in raw_docs:
    enriched = Document(
        page_content=doc.page_content,
        metadata={
            **doc.metadata,
            "source_type": "pdf",
            "department": "engineering",
            "doc_title": extract_title(doc),
            "word_count": len(doc.page_content.split()),
            "last_updated": "2026-01-15"
        }
    )
    enriched_docs.append(enriched)

Metadata enables filtering: during retrieval, you can filter by metadata (e.g., "only search engineering docs from the last year") to improve relevance and reduce noise.
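
That filtering can be sketched without any vector store: it is just a predicate over document metadata. The documents below are plain dicts, and the `department` field follows the enrichment example above:

```python
from typing import List

def filter_by_metadata(docs: List[dict], **required) -> List[dict]:
    """Keep documents whose metadata matches every required key/value pair."""
    return [
        d for d in docs
        if all(d["metadata"].get(k) == v for k, v in required.items())
    ]

docs = [
    {"page_content": "Deploy guide", "metadata": {"department": "engineering"}},
    {"page_content": "Leave policy", "metadata": {"department": "hr"}},
]
hits = filter_by_metadata(docs, department="engineering")
```

Real vector stores expose the same idea as a `filter` argument on similarity search; the exact syntax varies by backend.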

Data Cleaning and Preprocessing

Raw documents often contain noise. Clean them before chunking:

Python - Text Cleaning
import re
import unicodedata

def clean_text(text: str) -> str:
    """Clean extracted text for RAG ingestion."""
    # Collapse runs of whitespace into single spaces
    text = re.sub(r'\s+', ' ', text)
    # Remove page-number headers/footers left over from PDF extraction
    text = re.sub(r'Page \d+ of \d+', '', text)
    # Remove special characters that add no meaning
    text = re.sub(r'[^\w\s.,;:!?()\'"-]', '', text)
    # Normalize unicode without discarding accented characters
    text = unicodedata.normalize('NFKC', text)
    return text.strip()

Garbage in, garbage out: poor quality input data is the most common cause of poor RAG results. Invest time in cleaning and validating your data before indexing it.
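
Validation can be as simple as dropping empty or near-empty texts and exact duplicates before indexing. A minimal sketch; the word-count threshold is arbitrary and worth tuning for your corpus:

```python
import hashlib
from typing import List

def validate_texts(texts: List[str], min_words: int = 5) -> List[str]:
    """Drop texts that are too short or exact duplicates of earlier ones."""
    seen = set()
    kept = []
    for text in texts:
        if len(text.split()) < min_words:
            continue  # too short to be a useful chunk
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier text
        seen.add(digest)
        kept.append(text)
    return kept

texts = [
    "",
    "A complete paragraph about deployment procedures and rollback.",
    "A complete paragraph about deployment procedures and rollback.",
    "ok",
]
clean = validate_texts(texts)
```

Hash-based deduplication only catches exact copies; near-duplicate detection (e.g., MinHash) is a separate step if your sources overlap heavily.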

What's Next?

The next lesson covers chunking strategies — how to split documents into optimal pieces for embedding and retrieval.