Intermediate
Data Ingestion
Load and preprocess data from PDFs, web pages, databases, APIs, and collaboration tools for your RAG pipeline.
Data Sources Overview
| Source | Loader | Considerations |
|---|---|---|
| PDFs | PyPDF, Unstructured, PDFPlumber | Tables, images, scanned docs |
| Web Pages | BeautifulSoup, Playwright, Firecrawl | JavaScript rendering, rate limiting |
| Databases | SQLAlchemy, direct connectors | Schema mapping, incremental sync |
| APIs | REST/GraphQL clients | Pagination, authentication |
| Slack | Slack API, LangChain loader | Threads, attachments, permissions |
| Notion | Notion API, LangChain loader | Blocks, databases, nested pages |
| Confluence | Confluence API, Atlassian SDK | Spaces, permissions, macros |
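The table flags pagination as the key consideration for APIs, and that source has no dedicated section below, so here is a minimal sketch of cursor-based pagination. The `fetch_page` function is a hypothetical stand-in for your real API client (e.g. a `requests` call that reads a `next_cursor` field from the response):

```python
# Hypothetical stand-in for an API client: returns one page of records
# plus the cursor for the next page (None when exhausted).
def fetch_page(cursor=None):
    data = {
        None:    ([{"id": 1, "body": "alpha"}, {"id": 2, "body": "beta"}], "page2"),
        "page2": ([{"id": 3, "body": "gamma"}], None),
    }
    return data[cursor]

def load_all_records():
    """Follow the pagination cursor until the API reports no more pages."""
    records, cursor = [], None
    while True:
        page, cursor = fetch_page(cursor)
        records.extend(page)
        if cursor is None:
            return records

docs = load_all_records()
```

The same loop shape works for offset-based pagination; just replace the cursor with a page number and stop when a page comes back empty. Remember to respect the API's rate limits between requests.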
Loading PDFs
Python - PDF Loading
```python
from langchain_community.document_loaders import PyPDFLoader

# Load a single PDF
loader = PyPDFLoader("company_handbook.pdf")
pages = loader.load()

# Each page is a Document with content and metadata
for page in pages:
    print(f"Page {page.metadata['page']}: {page.page_content[:100]}...")

# For better table extraction, use Unstructured
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader(
    "report_with_tables.pdf",
    mode="elements",    # Preserves document structure
    strategy="hi_res"   # Better for complex layouts
)
docs = loader.load()
```
Loading Web Pages
Python - Web Loading
```python
from langchain_community.document_loaders import WebBaseLoader

# Load a single web page
loader = WebBaseLoader("https://docs.example.com/getting-started")
docs = loader.load()

# Load multiple pages
urls = [
    "https://docs.example.com/setup",
    "https://docs.example.com/api",
    "https://docs.example.com/faq",
]
loader = WebBaseLoader(urls)
docs = loader.load()

# For JavaScript-heavy sites, use Playwright
from langchain_community.document_loaders import PlaywrightURLLoader

loader = PlaywrightURLLoader(urls=urls, remove_selectors=["nav", "footer"])
docs = loader.load()
```
Loading from Databases
Python - Database Loading
```python
from langchain_community.document_loaders import SQLDatabaseLoader
from langchain_community.utilities import SQLDatabase

# Connect to the database
db = SQLDatabase.from_uri("postgresql://user:pass@localhost/mydb")

# Load rows with a query
loader = SQLDatabaseLoader(
    query="SELECT title, content, updated_at FROM articles WHERE published = true",
    db=db,
    page_content_columns=["title", "content"],
    metadata_columns=["updated_at"]
)
docs = loader.load()
```
Loading from Collaboration Tools
Python - Notion & Slack
```python
# Notion
from langchain_community.document_loaders import NotionDBLoader

loader = NotionDBLoader(
    integration_token="secret_...",
    database_id="abc123..."
)
docs = loader.load()

# Slack
from langchain_community.document_loaders import SlackDirectoryLoader

loader = SlackDirectoryLoader(
    zip_path="slack_export.zip",
    workspace_url="https://myteam.slack.com"
)
docs = loader.load()
```
Metadata Extraction
Rich metadata improves retrieval quality. Always extract and attach metadata to your documents:
Python - Metadata Enrichment
```python
from langchain.schema import Document

# Enrich documents with metadata
enriched_docs = []
for doc in raw_docs:
    enriched = Document(
        page_content=doc.page_content,
        metadata={
            **doc.metadata,
            "source_type": "pdf",
            "department": "engineering",
            "doc_title": extract_title(doc),  # extract_title is your own helper
            "word_count": len(doc.page_content.split()),
            "last_updated": "2026-01-15"
        }
    )
    enriched_docs.append(enriched)
```
Metadata enables filtering: During retrieval, you can filter by metadata (e.g., "only search engineering docs from the last year") to improve relevance and reduce noise.
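Conceptually, a metadata filter is just a predicate applied before (or alongside) similarity search. The sketch below uses plain dicts and an illustrative `filter_docs` helper; in practice, vector stores such as Chroma or Pinecone accept an equivalent filter argument and apply it server-side:

```python
# Toy corpus: documents with attached metadata (illustrative data)
docs = [
    {"text": "Deploy guide", "meta": {"department": "engineering", "year": 2026}},
    {"text": "Q3 budget",    "meta": {"department": "finance",     "year": 2025}},
    {"text": "API design",   "meta": {"department": "engineering", "year": 2024}},
]

def filter_docs(docs, **criteria):
    """Keep only documents whose metadata matches every criterion."""
    return [d for d in docs if all(d["meta"].get(k) == v for k, v in criteria.items())]

# "Only search engineering docs" ...
eng_docs = filter_docs(docs, department="engineering")
# ... "from the last year"
recent_eng = [d for d in eng_docs if d["meta"]["year"] >= 2026]
```

Restricting the candidate pool this way both improves relevance and cuts the amount of noise the similarity search has to rank.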
Data Cleaning and Preprocessing
Raw documents often contain noise. Clean them before chunking:
Python - Text Cleaning
```python
import re

def clean_text(text: str) -> str:
    """Clean extracted text for RAG ingestion."""
    # Collapse excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    # Remove page numbers and headers/footers
    text = re.sub(r'Page \d+ of \d+', '', text)
    # Remove special characters that add no meaning
    text = re.sub(r'[^\w\s.,;:!?()\'"-]', '', text)
    # Strip non-ASCII characters (aggressive; skip if your corpus needs Unicode)
    text = text.encode('ascii', 'ignore').decode()
    return text.strip()
```
Garbage in, garbage out: Poor quality input data is the most common cause of poor RAG results. Invest time in cleaning and validating your data before indexing it.
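Two cheap validation passes catch a surprising amount of garbage before it reaches the index: dropping near-empty documents and deduplicating exact copies. A minimal sketch, using a content hash for the dedup step (the `min_words` threshold is an illustrative choice, not a standard value):

```python
import hashlib

def validate_docs(texts, min_words=5):
    """Drop near-empty documents and exact duplicates before indexing."""
    seen, kept = set(), []
    for text in texts:
        if len(text.split()) < min_words:
            continue  # too short to carry useful context
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        kept.append(text)
    return kept

raw = [
    "The deployment guide covers rollback procedures in detail.",
    "ok",                                                          # near-empty
    "The deployment guide covers rollback procedures in detail.",  # duplicate
]
clean = validate_docs(raw)
```

Run checks like these after cleaning but before chunking, so duplicates from overlapping sources (e.g. the same page exported from both Notion and Confluence) don't crowd out distinct results at retrieval time.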
What's Next?
The next lesson covers chunking strategies — how to split documents into optimal pieces for embedding and retrieval.
Lilly Tech Systems