Beginner

Project Setup

In this first step, you will set up the project structure, install all dependencies, and verify that PyMuPDF, OpenAI Vision, and FastAPI are working together. By the end of this lesson, you will have a running server ready to accept document uploads.

Architecture Overview

The Document Intelligence App has four main processing stages that documents flow through:

  • Upload Handler: Accepts PDF, image, and document files via drag-and-drop or API endpoint.
  • Text Extraction: PyMuPDF extracts text, tables, and layout from digital PDFs; tabula-py handles complex table structures.
  • Vision Analysis: GPT-4 Vision processes scanned documents, handwritten notes, and complex visual layouts.
  • Structured Output: Pydantic models validate and structure extracted data into clean JSON for downstream systems.
Document Upload
    |
    v
[File Type Detection]
    |
    +--- Digital PDF ---> [PyMuPDF] ---> Text + Tables
    |                                        |
    +--- Scanned/Image -> [GPT-4 Vision] -> Visual Understanding
    |                                        |
    +------------------------------------+---+
                                         |
                                         v
                                  [Pydantic Validation]
                                         |
                                         v
                                  [Structured JSON Output]
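The routing decision in the middle of this diagram boils down to one function. Here is a minimal sketch; `route_document` is a hypothetical helper (not part of the app yet), and in the real pipeline `has_text_layer` would come from a quick PyMuPDF probe of the first few pages:

```python
def route_document(ext: str, has_text_layer: bool) -> str:
    """Pick an extraction path for a file (sketch of the flow above)."""
    if ext == "pdf" and has_text_layer:
        return "pymupdf"  # digital PDF: direct text + table extraction
    if ext in {"pdf", "png", "jpg", "jpeg", "tiff", "bmp"}:
        return "vision"   # scanned PDF or image: GPT-4 Vision
    raise ValueError(f"Unsupported file type: {ext}")


print(route_document("pdf", True))    # pymupdf
print(route_document("png", False))   # vision
```

A scanned PDF has no extractable text layer, which is why the same `.pdf` extension can land on either branch.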

Step 1: Create the Project Structure

Create the following directory structure:

doc-intelligence/
+-- .env
+-- .env.example
+-- requirements.txt
+-- app/
|   +-- __init__.py
|   +-- main.py              # FastAPI entry point
|   +-- config.py             # Environment config
|   +-- extraction/
|   |   +-- pdf_extractor.py    # PyMuPDF text extraction
|   |   +-- table_extractor.py  # tabula table extraction
|   |   +-- layout_analyzer.py  # Page layout analysis
|   +-- vision/
|   |   +-- vision_analyzer.py  # GPT-4 Vision integration
|   +-- structuring/
|   |   +-- schemas.py        # Pydantic extraction schemas
|   |   +-- extractor.py      # Field extraction logic
|   +-- pipeline/
|   |   +-- processor.py      # Document processing pipeline
|   |   +-- queue.py          # Async job queue
|   +-- models/
|       +-- document.py       # Document data models
+-- frontend/
|   +-- index.html            # Upload and review UI
+-- uploads/
+-- results/
+-- tests/
    +-- test_extraction.py

Run these commands to create the structure:

# Create project directory
mkdir -p doc-intelligence/{app/{extraction,vision,structuring,pipeline,models},frontend,uploads,results,tests}

# Create __init__.py files
touch doc-intelligence/app/__init__.py
touch doc-intelligence/app/extraction/__init__.py
touch doc-intelligence/app/vision/__init__.py
touch doc-intelligence/app/structuring/__init__.py
touch doc-intelligence/app/pipeline/__init__.py
touch doc-intelligence/app/models/__init__.py

Step 2: Define Dependencies

Create requirements.txt with all the packages we need:

# requirements.txt
fastapi==0.115.6
uvicorn[standard]==0.34.0
python-dotenv==1.0.1
pydantic-settings==2.7.1
python-multipart==0.0.20

# PDF Processing
PyMuPDF==1.25.1
tabula-py==2.9.3              # requires a Java runtime (wraps tabula-java)
Pillow==11.1.0

# AI / Vision
openai==1.58.1

# Async processing
aiofiles==24.1.0
celery[redis]==5.4.0

# Utilities
httpx==0.28.1
python-magic==0.4.27          # requires the libmagic system library
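After installing (Step 5 walks through the virtual environment), a quick script can confirm the key packages are importable without actually loading them. Note that PyMuPDF installs under the module name `fitz`:

```python
# Report which of the key packages are importable in this environment.
import importlib.util

packages = {
    "fastapi": "fastapi",
    "PyMuPDF": "fitz",      # PyMuPDF's import name is 'fitz'
    "tabula-py": "tabula",
    "openai": "openai",
    "Pillow": "PIL",
}

for name, module in packages.items():
    found = importlib.util.find_spec(module) is not None
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

`find_spec` only looks the module up; it never imports it, so this check is safe to run even in a broken environment.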

Step 3: Environment Configuration

Create .env.example and then copy it to .env:

# .env.example
OPENAI_API_KEY=sk-your-key-here
OPENAI_VISION_MODEL=gpt-4o
OPENAI_CHAT_MODEL=gpt-4o-mini

UPLOAD_DIR=uploads
RESULTS_DIR=results
MAX_FILE_SIZE_MB=50
ALLOWED_EXTENSIONS=pdf,png,jpg,jpeg,tiff,bmp

LOG_LEVEL=INFO

Now create the config module that loads these values with validation:

# app/config.py
from functools import lru_cache

from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Application settings loaded from environment variables."""

    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    # OpenAI
    openai_api_key: str
    openai_vision_model: str = "gpt-4o"
    openai_chat_model: str = "gpt-4o-mini"

    # File handling
    upload_dir: str = "uploads"
    results_dir: str = "results"
    max_file_size_mb: int = 50
    allowed_extensions: str = "pdf,png,jpg,jpeg,tiff,bmp"

    # Logging
    log_level: str = "INFO"

    @property
    def allowed_ext_list(self) -> list[str]:
        return [ext.strip() for ext in self.allowed_extensions.split(",")]

    @property
    def max_file_size_bytes(self) -> int:
        return self.max_file_size_mb * 1024 * 1024


@lru_cache()
def get_settings() -> Settings:
    return Settings()
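The @lru_cache() on get_settings() means the .env file is read once and every caller shares the same Settings object: a cheap singleton. The pattern in isolation, with a plain dict standing in for Settings():

```python
from functools import lru_cache


@lru_cache()
def get_settings():
    # Stand-in for Settings(); the expensive part (reading .env and
    # validating) runs only on the first call.
    return {"loaded": True}


a = get_settings()
b = get_settings()
print(a is b)  # True: both callers get the one cached instance
```

This also makes the settings easy to override in tests: call `get_settings.cache_clear()` after patching environment variables.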

Step 4: Create the FastAPI Entry Point

Create app/main.py with file upload support:

# app/main.py
import logging
from pathlib import Path

from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse

from app.config import get_settings

settings = get_settings()

# Create directories
Path(settings.upload_dir).mkdir(parents=True, exist_ok=True)
Path(settings.results_dir).mkdir(parents=True, exist_ok=True)

logging.basicConfig(
    level=getattr(logging, settings.log_level),
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

app = FastAPI(
    title="Document Intelligence API",
    description="AI-powered document parsing and data extraction",
    version="1.0.0",
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

app.mount("/static", StaticFiles(directory="frontend"), name="static")


@app.get("/")
async def root():
    return FileResponse("frontend/index.html")


@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "vision_model": settings.openai_vision_model,
        "max_file_size_mb": settings.max_file_size_mb,
        "allowed_extensions": settings.allowed_ext_list,
    }


@app.post("/api/upload")
async def upload_document(file: UploadFile = File(...)):
    """Upload a document for processing."""
    ext = file.filename.rsplit(".", 1)[-1].lower() if file.filename else ""
    if ext not in settings.allowed_ext_list:
        raise HTTPException(
            status_code=400,
            detail=f"File type .{ext} not allowed. Allowed: {settings.allowed_ext_list}",
        )

    content = await file.read()
    if len(content) > settings.max_file_size_bytes:
        raise HTTPException(status_code=400, detail=f"File too large. Max: {settings.max_file_size_mb}MB")

    # Keep only the basename so a crafted filename can't escape the uploads dir
    file_path = Path(settings.upload_dir) / Path(file.filename).name
    with open(file_path, "wb") as f:
        f.write(content)

    logger.info(f"Uploaded: {file.filename} ({len(content)} bytes)")
    return {
        "filename": file.filename,
        "size_bytes": len(content),
        "status": "uploaded",
        "message": "File uploaded. Use /api/process to extract data.",
    }

Step 5: Verify the Setup

# Create virtual environment and install
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Copy env and add your API key
cp .env.example .env

# Start the server
uvicorn app.main:app --reload --port 8000

# Test health endpoint
curl http://localhost:8000/health

# Test file upload
curl -X POST http://localhost:8000/api/upload -F "file=@sample.pdf"
📝 Checkpoint: Your FastAPI server should be running on port 8000. The health endpoint should return your configuration, and file upload should save files to the uploads directory.

Key Takeaways

  • The project separates concerns into extraction, vision, structuring, and pipeline packages.
  • PyMuPDF handles digital PDFs, GPT-4 Vision handles scanned/visual documents, Pydantic validates output.
  • File upload validation prevents oversized files and unsupported formats from entering the pipeline.
  • The configuration module centralizes all settings and validates them at startup.

What's Next

In the next lesson, you will build the PDF text and table extraction module — the code that reads PDF files, extracts text with layout awareness, and pulls structured data from tables using PyMuPDF and tabula.