Intermediate

Document Summarization

Generate accurate summaries and extract key information from long documents using extractive, abstractive, and LLM-based approaches.

Summarization Approaches

Approach    | Method                  | Pros                       | Cons
----------- | ----------------------- | -------------------------- | ------------------------------
Extractive  | Select key sentences    | Faithful, no hallucination | May lack coherence
Abstractive | Generate new text       | Natural, concise           | Risk of hallucination
Hybrid      | Extract then refine     | Balanced accuracy/fluency  | More complex pipeline
Map-Reduce  | Chunk, summarize, merge | Handles long documents     | May lose cross-section context
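
To make the extractive row concrete, here is a minimal sketch of frequency-based sentence selection using only the standard library. The function names, the tiny stopword list, and the scoring rule (average word frequency per sentence) are illustrative assumptions, not a reference implementation; production systems typically use a proper sentence tokenizer and a stronger ranking method.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption); real pipelines use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "was", "were", "it", "that", "this", "for", "on", "with", "as", "by"}

def split_sentences(text):
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]
```

Because every returned sentence is copied verbatim from the input, the output cannot contain invented facts, which is the "faithful" property in the table. The trade-off is also visible: selected sentences may not read smoothly when placed side by side.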

LLM-Based Summarization

Python — Document Summarization Pipeline
from openai import OpenAI

client = OpenAI()

def summarize_document(text, summary_type="executive"):
    """Generate a document summary using an LLM."""
    prompts = {
        "executive": "Write a 3-5 sentence executive summary focusing on key decisions, findings, and action items.",
        "technical": "Write a technical summary covering methodology, data, results, and limitations.",
        "bullet": "Extract the 5-10 most important points as bullet points.",
        "legal": "Summarize key obligations, rights, dates, and risk areas."
    }

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"You are a document summarization expert. {prompts[summary_type]} Only include information explicitly stated in the document."
        }, {
            "role": "user",
            "content": f"Summarize this document:\n\n{text}"
        }],
        temperature=0.3  # Lower temperature reduces sampling randomness; it does not guarantee accuracy
    )
    return response.choices[0].message.content

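def summarize_long_document(text, chunk_size=3000):
    """Map-reduce summarization for long documents.

    chunk_size is measured in characters, not tokens.
    """
    # Fixed-width slices are simple but can split a sentence or word across chunks
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]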

    # Map: summarize each chunk
    chunk_summaries = [summarize_document(chunk, "bullet") for chunk in chunks]

    # Reduce: synthesize chunk summaries into final summary
    combined = "\n\n".join(chunk_summaries)
    return summarize_document(combined, "executive")
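
The fixed-width slicing in summarize_long_document can cut a sentence in half, and the table above notes that map-reduce may lose cross-section context. One possible mitigation, sketched here under stated assumptions, is to build chunks from whole sentences and repeat the last sentence or two at the start of the next chunk. The helper name chunk_by_sentences and its parameters are hypothetical, and the regex splitter is deliberately naive.

```python
import re

def chunk_by_sentences(text, max_chars=3000, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_chars characters.

    The last `overlap_sentences` sentences of each chunk are repeated at the
    start of the next one, so a chunk can slightly exceed max_chars.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward to keep some cross-chunk context.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The resulting list could replace the character slices in the map step. Overlap increases the total text sent to the model, so the setting is a cost versus context trade-off rather than a free improvement.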

Key Information Extraction

  • Named entities: People, organizations, dates, monetary amounts, and locations mentioned in the document.
  • Key dates: Deadlines, effective dates, expiration dates, and milestones.
  • Action items: Tasks, responsibilities, and commitments extracted from meeting notes and reports.
  • Financial figures: Revenue, costs, projections, and financial metrics from business documents.
  • Risk factors: Identified risks, warnings, and compliance concerns from legal and regulatory documents.
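
As a rough illustration of the pattern-friendly items in this list (key dates and financial figures), the sketch below pulls date strings and dollar amounts with regular expressions. The function name and both patterns are assumptions chosen for the example: they cover only ISO dates, "Month D, YYYY" dates, and US-dollar amounts. Named entities, action items, and risk factors generally need an NER model or an LLM prompt rather than regexes.

```python
import re

MONTHS = (r"(?:January|February|March|April|May|June|July|"
          r"August|September|October|November|December)")

# Matches 2024-03-01 or March 1, 2026 (other date formats are not covered).
DATE_PATTERN = re.compile(rf"\b(?:\d{{4}}-\d{{2}}-\d{{2}}|{MONTHS} \d{{1,2}}, \d{{4}})\b")

# Matches $12,500.00 or $2 million (other currencies are not covered).
MONEY_PATTERN = re.compile(
    r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:thousand|million|billion))?"
)

def extract_key_info(text):
    """Collect date and dollar-amount strings in order of appearance."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "amounts": MONEY_PATTERN.findall(text),
    }
```

The extracted strings are returned unparsed; converting them to date or decimal values is left out to keep the sketch short.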

Verify critical summaries: LLM summaries can hallucinate details not in the source document. For legal, financial, or medical documents, always implement a verification step comparing summary claims against the source text.
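
One narrow form of such a verification step is to check that every number in the summary also appears in the source. The sketch below does this with exact string matching; the function name and pattern are hypothetical, and the check is an assumption about what a first-pass filter might look like, not a complete fact-checking method. It will miss non-numeric errors and can raise false alarms when the summary reformats a figure (for example "12 percent" versus "12%"), so flagged items are best treated as prompts for human review.

```python
import re

# Integers, comma-grouped numbers, decimals, and an optional trailing percent sign.
NUMBER_PATTERN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")

def unsupported_numbers(summary, source):
    """Return numeric tokens from the summary that never appear in the source."""
    source_numbers = set(NUMBER_PATTERN.findall(source))
    return [n for n in NUMBER_PATTERN.findall(summary) if n not in source_numbers]
```

An empty result means only that no numeric mismatch was found, not that the summary is correct.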