Intermediate
Document Summarization
Generate accurate summaries and extract key information from long documents using extractive, abstractive, and LLM-based approaches.
Summarization Approaches
| Approach | Method | Pros | Cons |
|---|---|---|---|
| Extractive | Select key sentences | Faithful, no hallucination | May lack coherence |
| Abstractive | Generate new text | Natural, concise | Risk of hallucination |
| Hybrid | Extract then refine | Balanced accuracy/fluency | More complex pipeline |
| Map-Reduce | Chunk, summarize, merge | Handles long documents | May lose cross-section context |
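To make the extractive row concrete, here is a minimal sketch of frequency-based sentence selection using only the standard library. The names `split_sentences` and `extractive_summary`, the small stopword list, and the average-frequency scoring rule are illustrative assumptions, not a method this guide prescribes.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real systems use larger lists)
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "was", "were", "it", "this", "that", "for", "on", "with", "as", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower())
                  if w not in STOPWORDS]
        # Average word frequency, so long sentences are not favored automatically
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # Restore document order
    return [sentences[i] for i in keep]
```

Because every returned sentence is copied verbatim from the input, the output cannot contain invented facts, which is the "faithful" property in the table. The cost is that the selected sentences may not read smoothly together.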
LLM-Based Summarization
Python — Document Summarization Pipeline
from openai import OpenAI

client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable


def summarize_document(text, summary_type="executive"):
    """Generate a document summary using an LLM."""
    prompts = {
        "executive": "Write a 3-5 sentence executive summary focusing on key decisions, findings, and action items.",
        "technical": "Write a technical summary covering methodology, data, results, and limitations.",
        "bullet": "Extract the 5-10 most important points as bullet points.",
        "legal": "Summarize key obligations, rights, dates, and risk areas.",
    }
    if summary_type not in prompts:
        raise ValueError(f"Unknown summary_type: {summary_type!r}")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"You are a document summarization expert. {prompts[summary_type]} Only include information explicitly stated in the document."
        }, {
            "role": "user",
            "content": f"Summarize this document:\n\n{text}"
        }],
        temperature=0.3,  # Lower temperature gives more deterministic, less inventive output
    )
    return response.choices[0].message.content


def summarize_long_document(text, chunk_size=3000):
    """Map-reduce summarization for long documents."""
    # Short documents fit in one call, so skip the map step
    if len(text) <= chunk_size:
        return summarize_document(text, "executive")
    # Split into fixed-size character chunks (may cut a sentence in half)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Map: summarize each chunk
    chunk_summaries = [summarize_document(chunk, "bullet") for chunk in chunks]
    # Reduce: synthesize chunk summaries into final summary
    combined = "\n\n".join(chunk_summaries)
    return summarize_document(combined, "executive")
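Slicing by a fixed character count can split a sentence across two chunks, which adds to the cross-section context loss noted for map-reduce. One possible mitigation, sketched below, packs whole sentences into each chunk and repeats the last sentence or two at the start of the next chunk. The helper `chunk_by_sentences` and its `max_chars` and `overlap` parameters are hypothetical names introduced for this example.

```python
import re


def chunk_by_sentences(text, max_chars=3000, overlap=1):
    """Pack whole sentences into chunks of up to max_chars characters.

    The last `overlap` sentences of a chunk are repeated at the start of the
    next one when they fit. A single sentence longer than max_chars becomes
    its own oversized chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as shared context
            current = current[-overlap:] if overlap > 0 else []
            if len(" ".join(current + [sentence])) > max_chars:
                current = []  # Overlap does not fit; start the chunk clean
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In `summarize_long_document`, the list comprehension that builds `chunks` could be swapped for `chunk_by_sentences(text, max_chars=chunk_size)`. The overlap costs some extra tokens per chunk in exchange for continuity between neighboring summaries.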
Key Information Extraction
- Named entities: People, organizations, dates, monetary amounts, and locations mentioned in the document.
- Key dates: Deadlines, effective dates, expiration dates, and milestones.
- Action items: Tasks, responsibilities, and commitments extracted from meeting notes and reports.
- Financial figures: Revenue, costs, projections, and financial metrics from business documents.
- Risk factors: Identified risks, warnings, and compliance concerns from legal and regulatory documents.
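Some of these categories follow regular surface patterns and can be pulled out without a model. The sketch below covers only two of them, ISO-format dates and US-dollar amounts, using regular expressions. The function name `extract_key_info` and both patterns are assumptions made for illustration; people, organizations, action items, and risk factors generally need a named-entity model or an LLM instead.

```python
import re

# Illustrative patterns only: ISO dates (YYYY-MM-DD) and US-dollar amounts
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
MONEY_PATTERN = re.compile(
    r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:thousand|million|billion))?"
)


def extract_key_info(text):
    """Return the dates and monetary amounts found in text, in document order."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "amounts": MONEY_PATTERN.findall(text),
    }
```

Dates written in other styles (for example "March 1, 2024") and other currencies are not matched by these patterns and would need additional rules.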
Verify critical summaries: LLM summaries can hallucinate details not in the source document. For legal, financial, or medical documents, always implement a verification step comparing summary claims against the source text.
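A verification step can start with cheap lexical checks before any human or model-based review. The sketch below flags numbers that appear in the summary but not in the source, and summary sentences whose words are mostly missing from the source. The function names, the number pattern, and the 0.6 threshold are assumptions chosen for illustration; checks like these catch surface mismatches only and can miss paraphrased or logically wrong claims.

```python
import re

NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


def unsupported_numbers(summary, source):
    """Return numbers that appear in the summary but nowhere in the source."""
    source_numbers = set(NUMBER_PATTERN.findall(source))
    return [n for n in NUMBER_PATTERN.findall(summary) if n not in source_numbers]


def low_overlap_sentences(summary, source, threshold=0.6):
    """Return summary sentences whose words are mostly absent from the source."""
    source_words = set(re.findall(r"[a-z0-9']+", source.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        if not words:
            continue
        # Fraction of the sentence's words that also occur in the source
        overlap = sum(w in source_words for w in words) / len(words)
        if overlap < threshold:
            flagged.append(sentence)
    return flagged
```

Anything either function returns is a candidate for manual review against the source text, not proof of a hallucination.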
Lilly Tech Systems