Building Advanced Multi-Modal Applications

Moving from understanding multi-modal AI concepts to building production applications requires practical knowledge of APIs, architectures, and design patterns. This lesson walks through real-world application patterns and shows you how to combine multiple modalities effectively.

Application Architecture Patterns

Multi-modal applications typically follow one of these architectural patterns:

| Pattern | Description | Best For |
|---|---|---|
| Single API | Send all modalities to one multi-modal model | Simple apps, prototyping, when one model handles all inputs |
| Pipeline | Chain specialized models (e.g., Whisper → LLM → TTS) | Best-in-class quality per modality, more control over each step |
| Fan-Out | Process modalities in parallel, then merge results | Low-latency requirements, independent analysis tasks |
| Hybrid | Combine a single multi-modal API with specialized models | Production systems balancing quality, cost, and latency |
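As a rough illustration, the fan-out pattern can be sketched with Python's standard concurrency tools. The analyzer functions here are placeholders standing in for real per-modality model calls, not actual APIs:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder analyzers -- in a real system each would call a
# specialized model (vision, speech, etc.) over the network.
def analyze_image(image: bytes) -> dict:
    return {"modality": "image", "label": "screenshot"}

def analyze_audio(audio: bytes) -> dict:
    return {"modality": "audio", "transcript": "hello"}

def fan_out(image: bytes, audio: bytes) -> list[dict]:
    """Run per-modality analyzers in parallel, then merge the results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(analyze_image, image),
                   pool.submit(analyze_audio, audio)]
        # Merge step: here we simply collect; a real app might
        # reconcile or rank the per-modality findings.
        return [f.result() for f in futures]

results = fan_out(b"...", b"...")
```

Because the analyses are independent, total latency is roughly the slowest single analyzer rather than the sum of all of them.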

Common Multi-Modal Applications

  1. Intelligent Document Processing

    Extract structured data from PDFs, invoices, receipts, and forms that contain text, tables, images, and handwriting. Combine OCR with vision-language models for high accuracy.

  2. Content Moderation

    Analyze user-generated content across text, images, audio, and video to detect policy violations, harmful content, and misinformation. Multi-modal analysis catches content that single-modality systems miss.

  3. Accessibility Tools

    Build tools that describe images for screen readers, generate captions for video, transcribe audio in real-time, and translate between modalities to improve accessibility.

  4. Customer Support Bots

    Handle customer queries that include screenshots of error messages, photos of damaged products, voice messages, and text — all in a unified conversation.
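For a moderation-style application, the merge step often reduces per-modality risk scores to a single verdict. A minimal sketch, assuming upstream models have already produced scores in [0, 1] (the scoring itself is out of scope here):

```python
def moderate(signals: dict[str, float], threshold: float = 0.8) -> dict:
    """Merge per-modality risk scores into a single verdict.

    `signals` maps modality name -> risk score in [0, 1]. Flag the
    content if any single modality exceeds the threshold, since harmful
    content may be visible in only one channel.
    """
    worst = max(signals, key=signals.get)
    flagged = signals[worst] >= threshold
    return {"flagged": flagged, "worst_modality": worst, "score": signals[worst]}

# A borderline text score plus a high image score -> flagged via the image.
verdict = moderate({"text": 0.2, "image": 0.91, "audio": 0.1})
```

Taking the worst-case modality (rather than averaging) is what lets multi-modal analysis catch content that a single-modality system would miss.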

Example: Document Analysis Pipeline

import anthropic
import base64
import mimetypes

def analyze_document(file_path: str, questions: list[str]) -> dict:
    """Analyze a document image and answer questions about it."""
    client = anthropic.Anthropic()

    # Read the image and base64-encode it for the API payload.
    with open(file_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    # Build a numbered prompt from the caller's questions.
    prompt = "Analyze this document thoroughly.\n\n"
    for i, q in enumerate(questions, 1):
        prompt += f"{i}. {q}\n"

    # Infer the media type from the file extension; default to PNG.
    media_type = mimetypes.guess_type(file_path)[0] or "image/png"

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                # Image block first, then the text prompt that refers to it.
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": image_data
                }},
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return {"analysis": message.content[0].text}

# Usage
result = analyze_document(
    "invoice.png",
    ["What is the total amount?",
     "What is the invoice date?",
     "List all line items."]
)
Cost Optimization: Multi-modal API calls are significantly more expensive than text-only calls. Resize images to the minimum resolution that preserves the information you need, cache results for repeated analyses, and consider whether a text extraction step followed by a text-only LLM call might be sufficient for your use case.
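The caching suggestion above can be sketched with the standard library alone: key the cache on a hash of the file contents, so repeated analyses of an unchanged document skip the expensive API round trip. The `fake_analyze` callable below is a stand-in for a real API call such as `analyze_document`:

```python
import hashlib
import tempfile

_cache: dict[str, dict] = {}

def analyze_cached(file_path: str, analyze) -> dict:
    """Return a cached result keyed by file content hash, calling
    `analyze` (the expensive step) only on a cache miss."""
    with open(file_path, "rb") as f:
        key = hashlib.sha256(f.read()).hexdigest()
    if key not in _cache:
        _cache[key] = analyze(file_path)
    return _cache[key]

# Demo: count how often the "expensive" analyzer actually runs.
calls = 0
def fake_analyze(path: str) -> dict:
    global calls
    calls += 1
    return {"analysis": "ok"}

with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as tmp:
    tmp.write(b"fake image bytes")
    path = tmp.name

first = analyze_cached(path, fake_analyze)
second = analyze_cached(path, fake_analyze)  # served from the cache
```

Hashing the content (rather than the path) also means the cache invalidates itself automatically when the file changes.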
Error Handling: Multi-modal applications have more failure modes than text-only systems. Images may be corrupt, audio may be too noisy, or video may be in an unsupported format. Build robust error handling and fallback strategies into your pipeline.
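One common fallback strategy is to catch per-input failures and degrade gracefully instead of failing the whole request. A minimal sketch, where `full_analysis` and the specific error types are hypothetical stand-ins for your real pipeline:

```python
def full_analysis(file_path: str) -> dict:
    """Stand-in for the real multi-modal pipeline; raises ValueError
    for unsupported formats, as a decoder might."""
    if not file_path.endswith(".png"):
        raise ValueError("unsupported format")
    return {"status": "ok"}

def safe_analyze(file_path: str) -> dict:
    """Wrap the pipeline so a bad input degrades rather than crashes."""
    try:
        return full_analysis(file_path)
    except FileNotFoundError:
        return {"status": "error", "detail": "file missing"}
    except ValueError:
        # e.g. corrupt image or unsupported container: fall back to a
        # cheaper, more tolerant path instead of failing the request.
        return {"status": "degraded", "detail": "fell back to text-only"}

ok = safe_analyze("scan.png")
degraded = safe_analyze("clip.mov")
```

The key design choice is to make each fallback return the same shape as the success path, so downstream code never has to special-case failures.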

Next: Best Practices

In the final lesson, you will learn production deployment strategies, evaluation metrics, and emerging trends in multi-modal AI.
