Intermediate

Vision + LLM Applications

Combining computer vision models with large language models unlocks powerful multimodal applications — from visual question answering and product recognition to automated defect detection with natural language reports.

The Vision + Language Architecture

At its core, a vision + LLM application takes visual input (images or video), extracts structured understanding through a vision model, then feeds that understanding to a language model for reasoning, generation, or decision-making. There are two fundamental approaches:

Architecture Overview
# Approach 1: Pipeline Architecture
Image/Video
  → Vision Model (YOLO, SAM, OCR, Florence-2)
    → Structured Data (bounding boxes, labels, text)
      → LLM (reasoning, report generation)
        → Final Output

# Approach 2: Native Multimodal
Image/Video + Text Prompt
  → Multimodal LLM (GPT-4V, Claude Vision, Gemini Pro Vision)
    → Final Output (text, structured JSON, decisions)

Native Multimodal vs Pipeline Approach

Choosing between a native multimodal model and a dedicated vision + LLM pipeline depends on your accuracy requirements, latency budget, and the specificity of your vision task.

| Factor | Native Multimodal (GPT-4V, Claude Vision) | Pipeline (YOLO/SAM + LLM) |
| --- | --- | --- |
| Setup complexity | Low — single API call | High — multiple models, data marshaling |
| Latency | 1–5s per image | 0.1–2s (vision) + 0.5–3s (LLM) |
| Domain accuracy | Good general understanding | Superior for specialized tasks (defects, medical) |
| Fine-tuning | Limited or unavailable | Full control over vision model training |
| Cost per image | $0.01–$0.05 (API pricing) | $0.001–$0.01 (self-hosted) + LLM cost |
| Spatial precision | Approximate bounding descriptions | Pixel-level coordinates and masks |
| Offline capability | No (cloud API required) | Yes (vision model can run locally) |
| Best for | General VQA, captioning, document understanding | Manufacturing QC, medical imaging, real-time detection |
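The cost rows above imply a break-even volume below which the managed API is cheaper and above which self-hosting wins. A back-of-envelope sketch, using mid-range figures from the table; the $300/month GPU server cost is our own assumption, not a quoted price:

```python
def monthly_cost_api(images_per_month: int, cost_per_image: float = 0.03) -> float:
    """Managed multimodal API: pay per image, no fixed infrastructure."""
    return images_per_month * cost_per_image

def monthly_cost_pipeline(images_per_month: int,
                          cost_per_image: float = 0.005,
                          gpu_server_monthly: float = 300.0) -> float:
    """Self-hosted vision model: small per-image cost plus a fixed GPU bill."""
    return images_per_month * cost_per_image + gpu_server_monthly

def break_even_volume(api_rate: float = 0.03, pipe_rate: float = 0.005,
                      gpu_monthly: float = 300.0) -> float:
    """Monthly volume at which the pipeline's fixed GPU cost pays for itself."""
    return gpu_monthly / (api_rate - pipe_rate)

break_even_volume()  # roughly 12,000 images/month under these assumed rates
```

Below that volume, the operational simplicity of a single API call usually wins outright.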

Key Models and Their Roles

  • GPT-4V / GPT-4o: Native multimodal — accepts images directly in prompts, strong at general visual reasoning and OCR
  • Claude Vision (Claude Sonnet 4 / Opus 4): Native multimodal with strong structured output, excels at document and chart understanding
  • Gemini Pro Vision: Google's multimodal model with long-context image support and video understanding
  • YOLOv8/v9: Real-time object detection and classification, ideal for production pipelines requiring speed
  • SAM (Segment Anything Model): Zero-shot image segmentation — isolate any object with point or box prompts
  • Florence-2: Microsoft's unified vision model handling captioning, detection, segmentation, and OCR in one architecture
  • PaddleOCR / Tesseract: Dedicated OCR engines for high-accuracy text extraction from images

Use Cases

  • Visual Question Answering (VQA): Users upload an image and ask natural language questions about its content
  • Image Captioning & Alt Text: Automatically generate accessible descriptions for web images
  • Product Recognition: Identify products from photos for inventory management or e-commerce search
  • Defect Detection + Reporting: Vision model detects anomalies, LLM generates human-readable inspection reports
  • Medical Imaging + Diagnosis Assistance: Analyze X-rays or scans, then generate structured findings for radiologists
  • Chart/Graph Understanding: Extract data from visual charts and answer questions about trends

Code Example: Product Analyzer with Claude Vision

This example sends a product image to Claude's multimodal API and receives structured product information including name, category, estimated price range, and condition.

Python - Product Analyzer with Claude Vision
import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

def analyze_product(image_path: str) -> dict:
    """Analyze a product image and return structured information."""
    # Read and encode the image
    image_data = base64.standard_b64encode(
        Path(image_path).read_bytes()
    ).decode("utf-8")

    # Determine media type from extension
    suffix = Path(image_path).suffix.lower()
    media_types = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
                   ".png": "image/png", ".gif": "image/gif",
                   ".webp": "image/webp"}
    media_type = media_types.get(suffix, "image/jpeg")

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": """Analyze this product image. Return a JSON object with:
{
  "product_name": "identified product name",
  "category": "product category",
  "brand": "brand if visible, else null",
  "condition": "new/used/refurbished/unknown",
  "estimated_price_range": {"min": 0, "max": 0, "currency": "USD"},
  "key_features": ["feature1", "feature2"],
  "description": "one paragraph product description",
  "confidence": 0.0 to 1.0
}
Return ONLY valid JSON, no markdown."""
                }
            ]
        }]
    )

    return json.loads(response.content[0].text)

# Usage
result = analyze_product("product_photo.jpg")
print(f"Product: {result['product_name']}")
print(f"Category: {result['category']}")
print(f"Price: ${result['estimated_price_range']['min']}-${result['estimated_price_range']['max']}")
print(f"Features: {', '.join(result['key_features'])}")

Code Example: Security Camera Pipeline (YOLO + LLM)

This pipeline uses YOLOv8 for real-time object detection on security camera frames, classifies detected events, and uses an LLM to generate human-readable alert descriptions.

Python - Security Camera Alert Pipeline
from ultralytics import YOLO
import anthropic
import cv2
from datetime import datetime

class SecurityAlertPipeline:
    def __init__(self):
        self.yolo = YOLO("yolov8n.pt")  # Nano model for speed
        self.llm = anthropic.Anthropic()
        self.alert_classes = {"person", "car", "truck", "knife", "backpack"}

    def detect_objects(self, frame):
        """Run YOLO detection on a single frame."""
        results = self.yolo(frame, conf=0.5, verbose=False)
        detections = []
        for r in results:
            for box in r.boxes:
                cls_name = self.yolo.names[int(box.cls)]
                detections.append({
                    "class": cls_name,
                    "confidence": float(box.conf),
                    "bbox": box.xyxy[0].tolist()
                })
        return detections

    def classify_alert_level(self, detections):
        """Classify the security alert level based on detections."""
        classes_found = {d["class"] for d in detections}
        if "knife" in classes_found:
            return "HIGH"
        if "person" in classes_found and len(detections) > 3:
            return "MEDIUM"
        if classes_found & self.alert_classes:
            return "LOW"
        return "NONE"

    def generate_alert_description(self, detections, alert_level, camera_id):
        """Use LLM to generate a human-readable alert description."""
        detection_summary = "\n".join(
            f"- {d['class']} (confidence: {d['confidence']:.1%}) at position {d['bbox']}"
            for d in detections
        )

        response = self.llm.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"""Generate a brief security alert based on these detections:
Camera: {camera_id}
Time: {datetime.now().isoformat()}
Alert Level: {alert_level}
Detections:
{detection_summary}

Write a 2-3 sentence alert for security personnel. Be specific and actionable."""
            }]
        )
        return response.content[0].text

    def process_frame(self, frame, camera_id="CAM-01"):
        """Full pipeline: detect → classify → describe."""
        detections = self.detect_objects(frame)
        alert_level = self.classify_alert_level(detections)

        if alert_level == "NONE":
            return None

        description = self.generate_alert_description(
            detections, alert_level, camera_id
        )
        return {
            "camera": camera_id,
            "level": alert_level,
            "detections": detections,
            "description": description,
            "timestamp": datetime.now().isoformat()
        }

# Usage with a video stream
pipeline = SecurityAlertPipeline()
cap = cv2.VideoCapture("rtsp://camera-feed-url")

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    alert = pipeline.process_frame(frame)
    if alert:
        print(f"[{alert['level']}] {alert['description']}")

cap.release()

Code Example: Chart Understanding with Vision + LLM

Extract data and insights from charts, graphs, and infographics by combining vision understanding with LLM reasoning.

Python - Chart Analysis
import anthropic
import base64
import json

def analyze_chart(image_path: str, question: str | None = None) -> dict:
    """Extract data and insights from a chart image."""
    client = anthropic.Anthropic()

    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    prompt = """Analyze this chart/graph image. Extract:
1. Chart type (bar, line, pie, scatter, etc.)
2. Title and axis labels
3. All data points or series (as structured data)
4. Key trends and insights
5. Any anomalies or notable patterns

Return as JSON with keys: chart_type, title, axes, data_series, insights[]"""

    if question:
        prompt += f"\n\nAlso answer this specific question: {question}"

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data
                }},
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return json.loads(response.content[0].text)

# Example: Analyze a sales chart
result = analyze_chart("quarterly_sales.png", "Which quarter had the highest growth?")
for insight in result["insights"]:
    print(f"  - {insight}")

Video Analysis Pipeline

Processing video requires extracting frames, analyzing each (or sampled) frame, and then summarizing across the temporal dimension. This is essential for surveillance, content moderation, and video understanding tasks.

Python - Video Analysis with Frame Sampling
import cv2
import base64
import anthropic

def extract_key_frames(video_path: str, interval_seconds: int = 5) -> list:
    """Extract frames at regular intervals from a video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_interval = int(fps * interval_seconds)
    frames = []
    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if frame_count % frame_interval == 0:
            _, buffer = cv2.imencode(".jpg", frame)
            b64 = base64.standard_b64encode(buffer).decode("utf-8")
            frames.append({
                "timestamp": frame_count / fps,
                "image_b64": b64
            })
        frame_count += 1

    cap.release()
    return frames

def analyze_video(video_path: str, question: str) -> str:
    """Analyze a video by sampling frames and using multimodal LLM."""
    client = anthropic.Anthropic()
    frames = extract_key_frames(video_path, interval_seconds=3)

    # Build content with multiple frames
    content = []
    for frame in frames[:20]:  # Limit to 20 frames
        content.append({"type": "text", "text": f"Frame at {frame['timestamp']:.1f}s:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/jpeg", "data": frame["image_b64"]}
        })

    content.append({
        "type": "text",
        "text": f"""These are frames sampled from a video ({len(frames)} total frames).
Analyze the video content temporally and answer: {question}
Include a timeline summary of key events."""
    })

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
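The `frames[:20]` cap above silently drops the tail of long videos. An alternative is to derive the sampling interval from the clip duration so a fixed frame budget spans the whole video; `pick_interval` is a hypothetical helper of our own, not part of the pipeline above:

```python
def pick_interval(duration_seconds: float, max_frames: int = 20) -> float:
    """Sampling interval (seconds) so at most max_frames cover the video."""
    # Floor of 1s avoids near-duplicate frames on very short clips
    return max(1.0, duration_seconds / max_frames)

pick_interval(30.0)   # 1.5s between samples for a 30s clip
pick_interval(600.0)  # 30.0s between samples for a 10-minute clip
pick_interval(10.0)   # 1.0s floor for very short clips
```

The duration itself can be read before sampling as `cap.get(cv2.CAP_PROP_FRAME_COUNT) / cap.get(cv2.CAP_PROP_FPS)` and passed to `extract_key_frames` as the interval.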

Grounding and Spatial Reasoning

Grounding refers to connecting language outputs to specific regions of an image. When an LLM says “the red car on the left,” grounding ensures it can point to exact pixel coordinates. This is critical for applications like robotics, autonomous driving, and interactive image editing.

  • Coordinate grounding: Models like Florence-2 and Kosmos-2 output bounding box coordinates alongside text descriptions
  • Referring expression comprehension: Given “the person wearing a blue hat,” locate them in the image
  • Spatial relationship reasoning: Understanding “above,” “next to,” “inside” relationships between objects
  • Set-of-Mark prompting: Overlay numbered markers on image regions, then reference by number in prompts
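Once a grounding model has emitted coordinates, relations like "left of" or "above" can be computed directly from bounding-box centers. A minimal sketch, assuming `[x1, y1, x2, y2]` boxes with the image origin at the top-left (both conventions are our assumptions):

```python
def spatial_relation(box_a: list[float], box_b: list[float]) -> str:
    """Describe where box_a sits relative to box_b, using box centers."""
    ax = (box_a[0] + box_a[2]) / 2
    ay = (box_a[1] + box_a[3]) / 2
    bx = (box_b[0] + box_b[2]) / 2
    by = (box_b[1] + box_b[3]) / 2
    # Report along the axis with the larger separation
    if abs(ax - bx) >= abs(ay - by):
        return "left of" if ax < bx else "right of"
    return "above" if ay < by else "below"  # y grows downward in images

car = [0, 100, 50, 150]       # center (25, 125)
truck = [200, 100, 260, 150]  # center (230, 125)
spatial_relation(car, truck)  # → "left of"
```

Feeding such derived relations to the LLM as structured text is often more reliable than asking a multimodal model to judge positions from pixels alone.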

Vision Hallucination and Accuracy

Hallucination risk: Vision LLMs can confidently describe objects that are not present, misread text in images, or invent details. Always validate critical outputs against ground truth. For safety-critical applications (medical, autonomous driving), use dedicated vision models with known accuracy metrics rather than relying solely on multimodal LLMs.

Common hallucination patterns to watch for:

  • Object hallucination: Claiming objects exist that are not in the image (especially small or partially occluded objects)
  • Text misreading: OCR errors on stylized, rotated, or low-resolution text
  • Count errors: Incorrectly counting objects, especially when there are many similar items
  • Spatial confusion: Misidentifying left/right, relative positions, or distances
  • Confabulated details: Adding brand names, model numbers, or specifications not visible in the image
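One mitigation for object hallucination is to cross-check the multimodal model's claims against a dedicated detector and flag anything unverified for human review. A sketch, assuming the detection dicts produced by the YOLO pipeline earlier; the helper and thresholds are illustrative:

```python
def verify_claims(llm_objects: list[str], detections: list[dict],
                  min_conf: float = 0.5) -> dict:
    """Split LLM-claimed objects into detector-verified vs unverified."""
    detected = {d["class"] for d in detections if d["confidence"] >= min_conf}
    verified = [obj for obj in llm_objects if obj in detected]
    unverified = [obj for obj in llm_objects if obj not in detected]
    return {"verified": verified, "unverified": unverified}

detections = [{"class": "person", "confidence": 0.91},
              {"class": "car", "confidence": 0.42}]
verify_claims(["person", "car", "dog"], detections)
# "person" is verified; "car" (low confidence) and "dog" are flagged
```

This catches only class-level hallucinations; count errors and spatial confusion still need task-specific checks.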

Cost Comparison

| Approach | Cost per 1K Images | Latency (p50) | Infrastructure |
| --- | --- | --- | --- |
| Claude Vision API | $10–$50 | 2–4s | None (managed API) |
| GPT-4V API | $15–$65 | 3–6s | None (managed API) |
| YOLO (self-hosted) + Claude | $2–$8 + GPU cost | 0.5–2s | GPU server for YOLO |
| Florence-2 + Local LLM | GPU cost only | 1–3s | GPU server for both models |
| SAM + Claude | $5–$15 + GPU cost | 2–5s | GPU server for SAM |
💡 Optimization tip: For high-volume applications, use a lightweight vision model (YOLO, MobileNet) as a pre-filter. Only send frames with interesting detections to the expensive multimodal LLM. This can reduce API costs by 80–95% for surveillance or monitoring use cases where most frames are uninteresting.
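The pre-filter idea reduces to a simple gate: run the cheap detector on every frame and only escalate to the multimodal LLM when something from a watch-list appears. A sketch, reusing the detection dict format from the YOLO pipeline above; the watch-list and threshold are illustrative:

```python
INTERESTING = {"person", "car", "truck", "knife", "backpack"}

def should_send_to_llm(detections: list[dict], min_conf: float = 0.5) -> bool:
    """Gate: escalate only frames with a confident, interesting detection."""
    return any(d["class"] in INTERESTING and d["confidence"] >= min_conf
               for d in detections)

frames = [
    [],                                        # empty street
    [{"class": "bird", "confidence": 0.8}],    # not on the watch-list
    [{"class": "person", "confidence": 0.9}],  # escalate this one
]
escalated = [f for f in frames if should_send_to_llm(f)]
# only 1 of 3 frames reaches the expensive LLM
```

In a monitoring deployment the gate runs per frame before `generate_alert_description`, so the LLM bill scales with interesting events rather than total footage.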