Vision + LLM Applications
Combining computer vision models with large language models unlocks powerful multimodal applications — from visual question answering and product recognition to automated defect detection with natural language reports.
The Vision + Language Architecture
At its core, a vision + LLM application takes visual input (images or video), extracts structured understanding through a vision model, then feeds that understanding to a language model for reasoning, generation, or decision-making. There are two fundamental approaches:
```
# Approach 1: Pipeline Architecture
Image/Video
    → Vision Model (YOLO, SAM, OCR, Florence-2)
    → Structured Data (bounding boxes, labels, text)
    → LLM (reasoning, report generation)
    → Final Output

# Approach 2: Native Multimodal
Image/Video + Text Prompt
    → Multimodal LLM (GPT-4V, Claude Vision, Gemini Pro Vision)
    → Final Output (text, structured JSON, decisions)
```
Native Multimodal vs Pipeline Approach
Choosing between a native multimodal model and a dedicated vision + LLM pipeline depends on your accuracy requirements, latency budget, and the specificity of your vision task.
| Factor | Native Multimodal (GPT-4V, Claude Vision) | Pipeline (YOLO/SAM + LLM) |
|---|---|---|
| Setup complexity | Low — single API call | High — multiple models, data marshaling |
| Latency | 1–5s per image | 0.1–2s (vision) + 0.5–3s (LLM) |
| Domain accuracy | Good general understanding | Superior for specialized tasks (defects, medical) |
| Fine-tuning | Limited or unavailable | Full control over vision model training |
| Cost per image | $0.01–$0.05 (API pricing) | $0.001–$0.01 (self-hosted) + LLM cost |
| Spatial precision | Approximate bounding descriptions | Pixel-level coordinates and masks |
| Offline capability | No (cloud API required) | Yes (vision model can run locally) |
| Best for | General VQA, captioning, document understanding | Manufacturing QC, medical imaging, real-time detection |
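The trade-offs in the table can be encoded as a simple routing rule. The sketch below is illustrative: the `choose_approach` helper and its thresholds are assumptions, not a prescribed decision procedure.

```python
# Minimal sketch: route a request to a native multimodal model vs. a
# vision + LLM pipeline based on the trade-offs in the table above.
# The thresholds here are illustrative, not hard rules.

def choose_approach(needs_pixel_coords: bool,
                    needs_offline: bool,
                    latency_budget_s: float,
                    specialized_domain: bool) -> str:
    """Return 'pipeline' or 'native' for a given set of requirements."""
    # Pixel-level localization and offline operation rule out cloud-only
    # multimodal APIs.
    if needs_pixel_coords or needs_offline:
        return "pipeline"
    # Specialized domains (defects, medical) favor a fine-tuned vision model.
    if specialized_domain:
        return "pipeline"
    # Tight latency budgets favor a small local detector over a 1-5s API call.
    if latency_budget_s < 1.0:
        return "pipeline"
    # Otherwise a single multimodal API call is the simplest option.
    return "native"

print(choose_approach(needs_pixel_coords=False, needs_offline=False,
                      latency_budget_s=5.0, specialized_domain=False))  # native
print(choose_approach(needs_pixel_coords=True, needs_offline=False,
                      latency_budget_s=5.0, specialized_domain=False))  # pipeline
```

In practice many teams prototype with the native approach and move hot paths to a pipeline once requirements harden.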
Key Models and Their Roles
- GPT-4V / GPT-4o: Native multimodal — accepts images directly in prompts, strong at general visual reasoning and OCR
- Claude Vision (Claude Sonnet 4 / Opus 4): Native multimodal with strong structured output, excels at document and chart understanding
- Gemini Pro Vision: Google's multimodal model with long-context image support and video understanding
- YOLOv8/v9: Real-time object detection and classification, ideal for production pipelines requiring speed
- SAM (Segment Anything Model): Zero-shot image segmentation — isolate any object with point or box prompts
- Florence-2: Microsoft's unified vision model handling captioning, detection, segmentation, and OCR in one architecture
- PaddleOCR / Tesseract: Dedicated OCR engines for high-accuracy text extraction from images
Use Cases
- Visual Question Answering (VQA): Users upload an image and ask natural language questions about its content
- Image Captioning & Alt Text: Automatically generate accessible descriptions for web images
- Product Recognition: Identify products from photos for inventory management or e-commerce search
- Defect Detection + Reporting: Vision model detects anomalies, LLM generates human-readable inspection reports
- Medical Imaging + Diagnosis Assistance: Analyze X-rays or scans, then generate structured findings for radiologists
- Chart/Graph Understanding: Extract data from visual charts and answer questions about trends
Code Example: Product Analyzer with Claude Vision
This example sends a product image to Claude's multimodal API and receives structured product information including name, category, estimated price range, and condition.
```python
import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

def analyze_product(image_path: str) -> dict:
    """Analyze a product image and return structured information."""
    # Read and encode the image
    image_data = base64.standard_b64encode(
        Path(image_path).read_bytes()
    ).decode("utf-8")

    # Determine media type from extension
    suffix = Path(image_path).suffix.lower()
    media_types = {".jpg": "image/jpeg", ".png": "image/png",
                   ".gif": "image/gif", ".webp": "image/webp"}
    media_type = media_types.get(suffix, "image/jpeg")

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": """Analyze this product image. Return a JSON object with:
{
  "product_name": "identified product name",
  "category": "product category",
  "brand": "brand if visible, else null",
  "condition": "new/used/refurbished/unknown",
  "estimated_price_range": {"min": 0, "max": 0, "currency": "USD"},
  "key_features": ["feature1", "feature2"],
  "description": "one paragraph product description",
  "confidence": 0.0 to 1.0
}
Return ONLY valid JSON, no markdown."""
                }
            ]
        }]
    )
    return json.loads(response.content[0].text)

# Usage
result = analyze_product("product_photo.jpg")
print(f"Product: {result['product_name']}")
print(f"Category: {result['category']}")
print(f"Price: ${result['estimated_price_range']['min']}-${result['estimated_price_range']['max']}")
print(f"Features: {', '.join(result['key_features'])}")
```
Code Example: Security Camera Pipeline (YOLO + LLM)
This pipeline uses YOLOv8 for real-time object detection on security camera frames, classifies detected events, and uses an LLM to generate human-readable alert descriptions.
```python
from ultralytics import YOLO
import anthropic
import cv2
from datetime import datetime

class SecurityAlertPipeline:
    def __init__(self):
        self.yolo = YOLO("yolov8n.pt")  # Nano model for speed
        self.llm = anthropic.Anthropic()
        self.alert_classes = {"person", "car", "truck", "knife", "backpack"}

    def detect_objects(self, frame):
        """Run YOLO detection on a single frame."""
        results = self.yolo(frame, conf=0.5, verbose=False)
        detections = []
        for r in results:
            for box in r.boxes:
                cls_name = self.yolo.names[int(box.cls)]
                detections.append({
                    "class": cls_name,
                    "confidence": float(box.conf),
                    "bbox": box.xyxy[0].tolist()
                })
        return detections

    def classify_alert_level(self, detections):
        """Classify the security alert level based on detections."""
        classes_found = {d["class"] for d in detections}
        if classes_found & {"knife"}:
            return "HIGH"
        if "person" in classes_found and len(detections) > 3:
            return "MEDIUM"
        if classes_found & self.alert_classes:
            return "LOW"
        return "NONE"

    def generate_alert_description(self, detections, alert_level, camera_id):
        """Use LLM to generate a human-readable alert description."""
        detection_summary = "\n".join(
            f"- {d['class']} (confidence: {d['confidence']:.1%}) at position {d['bbox']}"
            for d in detections
        )
        response = self.llm.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"""Generate a brief security alert based on these detections:
Camera: {camera_id}
Time: {datetime.now().isoformat()}
Alert Level: {alert_level}
Detections:
{detection_summary}

Write a 2-3 sentence alert for security personnel. Be specific and actionable."""
            }]
        )
        return response.content[0].text

    def process_frame(self, frame, camera_id="CAM-01"):
        """Full pipeline: detect → classify → describe."""
        detections = self.detect_objects(frame)
        alert_level = self.classify_alert_level(detections)
        if alert_level == "NONE":
            return None
        description = self.generate_alert_description(
            detections, alert_level, camera_id
        )
        return {
            "camera": camera_id,
            "level": alert_level,
            "detections": detections,
            "description": description,
            "timestamp": datetime.now().isoformat()
        }

# Usage with a video stream
pipeline = SecurityAlertPipeline()
cap = cv2.VideoCapture("rtsp://camera-feed-url")
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    alert = pipeline.process_frame(frame)
    if alert:
        print(f"[{alert['level']}] {alert['description']}")
```
Code Example: Chart Understanding with Vision + LLM
Extract data and insights from charts, graphs, and infographics by combining vision understanding with LLM reasoning.
```python
import anthropic
import base64
import json

def analyze_chart(image_path: str, question: str = None) -> dict:
    """Extract data and insights from a chart image."""
    client = anthropic.Anthropic()
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    prompt = """Analyze this chart/graph image. Extract:
1. Chart type (bar, line, pie, scatter, etc.)
2. Title and axis labels
3. All data points or series (as structured data)
4. Key trends and insights
5. Any anomalies or notable patterns

Return as JSON with keys: chart_type, title, axes, data_series, insights[]"""
    if question:
        prompt += f"\n\nAlso answer this specific question: {question}"

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data
                }},
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return json.loads(response.content[0].text)

# Example: Analyze a sales chart
result = analyze_chart("quarterly_sales.png", "Which quarter had the highest growth?")
for insight in result["insights"]:
    print(f" - {insight}")
```
Video Analysis Pipeline
Processing video requires extracting frames, analyzing each (or sampled) frame, and then summarizing across the temporal dimension. This is essential for surveillance, content moderation, and video understanding tasks.
```python
import cv2
import base64
import anthropic

def extract_key_frames(video_path: str, interval_seconds: int = 5) -> list:
    """Extract frames at regular intervals from a video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    # Guard against streams that report fps = 0
    frame_interval = max(1, int(fps * interval_seconds))
    frames = []
    frame_count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if frame_count % frame_interval == 0:
            _, buffer = cv2.imencode(".jpg", frame)
            b64 = base64.standard_b64encode(buffer).decode("utf-8")
            frames.append({
                "timestamp": frame_count / fps,
                "image_b64": b64
            })
        frame_count += 1
    cap.release()
    return frames

def analyze_video(video_path: str, question: str) -> str:
    """Analyze a video by sampling frames and using a multimodal LLM."""
    client = anthropic.Anthropic()
    frames = extract_key_frames(video_path, interval_seconds=3)

    # Build content with multiple frames, each preceded by its timestamp
    content = []
    for frame in frames[:20]:  # Limit to 20 frames
        content.append({"type": "text",
                        "text": f"Frame at {frame['timestamp']:.1f}s:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/jpeg",
                       "data": frame["image_b64"]}
        })
    content.append({
        "type": "text",
        "text": f"""These are frames sampled from a video ({len(frames)} total frames).
Analyze the video content temporally and answer: {question}
Include a timeline summary of key events."""
    })

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
```
Grounding and Spatial Reasoning
Grounding refers to connecting language outputs to specific regions of an image. When an LLM says “the red car on the left,” grounding ensures it can point to exact pixel coordinates. This is critical for applications like robotics, autonomous driving, and interactive image editing.
- Coordinate grounding: Models like Florence-2 and Kosmos-2 output bounding box coordinates alongside text descriptions
- Referring expression comprehension: Given “the person wearing a blue hat,” locate them in the image
- Spatial relationship reasoning: Understanding “above,” “next to,” “inside” relationships between objects
- Set-of-Mark prompting: Overlay numbered markers on image regions, then reference by number in prompts
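Set-of-Mark prompting can be sketched without any model in the loop: number each detected region, build a legend mapping marker numbers to boxes, and ask the LLM to answer by marker number. The detection format and `build_som_prompt` helper below are illustrative assumptions; in a real system the numbered markers would also be drawn onto the image (e.g. with OpenCV) before it is sent to the multimodal model.

```python
def build_som_prompt(detections: list, question: str):
    """Assign a numbered marker to each detection and build a prompt that
    references regions by marker number (Set-of-Mark style).

    Each detection is assumed to be {"class": str, "bbox": [x1, y1, x2, y2]}.
    Returns (markers, prompt): markers maps marker number -> bbox, so the
    LLM's answer ("marker 2") can be grounded back to pixel coordinates.
    """
    markers = {}
    legend_lines = []
    for i, det in enumerate(detections, start=1):
        markers[i] = det["bbox"]
        x1, y1, x2, y2 = det["bbox"]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        legend_lines.append(f"[{i}] {det['class']} centered at ({cx:.0f}, {cy:.0f})")
    prompt = (
        "The image has numbered markers on these regions:\n"
        + "\n".join(legend_lines)
        + f"\n\nAnswer by marker number: {question}"
    )
    return markers, prompt

# Usage with hypothetical detector output
markers, prompt = build_som_prompt(
    [{"class": "car", "bbox": [10, 20, 110, 80]},
     {"class": "person", "bbox": [200, 40, 240, 160]}],
    "Which marker shows the person nearest the car?"
)
print(prompt)
```

Because the legend ties each number to exact coordinates, a text-only answer like "marker 2" is immediately groundable, sidestepping the LLM's weak spatial precision.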
Vision Hallucination and Accuracy
Common hallucination patterns to watch for:
- Object hallucination: Claiming objects exist that are not in the image (especially small or partially occluded objects)
- Text misreading: OCR errors on stylized, rotated, or low-resolution text
- Count errors: Incorrectly counting objects, especially when there are many similar items
- Spatial confusion: Misidentifying left/right, relative positions, or distances
- Confabulated details: Adding brand names, model numbers, or specifications not visible in the image
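A practical mitigation for count errors and object hallucination is to cross-check the LLM's claims against a deterministic detector's output and flag disagreements for review. A minimal sketch; the claim and detection formats are illustrative:

```python
from collections import Counter

def cross_check_counts(llm_claims: dict, detections: list,
                       tolerance: int = 0) -> list:
    """Compare object counts claimed by an LLM against detector output.

    llm_claims maps class name -> count the LLM reported; detections is
    detector output as [{"class": ...}, ...]. Returns human-readable
    discrepancies to flag for human review.
    """
    detected = Counter(d["class"] for d in detections)
    issues = []
    for cls, claimed in llm_claims.items():
        found = detected.get(cls, 0)
        if abs(claimed - found) > tolerance:
            issues.append(f"{cls}: LLM claimed {claimed}, detector found {found}")
    return issues

# Usage: the LLM reported 4 people, the detector found 2
issues = cross_check_counts(
    {"person": 4, "car": 1},
    [{"class": "person"}, {"class": "person"}, {"class": "car"}]
)
print(issues)  # ['person: LLM claimed 4, detector found 2']
```

The same pattern extends to other failure modes, for example rejecting brand or model-number claims that no OCR engine can confirm from the image.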
Cost Comparison
| Approach | Cost per 1K Images | Latency (p50) | Infrastructure |
|---|---|---|---|
| Claude Vision API | $10–$50 | 2–4s | None (managed API) |
| GPT-4V API | $15–$65 | 3–6s | None (managed API) |
| YOLO (self-hosted) + Claude | $2–$8 + GPU cost | 0.5–2s | GPU server for YOLO |
| Florence-2 + Local LLM | GPU cost only | 1–3s | GPU server for both models |
| SAM + Claude | $5–$15 + GPU cost | 2–5s | GPU server for SAM |
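The ranges above can be turned into a rough monthly estimator by taking midpoints and adding a fixed infrastructure charge for self-hosted components. All figures in this sketch are illustrative assumptions (including the ~$300/month GPU server), not vendor quotes.

```python
# Rough monthly cost estimator using midpoints of the per-1K-image ranges
# in the table above. All figures are illustrative, not vendor pricing.

COST_MODEL = {  # approach -> (API cost per 1K images, fixed monthly GPU cost)
    "claude_vision_api": (30.0, 0.0),
    "gpt4v_api": (40.0, 0.0),
    "yolo_plus_claude": (5.0, 300.0),      # assumes ~$300/mo GPU server
    "florence2_local_llm": (0.0, 300.0),
    "sam_plus_claude": (10.0, 300.0),
}

def monthly_cost(approach: str, images_per_month: int) -> float:
    """Estimate total monthly cost for a given approach and volume."""
    per_1k, gpu_fixed = COST_MODEL[approach]
    return (images_per_month / 1000) * per_1k + gpu_fixed

# At low volume the managed APIs win; at high volume the fixed GPU cost
# amortizes and the pipeline approaches become cheaper.
for approach in COST_MODEL:
    print(f"{approach}: ${monthly_cost(approach, 100_000):,.0f}/mo at 100K images")
```

Under these assumptions the break-even between a managed API and a self-hosted pipeline lands in the low tens of thousands of images per month, which is why volume is usually the deciding factor.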