Multi-Modal Pretrained Models

Multi-modal models process and generate content across multiple data types — text, images, audio, and video. These models can understand images and answer questions, generate images from text, analyze documents, and more.

Vision-Language Models

CLIP (OpenAI)

Contrastive Language-Image Pre-training. Learns to match images with text descriptions. Powers zero-shot image classification, image search, and similarity scoring.

Python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
inputs = processor(text=["a cat", "a dog", "a car"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)  # Probability for each label

LLaVA

Large Language and Vision Assistant. Combines a vision encoder with an LLM to understand images and answer questions about them in natural language.

Python
from transformers import pipeline

vlm = pipeline("image-text-to-text", model="llava-hf/llava-1.5-7b-hf")
prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
result = vlm(images="photo.jpg", text=prompt, max_new_tokens=200)
print(result[0]["generated_text"])

InternVL & Qwen-VL

InternVL (Shanghai AI Lab) and Qwen-VL (Alibaba) are powerful open-source vision-language models with strong OCR, chart understanding, and visual reasoning capabilities.
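As a sketch, Qwen-VL models can be run through the same image-text-to-text pipeline used for LLaVA above, this time with a chat-style message list. The checkpoint name, image file, and question below are illustrative, and a recent transformers release is assumed:

Python

```python
from transformers import pipeline

# Qwen2-VL via the generic image-text-to-text pipeline
# (checkpoint and inputs are illustrative)
vlm = pipeline("image-text-to-text", model="Qwen/Qwen2-VL-2B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "chart.png"},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }
]
result = vlm(text=messages, max_new_tokens=128, return_full_text=False)
print(result[0]["generated_text"])
```

The chat-message format lets the pipeline apply the model's own template, which is where the image placeholder tokens these models expect get inserted.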

Image + Text Generation

Stable Diffusion (Text-to-Image)

Generates images from text prompts using a latent diffusion model. SDXL and Stable Diffusion 3 are more recent releases.

Python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe("A futuristic city at night, cyberpunk style, detailed").images[0]
image.save("cityscape.png")

DALL-E

OpenAI's text-to-image model. DALL-E 3 generates highly detailed, creative images. Available through the OpenAI API.

Video Understanding

VideoLLaMA

Extends LLaMA with video understanding capabilities. Can answer questions about video content, describe actions, and summarize scenes.

InternVideo

Video foundation model for action recognition, video-text retrieval, and video captioning. Trained on large-scale video-text data.

Document AI

LayoutLM (Microsoft)

Pre-trained model for document understanding. Combines text, layout (position), and image features to understand forms, invoices, receipts, and other documents.
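One quick way to try this family of models is Hugging Face's document-question-answering pipeline with a LayoutLM checkpoint fine-tuned for the task. The checkpoint shown is a common community fine-tune, the file name is illustrative, and pytesseract must be installed for the built-in OCR step:

Python

```python
from transformers import pipeline

# Document Q&A with a LayoutLM fine-tune
# (requires pytesseract; "invoice.png" is an illustrative file name)
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

result = doc_qa(image="invoice.png", question="What is the invoice total?")
print(result[0]["answer"], result[0]["score"])
```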

Donut

Document understanding transformer that processes document images directly without OCR. End-to-end document parsing.
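A sketch of OCR-free parsing with Donut, following the usage shown on the receipt-parsing (CORD) checkpoint's model card; the input file name is illustrative:

Python

```python
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Receipt parsing with Donut -- no external OCR engine needed
# (checkpoint and task prompt per the model card; "receipt.png" is illustrative)
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is steered by a task-specific prompt token
decoder_input_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=512)
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
print(processor.token2json(sequence))  # structured fields as a dict
```

Because the decoder emits a tagged sequence rather than raw text, token2json converts the output directly into structured fields (menu items, totals, etc.).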

OCR Models

TrOCR (Microsoft)

Transformer-based OCR that combines an image Transformer encoder with a text Transformer decoder for text recognition.

Python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("text_image.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)

EasyOCR & PaddleOCR

EasyOCR supports 80+ languages and is easy to set up. PaddleOCR (by Baidu) offers state-of-the-art accuracy for many languages, especially Chinese.

Multi-Modal Models Summary

Model | Modalities | Task | Best For
CLIP | Image + Text | Similarity, zero-shot classification | Image search, labeling
LLaVA | Image + Text | Visual Q&A, description | Image understanding
Stable Diffusion | Text → Image | Image generation | Creative content
LayoutLM | Document + Text | Document understanding | Form extraction
TrOCR | Image → Text | OCR | Text recognition

Next Up

Learn the practical steps for loading, running, and serving any pretrained model.

Next: Using Models →