Multi-Modal Pretrained Models

Multi-modal models process and generate content across multiple data types — text, images, audio, and video. These models can understand images and answer questions, generate images from text, analyze documents, and more.

Vision-Language Models

CLIP (OpenAI)

Contrastive Language-Image Pre-training. Learns to match images with text descriptions. Powers zero-shot image classification, image search, and similarity scoring.

Python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
inputs = processor(text=["a cat", "a dog", "a car"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)  # Probability for each label

LLaVA

Large Language and Vision Assistant. Combines a vision encoder with an LLM to understand images and answer questions about them in natural language.

Python
from transformers import pipeline

vlm = pipeline("image-text-to-text", model="llava-hf/llava-1.5-7b-hf")
prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
result = vlm(images="photo.jpg", text=prompt, max_new_tokens=200)
print(result[0]["generated_text"])

InternVL & Qwen-VL

InternVL (Shanghai AI Lab) and Qwen-VL (Alibaba) are powerful open-source vision-language models with strong OCR, chart understanding, and visual reasoning capabilities.
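As a sketch, Qwen-VL models can be run through the same image-text-to-text pipeline used for LLaVA above, this time with a chat-style message list. The checkpoint name, image file, and question below are illustrative, and a recent transformers release is assumed:

Python

```python
from transformers import pipeline

# Qwen2-VL via the generic image-text-to-text pipeline
# (checkpoint and inputs are illustrative)
vlm = pipeline("image-text-to-text", model="Qwen/Qwen2-VL-2B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "chart.png"},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }
]
result = vlm(text=messages, max_new_tokens=128, return_full_text=False)
print(result[0]["generated_text"])
```

The chat-message format lets the pipeline apply the model's own template, which is where the image placeholder tokens these models expect get inserted.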

Image + Text Generation

Stable Diffusion (Text-to-Image)

Generates images from text prompts using a latent diffusion model. SDXL and Stable Diffusion 3 are more recent releases.

Python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe("A futuristic city at night, cyberpunk style, detailed").images[0]
image.save("cityscape.png")

DALL-E

OpenAI's text-to-image model. DALL-E 3 generates highly detailed, creative images. Available through the OpenAI API.

Video Understanding

VideoLLaMA

Extends LLaMA with video understanding capabilities. Can answer questions about video content, describe actions, and summarize scenes.

InternVideo

Video foundation model for action recognition, video-text retrieval, and video captioning. Trained on large-scale video-text data.

Document AI

LayoutLM (Microsoft)

Pre-trained model for document understanding. Combines text, layout (position), and image features to understand forms, invoices, receipts, and other documents.
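One quick way to try this family of models is Hugging Face's document-question-answering pipeline with a LayoutLM checkpoint fine-tuned for the task. The checkpoint shown is a common community fine-tune, the file name is illustrative, and pytesseract must be installed for the built-in OCR step:

Python

```python
from transformers import pipeline

# Document Q&A with a LayoutLM fine-tune
# (requires pytesseract; "invoice.png" is an illustrative file name)
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

result = doc_qa(image="invoice.png", question="What is the invoice total?")
print(result[0]["answer"], result[0]["score"])
```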

Donut

Document understanding transformer that processes document images directly without OCR. End-to-end document parsing.
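A sketch of OCR-free parsing with Donut, following the usage shown on the receipt-parsing (CORD) checkpoint's model card; the input file name is illustrative:

Python

```python
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Receipt parsing with Donut -- no external OCR engine needed
# (checkpoint and task prompt per the model card; "receipt.png" is illustrative)
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is steered by a task-specific prompt token
decoder_input_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=512)
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
print(processor.token2json(sequence))  # structured fields as a dict
```

Because the decoder emits a tagged sequence rather than raw text, token2json converts the output directly into structured fields (menu items, totals, etc.).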

OCR Models

TrOCR (Microsoft)

Transformer-based OCR that combines an image Transformer encoder with a text Transformer decoder for text recognition.

Python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("text_image.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)

EasyOCR & PaddleOCR

EasyOCR supports 80+ languages and is easy to set up. PaddleOCR (by Baidu) offers state-of-the-art accuracy for many languages, especially Chinese.

Multi-Modal Models Summary

Model | Modalities | Task | Best For
CLIP | Image + Text | Similarity, zero-shot classification | Image search, labeling
LLaVA | Image + Text | Visual Q&A, description | Image understanding
Stable Diffusion | Text → Image | Image generation | Creative content
LayoutLM | Document + Text | Document understanding | Form extraction
TrOCR | Image → Text | OCR | Text recognition

Next Up

Learn the practical steps for loading, running, and serving any pretrained model.

Next: Using Models →