Lesson 3 of 13

Embedding Models

Transform text, images, and code into numerical vectors that capture semantic meaning

What Are Embeddings?

Embeddings are numerical vector representations of data — text, images, audio, or code — that capture the semantic meaning of that data in a high-dimensional space. Instead of treating words as arbitrary symbols, embedding models convert them into dense arrays of floating-point numbers where similar concepts are close together and dissimilar concepts are far apart.

Think of it this way: the words "king" and "queen" would have vectors that are near each other in embedding space, while "king" and "refrigerator" would be far apart. This mathematical representation allows machines to understand and compare meaning in ways that raw text cannot support.

Key Insight: Embeddings are the bridge between human language and mathematical computation. They let you perform operations like "find documents similar to this query" or "cluster these articles by topic" using pure vector math.

How Embedding Models Work

Embedding models use deep neural networks — typically transformer architectures — to encode input data into dense, fixed-length vectors. Here is the general process:

  1. Tokenization: Input text is split into tokens (words, subwords, or characters).
  2. Encoding: Tokens pass through multiple transformer layers that learn contextual relationships between them.
  3. Pooling: The model aggregates token-level representations into a single vector (using mean pooling, CLS token pooling, or other strategies).
  4. Normalization: The final vector is often L2-normalized so that cosine similarity can be computed efficiently.

The resulting vector — typically between 256 and 3072 dimensions — encodes the semantic meaning of the entire input. Two pieces of text with similar meaning will produce vectors with a high cosine similarity score (close to 1.0), even if they use completely different words.
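The pooling and normalization steps can be sketched in a few lines of NumPy. This is an illustration only: the token vectors below are random stand-ins for real transformer outputs, and the dimensions are kept tiny for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for transformer output: 6 tokens, each an 8-dimensional hidden state
# (real models use hundreds or thousands of dimensions per token).
token_vectors = rng.normal(size=(6, 8))

# Mean pooling: average the token-level vectors into one sentence vector.
sentence_vector = token_vectors.mean(axis=0)

# L2 normalization: scale to unit length, so the dot product of two
# normalized vectors equals their cosine similarity.
sentence_vector /= np.linalg.norm(sentence_vector)

print(sentence_vector.shape)            # (8,)
print(np.linalg.norm(sentence_vector))  # 1.0 (up to floating-point error)
```

After normalization, comparing two embeddings is just a dot product, which is why most vector databases default to cosine or inner-product distance.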

Important: Embeddings are not the same as LLM outputs. An LLM generates text; an embedding model generates a fixed-size numerical vector. They serve fundamentally different purposes.

Types of Embeddings

Text Embeddings

The most common type. These models convert sentences, paragraphs, or entire documents into vectors. They are the foundation of semantic search, retrieval-augmented generation (RAG), and document clustering.

Image Embeddings

Vision models like CLIP produce vectors for images, enabling you to search for images using text queries ("a cat sitting on a laptop") or find visually similar images. Image embeddings power reverse image search and visual recommendation systems.

Code Embeddings

Specialized models encode source code into vectors, capturing programming logic and structure. These are used for code search, duplicate detection, and vulnerability scanning. Models like CodeBERT and GraphCodeBERT produce code-aware embeddings.

Multimodal Embeddings

Models like CLIP and ImageBind produce embeddings that place text, images, and sometimes audio into the same vector space, enabling cross-modal search and retrieval.

Key Embedding Models

The embedding model landscape has evolved rapidly. Here are the most important models to know:

| Model | Provider | Dimensions | Max Tokens | Best For |
|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | 8191 | High-accuracy retrieval, RAG |
| text-embedding-3-small | OpenAI | 1536 | 8191 | Cost-effective general purpose |
| Cohere Embed v3 | Cohere | 1024 | 512 | Multilingual, search & classification |
| BGE-large-en-v1.5 | BAAI | 1024 | 512 | Open-source, self-hosted |
| E5-large-v2 | Microsoft | 1024 | 512 | Open-source, strong MTEB scores |
| GTE-large | Alibaba | 1024 | 512 | Open-source, multilingual |
| Sentence-BERT | UKP Lab | 768 | 512 | Lightweight, fast inference |
| CLIP (ViT-L/14) | OpenAI | 768 | 77 | Text-image cross-modal search |

Choosing dimensions: Higher dimensions generally capture more nuance but require more storage and compute. OpenAI's text-embedding-3 models support dimension reduction via the Matryoshka technique — you can request 256, 512, or 1024 dimensions instead of the full size while retaining most quality.
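Conceptually, Matryoshka-style reduction amounts to truncating the vector and re-normalizing it. With the OpenAI API you would instead pass the `dimensions` parameter to `embeddings.create`; the sketch below applies the same truncate-and-renormalize step locally, using a random vector as a stand-in for a real 3072-dimensional embedding.

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and re-normalize to unit length."""
    truncated = np.asarray(vec)[:dims]
    return truncated / np.linalg.norm(truncated)

rng = np.random.default_rng(0)
full = rng.normal(size=3072)           # stand-in for a full-size embedding
short = truncate_embedding(full, 256)  # compact 256-dimensional version

print(short.shape)           # (256,)
print(np.linalg.norm(short))  # 1.0
```

The truncated vector trades some accuracy for a 12x reduction in storage, which is often a good deal at scale.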

Use Cases for Embedding Models

Semantic Search

Traditional keyword search fails when users phrase queries differently from the source documents. Embedding-based search compares meaning, not words. A query like "how to fix a broken pipe" will match documents about plumbing repairs even if those documents never use the exact phrase.

Retrieval-Augmented Generation (RAG)

RAG is the most popular enterprise AI pattern today. You embed your documents into a vector database, then at query time, retrieve the most relevant chunks and pass them as context to an LLM. Embedding quality directly determines RAG quality.
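Assuming the document chunks have already been embedded (random vectors stand in for them here), the retrieval step of a RAG pipeline reduces to a top-k cosine search:

```python
import numpy as np

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity per chunk
    return np.argsort(scores)[::-1][:k]  # highest-scoring indices first

rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(100, 64))  # stand-ins for 100 embedded chunks

# Simulate a query that is semantically close to chunk 42.
query_vec = chunk_vecs[42] + rng.normal(scale=0.1, size=64)

indices = top_k(query_vec, chunk_vecs)
print(indices)  # chunk 42 ranks first, since the query is a noisy copy of it
```

In a real pipeline, the text of these top chunks would then be concatenated into the LLM prompt as context.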

Clustering and Topic Modeling

Embed a corpus of documents, then apply clustering algorithms (K-means, HDBSCAN) to group them by topic. This works without any predefined categories — the clusters emerge from the semantic structure of the data itself.
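As a minimal sketch, here is K-means implemented directly in NumPy, run on synthetic "embeddings" drawn around two different means (standing in for two real topics). In practice you would use a library implementation such as scikit-learn's `KMeans` or HDBSCAN.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: returns a cluster label for each point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Two synthetic "topics": vectors clustered around different means.
rng = np.random.default_rng(1)
topic_a = rng.normal(loc=+2.0, size=(20, 8))
topic_b = rng.normal(loc=-2.0, size=(20, 8))
labels = kmeans(np.vstack([topic_a, topic_b]), k=2)
print(labels)
```

With real embeddings the clusters are far less separable than this toy example, so the choice of k (or a density-based algorithm that infers it) matters a great deal.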

Deduplication

Find near-duplicate documents, support tickets, or records by computing embeddings and identifying pairs with very high cosine similarity (e.g., > 0.95). This catches duplicates that differ in phrasing but express the same content.
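The pairwise check can be sketched with NumPy. Random vectors stand in for real embeddings here, with one planted near-duplicate:

```python
import numpy as np

def find_duplicates(vectors, threshold=0.95):
    """Return index pairs whose cosine similarity exceeds the threshold."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v.T                       # all-pairs cosine similarity matrix
    i, j = np.triu_indices(len(v), k=1)  # upper triangle: each pair once
    mask = sims[i, j] > threshold
    return list(zip(i[mask].tolist(), j[mask].tolist()))

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 64))
vectors[3] = vectors[0] + rng.normal(scale=0.01, size=64)  # near-duplicate of row 0

print(find_duplicates(vectors))  # [(0, 3)]
```

Note that the all-pairs matrix is O(n²); at scale you would use an ANN index rather than materializing it.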

Recommendation Systems

Embed product descriptions, articles, or user profiles. Then recommend items whose embeddings are closest to the user's interest vector. This enables content-based recommendations without collaborative filtering data.
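Assuming item embeddings are available (random stand-ins below), a content-based recommender can average the user's liked-item vectors into an interest vector and rank the catalog against it:

```python
import numpy as np

def recommend(liked_vecs, catalog_vecs, n=3):
    """Rank catalog items by similarity to the mean of the liked items."""
    interest = liked_vecs.mean(axis=0)
    interest /= np.linalg.norm(interest)
    catalog = catalog_vecs / np.linalg.norm(catalog_vecs, axis=1, keepdims=True)
    scores = catalog @ interest
    return np.argsort(scores)[::-1][:n]

rng = np.random.default_rng(0)
catalog_vecs = rng.normal(size=(50, 32))  # stand-ins for item embeddings

# Simulate a user whose liked items all resemble catalog item 7.
liked_vecs = catalog_vecs[7] + rng.normal(scale=0.2, size=(3, 32))

recs = recommend(liked_vecs, catalog_vecs)
print(recs)  # item 7 ranks first
```

A production system would also filter out items the user has already seen before ranking.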

Vector Databases

Embeddings need to be stored and searched efficiently. Vector databases are purpose-built for this, supporting approximate nearest neighbor (ANN) search over millions or billions of vectors.

| Database | Type | Key Features |
|---|---|---|
| Pinecone | Managed cloud | Serverless, auto-scaling, metadata filtering |
| Weaviate | Open source / cloud | Hybrid search, GraphQL API, modules |
| ChromaDB | Open source | Lightweight, Python-native, great for prototyping |
| Qdrant | Open source / cloud | Rust-based, fast, advanced filtering |
| pgvector | PostgreSQL extension | Use existing Postgres infrastructure, SQL interface |
| Milvus | Open source / cloud | Highly scalable, GPU-accelerated, enterprise-ready |

Code Example: Generating Embeddings with OpenAI

Here is how to generate text embeddings using the OpenAI API in Python:

Python
from openai import OpenAI

client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    """Generate an embedding vector for the given text."""
    text = text.replace("\n", " ")
    response = client.embeddings.create(
        input=[text],
        model=model
    )
    return response.data[0].embedding

# Generate embeddings for sample texts
texts = [
    "The cat sat on the mat",
    "A kitten was resting on the rug",
    "Python is a programming language",
]

embeddings = [get_embedding(t) for t in texts]

print(f"Vector dimensions: {len(embeddings[0])}")
print(f"First 5 values: {embeddings[0][:5]}")
# Output: Vector dimensions: 1536
# Output: First 5 values: [0.0023, -0.0091, 0.0152, ...]

Code Example: Cosine Similarity Calculation

Once you have embeddings, you can compute how similar two texts are using cosine similarity:

Python
import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Compute cosine similarity between two vectors."""
    dot_product = np.dot(vec_a, vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    return dot_product / (norm_a * norm_b)

# Compare our three texts
sim_cat_kitten = cosine_similarity(embeddings[0], embeddings[1])
sim_cat_python = cosine_similarity(embeddings[0], embeddings[2])

print(f"'cat on mat' vs 'kitten on rug': {sim_cat_kitten:.4f}")
print(f"'cat on mat' vs 'Python language': {sim_cat_python:.4f}")
# Output: 'cat on mat' vs 'kitten on rug': 0.8742
# Output: 'cat on mat' vs 'Python language': 0.1523

# Find the most similar text to a query
query = "Where is the feline?"
query_embedding = get_embedding(query)

similarities = [
    (text, cosine_similarity(query_embedding, emb))
    for text, emb in zip(texts, embeddings)
]

# Sort by similarity (highest first)
similarities.sort(key=lambda x: x[1], reverse=True)
for text, score in similarities:
    print(f"  {score:.4f} — {text}")

Performance tip: For production workloads, avoid computing cosine similarity manually. Use a vector database with ANN indexes (HNSW, IVF) that can search millions of vectors in milliseconds.

When to Use Embedding Models

Embedding models are the right choice when you need to:

  • Search by meaning rather than keywords — semantic search and question answering over documents
  • Build a RAG pipeline — embedding your knowledge base is the first step in retrieval-augmented generation
  • Group or classify text without labeled training data — clustering and zero-shot classification via embeddings
  • Detect duplicates or find similar items across large datasets
  • Build recommendations based on content similarity rather than user behavior

Embedding models are not the right choice when you need to generate text, translate languages, or perform complex reasoning. For those tasks, use an LLM instead. Embeddings capture what something means; LLMs determine what to say about it.

Summary

  • Embeddings convert data into dense numerical vectors that capture semantic meaning.
  • Modern embedding models use transformer architectures and produce vectors with 256 to 3072 dimensions.
  • Key models include OpenAI text-embedding-3, Cohere Embed v3, BGE, E5, GTE, and CLIP for multimodal.
  • Primary use cases are semantic search, RAG, clustering, deduplication, and recommendations.
  • Vector databases like Pinecone, Weaviate, ChromaDB, and Qdrant are essential for production embedding workloads.
  • Cosine similarity is the standard metric for comparing embedding vectors.