Beginner

Introduction to Embeddings

Embeddings are the bridge between human language and mathematics — they convert words, sentences, and documents into numerical vectors that capture semantic meaning.

What Are Embeddings?

An embedding is a numerical representation of data — a list of numbers (a vector) that captures the meaning of that data. When you embed a sentence like "The cat sat on the mat," you get a vector like [0.12, -0.34, 0.56, 0.78, ...] with hundreds or thousands of dimensions.

The magic is that similar meanings produce similar vectors. "The cat sat on the mat" and "A kitten rested on the rug" will have vectors that are close together in this high-dimensional space, even though they share few words.
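
A common way to measure how "close" two vectors are is cosine similarity. The sketch below uses short, hand-written vectors as stand-ins for real embeddings; the numbers and variable names are assumptions made for illustration, not output from any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors (assumed values, not produced by a model).
cat_on_mat = [0.9, 0.8, 0.1, 0.0]      # "The cat sat on the mat"
kitten_on_rug = [0.8, 0.9, 0.2, 0.1]   # "A kitten rested on the rug"
stock_market = [0.1, 0.0, 0.9, 0.8]    # an unrelated sentence about finance

sim_close = cosine_similarity(cat_on_mat, kitten_on_rug)
sim_far = cosine_similarity(cat_on_mat, stock_market)

print(f"cat vs kitten: {sim_close:.3f}")
print(f"cat vs stocks: {sim_far:.3f}")
```

With these toy values the two pet sentences score much higher than the unrelated pair. A real embedding model would supply the vectors, but the comparison step would look the same.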

💡 Key insight: Embeddings let computers understand meaning, not just words. This is the foundation of semantic search, RAG, recommendations, and most modern AI applications.

Why Embeddings Matter

Computers work with numbers, not words. Before embeddings, we had crude ways to represent text numerically:

  • Keyword matching: "Does the document contain this exact word?" No understanding of synonyms or meaning.
  • Bag of words: Count word frequencies. "Bank" (financial) and "bank" (river) are the same.
  • TF-IDF: Weighted word frequencies. Better, but still no semantic understanding.

Embeddings solved this by learning to represent meaning as geometry. Words, sentences, and documents become points in space where closeness reflects semantic similarity: the smaller the distance between two points, the more similar their meanings.
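
To see why count-based representations fall short, here is a minimal bag-of-words sketch. The helper names (`bag_of_words`, `count_cosine`) and the crude tokenizer are assumptions chosen for illustration. It compares the two paraphrases from earlier, plus a scrambled sentence that uses the same words with a different meaning.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Lowercase, strip basic punctuation, and count each word."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return Counter(words)

def count_cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

s1 = bag_of_words("The cat sat on the mat.")
s2 = bag_of_words("A kitten rested on the rug.")   # same meaning, different words
s3 = bag_of_words("The mat sat on the cat.")       # same words, different meaning

shared = set(s1) & set(s2)                 # only function words overlap
paraphrase_score = count_cosine(s1, s2)
scrambled_score = count_cosine(s1, s3)

print("shared words:", sorted(shared))
print(f"paraphrase score: {paraphrase_score:.3f}")
print(f"scrambled score:  {scrambled_score:.3f}")
```

The paraphrase overlaps only on "on" and "the", while the scrambled sentence looks identical to the original because word counts ignore order. This is the gap that learned embeddings aim to close.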

The History of Embeddings

| Era | Approach | Key Innovation |
| --- | --- | --- |
| Pre-2013 | One-hot encoding | Each word is a binary vector, e.g. "cat" = [0,0,1,0,...]. No semantic meaning. Huge, sparse vectors. |
| 2013 | Word2Vec (Google) | Popularized dense word embeddings learned from context. "King - Man + Woman ≈ Queen" emerged naturally. |
| 2014 | GloVe (Stanford) | Built from global word co-occurrence statistics. Complementary to Word2Vec, with similar results. |
| 2018 | BERT (Google) | Contextualized embeddings. The same word gets different vectors based on context: "bank" in "river bank" vs "bank account." |
| 2020+ | Modern models | Sentence-transformers, OpenAI embeddings, Voyage AI, Cohere Embed. Optimized for whole-sentence meaning. |

Words as Points in Space

Imagine a 2D map where every word is a dot. On this map:

  • "King" and "Queen" are close together (both royalty).
  • "Dog" and "Cat" are close together (both pets/animals).
  • "King" and "Banana" are far apart (unrelated concepts).

Real embeddings work in hundreds or thousands of dimensions, but the principle is the same: meaning is encoded as position.
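
The 2D map can be sketched directly. The coordinates below are hand-placed assumptions (no model produced them), and `nearest_word` is a hypothetical helper that looks up the closest neighbor by straight-line (Euclidean) distance.

```python
import math

# Hand-placed 2D coordinates, assumed purely for illustration.
word_map = {
    "king": (8.0, 9.0),
    "queen": (8.5, 9.5),
    "dog": (2.0, 3.0),
    "cat": (2.5, 2.5),
    "banana": (9.0, 1.0),
}

def nearest_word(word):
    """Return the other word whose point is closest in Euclidean distance."""
    origin = word_map[word]
    others = (w for w in word_map if w != word)
    return min(others, key=lambda w: math.dist(origin, word_map[w]))

print("nearest to king:", nearest_word("king"))
print("nearest to dog: ", nearest_word("dog"))
print("king-banana distance:", round(math.dist(word_map["king"], word_map["banana"]), 2))
```

Real models learn these positions from data instead of having them placed by hand, and they use far more than two coordinates, but a nearest-neighbor lookup works the same way.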

The King - Man + Woman = Queen Example

One of the most famous findings about embeddings is that vector arithmetic can capture analogies. Word2Vec showed that:

Vector Arithmetic
# The classic analogy test
vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")

# This works because:
# King - Man  = the concept of "royalty" (gender removed)
# + Woman     = add female gender back
# = Queen     = female royalty

# More examples:
# Paris - France + Italy ≈ Rome
# Walking - Walk + Swim ≈ Swimming
# Doctor - Man + Woman ≈ Nurse (note: reflects training data biases!)

Important caveat: Embeddings learn from training data, so they can encode societal biases. The "Doctor - Man + Woman ≈ Nurse" example reflects gender bias in the training corpus. Debiasing techniques exist and can reduce such effects, but they do not reliably remove them, so it is important to stay aware of this limitation.
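
The analogy arithmetic can be tried on toy data. In the sketch below, the three dimensions (roughly "royalty", "maleness", "fruitiness") and all values are assumptions chosen so the arithmetic is easy to follow; a trained model has no such labeled axes. The hypothetical `solve_analogy` helper excludes the three query words from the candidates, as analogy tests usually do.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: [royalty, maleness, fruitiness] (assumed values).
vectors = {
    "king": [0.9, 0.9, 0.0],
    "queen": [0.9, -0.9, 0.0],
    "man": [0.1, 0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "prince": [0.8, 0.8, 0.1],
    "princess": [0.8, -0.8, 0.1],
    "banana": [0.0, 0.0, 1.0],
}

def solve_analogy(start, minus, plus):
    """Return the word closest to vector(start) - vector(minus) + vector(plus)."""
    target = [s - m + p for s, m, p in
              zip(vectors[start], vectors[minus], vectors[plus])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in (start, minus, plus)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

print("king - man + woman   ->", solve_analogy("king", "man", "woman"))
print("prince - man + woman ->", solve_analogy("prince", "man", "woman"))
```

On real embeddings the result is only approximately equal to the target word's vector, which is why the lookup returns the nearest candidate instead of testing for equality.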

From Words to Sentences to Documents

Early embeddings (Word2Vec, GloVe) worked at the word level. Modern embedding models work at the sentence or document level, which is far more useful for applications:

  • Word embeddings: "bank" always has the same vector, regardless of context.
  • Sentence embeddings: "I went to the bank to deposit money" and "I sat on the river bank" produce different vectors, and the two senses of "bank" are told apart, because the model considers the full context.
  • Document embeddings: Entire paragraphs or pages are compressed into a single vector that captures the overall topic and meaning.
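
One naive way to go from word vectors to a sentence vector is to average them. The sketch below is that baseline only, not how modern contextual models work; the two dimensions ("finance", "nature"), the values, and the `average_embedding` helper are assumptions for illustration. Note that "bank" keeps a single static vector, so any difference between the two sentence vectors comes from the surrounding words.

```python
# Static word vectors, hand-written for illustration: [finance, nature].
static_vectors = {
    "bank": [0.5, 0.5],      # one vector shared by both senses of the word
    "deposit": [0.9, 0.0],
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
    "sat": [0.1, 0.3],
}

def average_embedding(sentence):
    """Naive sentence vector: the mean of the known words' static vectors.

    Assumes at least one word of the sentence is in the lookup table.
    """
    known = [static_vectors[w] for w in sentence.lower().split()
             if w in static_vectors]
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]

finance_sentence = average_embedding("I went to the bank to deposit money")
river_sentence = average_embedding("I sat on the river bank")

print("finance sentence:", [round(x, 2) for x in finance_sentence])
print("river sentence:  ", [round(x, 2) for x in river_sentence])
```

Averaging ignores word order and cannot give "bank" itself a context-specific meaning. Contextual models address this by processing the whole input together.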

What Can You Do with Embeddings?

  • Semantic search: Find documents by meaning, not keywords.
  • RAG (Retrieval-Augmented Generation): Ground LLM responses in relevant documents.
  • Recommendations: Suggest similar items based on embedding proximity.
  • Clustering: Group similar documents automatically.
  • Classification: Categorize text based on embedding similarity to category examples.
  • Duplicate detection: Find near-duplicate content.
  • Anomaly detection: Identify outliers that do not match any cluster.
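
Several of these applications reduce to the same operation: rank stored vectors by similarity to a query vector. Here is a minimal semantic search sketch under stated assumptions: the document titles, the three-dimensional vectors, and the `search` helper are all hypothetical, and in practice an embedding model would produce the vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed document embeddings (toy values).
documents = {
    "How to reset your password": [0.9, 0.1, 0.0],
    "Recovering a locked account": [0.8, 0.3, 0.1],
    "Quarterly revenue report": [0.0, 0.2, 0.9],
}

def search(query_vector, top_k=2):
    """Return the top_k (score, title) pairs, most similar first."""
    scored = [(cosine_similarity(query_vector, vec), title)
              for title, vec in documents.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

# Stands in for the embedding of a query such as "I forgot my login".
query = [0.85, 0.2, 0.05]
results = search(query)

for score, title in results:
    print(f"{score:.3f}  {title}")
```

The query shares no keywords with the account-related titles, yet they rank above the revenue report because their toy vectors point in a similar direction. Recommendations and duplicate detection follow the same pattern, with a similarity threshold or a different item set.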

💡 Think About It

Consider the sentence "The bank approved my loan." What information would the embedding need to capture to distinguish this from "The river bank was muddy"?

The embedding model uses surrounding words ("approved," "loan" vs "river," "muddy") to understand context. Modern models process the entire input and produce a single vector that encodes this contextual meaning.