Introduction to Embeddings
Embeddings are the bridge between human language and mathematics — they convert words, sentences, and documents into numerical vectors that capture semantic meaning.
What Are Embeddings?
An embedding is a numerical representation of data — a list of numbers (a vector) that captures the meaning of that data. When you embed a sentence like "The cat sat on the mat," you get a vector like [0.12, -0.34, 0.56, 0.78, ...] with hundreds or thousands of dimensions.
The magic is that similar meanings produce similar vectors. "The cat sat on the mat" and "A kitten rested on the rug" will have vectors that are close together in this high-dimensional space, even though they share few words.
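One common way to measure how "close" two vectors are is cosine similarity. The sketch below is only an illustration: the 4-dimensional vectors are hand-made toy values (not output from any real embedding model), and the third sentence is a hypothetical unrelated example added for contrast.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumption: real models produce hundreds of dimensions).
cat_on_mat = [0.9, 0.8, 0.1, 0.0]      # stands in for "The cat sat on the mat"
kitten_on_rug = [0.8, 0.9, 0.2, 0.1]   # stands in for "A kitten rested on the rug"
stock_prices = [0.0, 0.1, 0.9, 0.8]    # stands in for an unrelated sentence

sim_related = cosine_similarity(cat_on_mat, kitten_on_rug)
sim_unrelated = cosine_similarity(cat_on_mat, stock_prices)

print(f"related pair:   {sim_related:.3f}")
print(f"unrelated pair: {sim_unrelated:.3f}")
```

With these toy values the related pair scores far higher than the unrelated pair, which is the behavior a real embedding model is trained to produce.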
Why Embeddings Matter
Computers work with numbers, not words. Before embeddings, we had crude ways to represent text numerically:
- Keyword matching: "Does the document contain this exact word?" No understanding of synonyms or meaning.
- Bag of words: Count word frequencies and ignore word order. "Bank" (financial) and "bank" (river) are treated as the same token.
- TF-IDF: Word frequencies weighted by how rare each word is across the collection. Better, but still no semantic understanding.
Embeddings address this by learning to represent meaning as geometry. Words, sentences, and documents become points in space, and the closer two points are, the more similar their meanings.
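The limitation of the count-based methods listed above can be seen in a small sketch. It builds a bag-of-words count for the two example sentences from earlier and checks which words they share. The `bag_of_words` helper is a hypothetical, simplified tokenizer written for this example, not a library function.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, strip basic punctuation, and count each word."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return Counter(words)

a = bag_of_words("The cat sat on the mat.")
b = bag_of_words("A kitten rested on the rug.")

# Only the exact same tokens count as overlap.
shared = set(a) & set(b)
print(sorted(shared))  # ['on', 'the']
```

The only overlap is in function words, so a count-based comparison sees these sentences as nearly unrelated even though they describe almost the same scene.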
The History of Embeddings
| Era | Approach | Key Innovation |
|---|---|---|
| Pre-2013 | One-hot encoding | Each word is a binary vector. "cat" = [0,0,1,0,...]. No semantic meaning. Huge, sparse vectors. |
| 2013 | Word2Vec (Google) | Made dense word embeddings practical and widely used (earlier neural language models had already learned dense word vectors). Learned from context. "King - Man + Woman = Queen" emerged naturally. |
| 2014 | GloVe (Stanford) | Global word co-occurrence statistics. Complementary to Word2Vec with similar results. |
| 2018 | BERT (Google) | Contextualized embeddings. The same word gets different vectors based on context. "Bank" in "river bank" vs "bank account." |
| 2020+ | Modern models | Sentence-transformers, OpenAI embeddings, Voyage AI, Cohere Embed. Optimized for whole-sentence meaning. |
Words as Points in Space
Imagine a 2D map where every word is a dot. On this map:
- "King" and "Queen" are close together (both royalty).
- "Dog" and "Cat" are close together (both pets/animals).
- "King" and "Banana" are far apart (unrelated concepts).
Real embeddings work in hundreds or thousands of dimensions, but the principle is the same: meaning is encoded as position.
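The 2D map can be written down directly as a toy example. The coordinates below are hypothetical values chosen by hand to mirror the description above. A real model learns its positions from data and uses many more dimensions.

```python
import math

# Hand-picked 2D coordinates (illustrative assumption, not learned values).
word_map = {
    "king":   (8.0, 9.0),
    "queen":  (8.5, 9.5),
    "dog":    (2.0, 3.0),
    "cat":    (2.5, 3.5),
    "banana": (9.0, 1.0),
}

def distance(w1, w2):
    """Euclidean distance between two words on the toy map."""
    (x1, y1), (x2, y2) = word_map[w1], word_map[w2]
    return math.hypot(x1 - x2, y1 - y2)

def nearest(word):
    """Return the other word that lies closest to `word`."""
    others = (w for w in word_map if w != word)
    return min(others, key=lambda w: distance(word, w))

print(nearest("king"))  # queen
print(nearest("dog"))   # cat
print(distance("king", "banana") > distance("king", "queen"))  # True
```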
The King - Man + Woman = Queen Example
The most famous embedding discovery is that vector arithmetic captures analogies. Word2Vec showed that:
# The classic analogy test
vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")
# This works because:
# King - Man = the concept of "royalty" (gender removed)
# + Woman = add female gender back
# = Queen = female royalty
# More examples:
# Paris - France + Italy ≈ Rome
# Walking - Walk + Swim ≈ Swimming
# Doctor - Man + Woman ≈ Nurse (note: reflects training data biases!)
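The arithmetic above can be tried on a tiny hand-made vocabulary. This is a sketch under strong assumptions: the four axes (royalty, male, female, food) and all values are invented so that the analogy works cleanly. Real Word2Vec dimensions are learned and have no such readable labels.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical axes: [royalty, male, female, food]. Values are hand-picked.
vectors = {
    "king":   [0.9, 0.9, 0.1, 0.0],
    "queen":  [0.9, 0.1, 0.9, 0.0],
    "man":    [0.1, 0.9, 0.1, 0.0],
    "woman":  [0.1, 0.1, 0.9, 0.0],
    "banana": [0.0, 0.1, 0.1, 0.9],
}

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the input words, as analogy evaluations commonly do.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)  # queen
```

Note that the result is the nearest word to the computed point, not an exact match, which is why the original notation uses ≈ rather than =.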
From Words to Sentences to Documents
Early embeddings (Word2Vec, GloVe) worked at the word level. Modern embedding models work at the sentence or document level, which is far more useful for applications:
- Word embeddings: In static models such as Word2Vec and GloVe, "bank" always has the same vector, regardless of context.
- Sentence embeddings: "I went to the bank to deposit money" and "I sat on the river bank" get different vectors because the model encodes the full sentence, so the two senses of "bank" are kept apart.
- Document embeddings: Entire paragraphs or pages are compressed into a single vector that captures the overall topic and meaning.
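To make the difference between the levels concrete, here is a minimal sketch that turns word vectors into a sentence vector by averaging them. Averaging static word vectors is a simple baseline used here purely for illustration. It is an assumption of this example, not a description of how the models named in this lesson work internally. The 2D vectors (axes: finance, nature) and the `embed_sentence` helper are hypothetical.

```python
# Hand-made static word vectors on two invented axes: [finance, nature].
word_vectors = {
    "bank":    [0.5, 0.5],   # ambiguous: one fixed vector for both senses
    "deposit": [0.9, 0.0],
    "money":   [1.0, 0.0],
    "river":   [0.0, 1.0],
}

def embed_sentence(sentence):
    """Average the vectors of known words; unknown words are skipped."""
    known = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not known:
        raise ValueError("no known words in sentence")
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]

finance = embed_sentence("I went to the bank to deposit money")
nature = embed_sentence("I sat on the river bank")

print(word_vectors["bank"])  # identical lookup in both sentences
print(finance)               # leans toward the finance axis
print(nature)                # leans toward the nature axis
```

Even with this crude baseline, the surrounding words pull the two sentence vectors in different directions, while the word-level vector for "bank" stays the same.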
What Can You Do with Embeddings?
- Semantic search: Find documents by meaning, not keywords.
- RAG (Retrieval-Augmented Generation): Ground LLM responses in relevant documents.
- Recommendations: Suggest similar items based on embedding proximity.
- Clustering: Group similar documents automatically.
- Classification: Categorize text based on embedding similarity to category examples.
- Duplicate detection: Find near-duplicate content.
- Anomaly detection: Identify outliers that do not match any cluster.
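Several of the uses above reduce to the same operation: compare one vector against many and rank by similarity. The sketch below shows semantic search in that form. The document titles, their 3-dimensional vectors, and the query vector are all invented for illustration. In practice each vector would come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed document embeddings (toy values).
documents = {
    "How to reset your password":  [0.9, 0.1, 0.0],
    "Recovering a locked account": [0.8, 0.3, 0.1],
    "Quarterly revenue report":    [0.0, 0.2, 0.9],
}

# Stands in for the embedding of a query such as "I forgot my login".
query_vector = [0.85, 0.2, 0.05]

def search(query_vec, docs, top_k=2):
    """Return the top_k (score, title) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in docs.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

results = search(query_vector, documents)
for score, title in results:
    print(f"{score:.3f}  {title}")
```

The query shares no keywords with the two account-related titles, yet they rank above the revenue report because their vectors point in a similar direction. The same ranking step underlies retrieval for RAG, recommendations, and duplicate detection.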
💡 Think About It
Consider the sentence "The bank approved my loan." What information would the embedding need to capture to distinguish this from "The river bank was muddy"?