How Vector Databases Work
Dive into the algorithms and data structures that make fast similarity search possible — indexing strategies, distance metrics, and the trade-offs you need to understand.
The Core Problem: Nearest Neighbor Search
Given a query vector, find the k closest vectors in a dataset of millions or billions. A brute-force approach compares the query to every stored vector — accurate but impossibly slow at scale. Vector databases solve this with Approximate Nearest Neighbor (ANN) algorithms that build index structures for sub-linear search time.
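Brute force is worth seeing once, because every ANN index is measured against it. A minimal NumPy sketch (function name and data are illustrative) — one distance computation per stored vector, O(n·d) per query:

```python
import numpy as np

def knn_brute_force(query, vectors, k=5):
    # Exact k-NN: compute the distance to every stored vector, keep the k smallest
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 128))   # 10k vectors of dimension 128
q = rng.normal(size=128)
top5 = knn_brute_force(q, db)
```

At 10,000 vectors this is instant; at a billion, the linear scan is what ANN indexes exist to avoid.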
Vector Indexing Algorithms
HNSW (Hierarchical Navigable Small World)
HNSW is the most popular indexing algorithm used by Pinecone, Weaviate, Qdrant, and pgvector. It builds a multi-layer graph where each node is a vector and edges connect nearby vectors.
- How it works: The top layer contains a sparse graph of "highway" connections for fast global navigation. Lower layers add more nodes and local connections. Search starts at the top and greedily descends through layers.
- Pros: Excellent recall (accuracy), fast query times, no training phase needed.
- Cons: High memory usage (stores the full graph in RAM), slow index build time.
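The greedy routing HNSW performs can be sketched at a single layer with a plain k-nearest-neighbor graph (a toy illustration of the descent step, not the full multi-layer algorithm):

```python
import numpy as np

def build_knn_graph(vectors, M=8):
    # Toy graph: connect each node to its M nearest neighbors (brute force).
    # Real HNSW builds its graph incrementally, with pruning heuristics,
    # across multiple layers.
    dists = np.linalg.norm(vectors[:, None] - vectors[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    return {i: list(np.argsort(dists[i])[:M]) for i in range(len(vectors))}

def greedy_search(vectors, graph, query, entry=0):
    # Greedy descent within one layer: hop to whichever neighbor is closer
    # to the query; stop when no neighbor improves on the current node.
    current = entry
    best = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for nbr in graph[current]:
            d = np.linalg.norm(vectors[nbr] - query)
            if d < best:
                current, best = nbr, d
                improved = True
    return current, best

rng = np.random.default_rng(0)
vecs = rng.normal(size=(200, 8))
graph = build_knn_graph(vecs)
q = rng.normal(size=8)
node, dist = greedy_search(vecs, graph, q)
```

The result is approximate: greedy routing can stop at a local minimum, which is why HNSW keeps a candidate list of size ef_search rather than a single current node.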
M controls the number of connections per node (higher = more accurate but more memory). ef_construction controls build-time accuracy; ef_search trades query-time accuracy against speed.
IVF (Inverted File Index)
IVF partitions vectors into clusters using k-means, then searches only the most relevant clusters during query time.
- How it works: During indexing, vectors are assigned to the nearest centroid. During search, the query is compared to centroids first, then only vectors in the closest nprobe clusters are searched.
- Pros: Lower memory than HNSW, good for very large datasets, well-suited for GPU acceleration.
- Cons: Requires training (k-means clustering), lower recall than HNSW at similar speed.
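The cluster-then-probe mechanics fit in a few lines of NumPy. This is a sketch, not a production index — the few hand-rolled k-means rounds stand in for proper training, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 8))
k = 16  # number of clusters (often called nlist)

# "Training": a few rounds of k-means to place the centroids
centroids = vectors[rng.choice(len(vectors), k, replace=False)].copy()
for _ in range(5):
    assign = np.argmin(
        np.linalg.norm(vectors[:, None] - centroids[None], axis=-1), axis=1)
    for c in range(k):
        if np.any(assign == c):
            centroids[c] = vectors[assign == c].mean(axis=0)

# Inverted lists: vector ids grouped under their nearest centroid
assign = np.argmin(
    np.linalg.norm(vectors[:, None] - centroids[None], axis=-1), axis=1)
inverted = {c: np.where(assign == c)[0] for c in range(k)}

def ivf_search(query, nprobe=4, top_k=5):
    # Compare the query to the k centroids first, then scan only the
    # vectors in the nprobe closest clusters
    probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([inverted[c] for c in probe])
    dists = np.linalg.norm(vectors[cand] - query, axis=1)
    return cand[np.argsort(dists)[:top_k]]

q = rng.normal(size=8)
results = ivf_search(q)
```

Raising nprobe scans more clusters: recall approaches exact search, at the cost of query time — with nprobe equal to k, it *is* exact search.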
LSH (Locality Sensitive Hashing)
LSH uses hash functions that map similar vectors to the same bucket with high probability.
- How it works: Multiple hash functions create "signatures" for each vector. Similar vectors collide in the same bucket. Search only compares vectors within matching buckets.
- Pros: Very fast for high-dimensional data, easy to distribute.
- Cons: Lower recall than HNSW and IVF, requires careful tuning of hash functions.
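For cosine-style similarity, the standard construction is random-hyperplane LSH: the signature is the sign pattern of projections onto random hyperplanes. A minimal single-table sketch (real systems use multiple tables to boost recall; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 16, 12

# Random hyperplanes; each contributes one bit of the signature
planes = rng.normal(size=(n_planes, dim))

def lsh_signature(v):
    # Which side of each hyperplane the vector falls on
    return tuple((planes @ v > 0).astype(int))

# Index: group vector ids into buckets by signature
vectors = rng.normal(size=(500, dim))
buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(lsh_signature(v), []).append(i)

def lsh_query(q, top_k=3):
    # Only compare against vectors sharing the query's bucket
    cand = buckets.get(lsh_signature(q), [])
    if not cand:
        return []
    d = np.linalg.norm(vectors[cand] - q, axis=1)
    return [cand[j] for j in np.argsort(d)[:top_k]]
```

Nearby vectors are likely (not guaranteed) to share a signature — hence the lower recall, and the need to tune the number of bits and tables.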
Product Quantization (PQ)
PQ compresses vectors by splitting them into sub-vectors and quantizing each independently. This is often combined with IVF (IVF-PQ) to reduce memory usage dramatically.
- How it works: A 1536-dimension vector is split into, say, 96 sub-vectors of 16 dimensions each. Each sub-vector is replaced by the index of its nearest centroid from a codebook.
- Pros: 10–100x memory reduction, enables billion-scale search on a single machine.
- Cons: Lossy compression — reduces recall accuracy. Requires training codebooks.
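The encode/decode mechanics of PQ, using the 1536-dimension example above. Random codebooks are used here purely to illustrate the shapes — real codebooks are trained with k-means per sub-space:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_sub = 1536, 96          # 96 sub-vectors of 16 dimensions each
sub_dim = dim // n_sub
n_centroids = 256              # 256 centroids -> one byte per code

# Illustrative random codebooks (normally trained with k-means per sub-space)
codebooks = rng.normal(size=(n_sub, n_centroids, sub_dim))

def pq_encode(v):
    # Replace each sub-vector by the index of its nearest codebook centroid
    subs = v.reshape(n_sub, sub_dim)
    return np.array([
        np.argmin(np.linalg.norm(codebooks[m] - subs[m], axis=1))
        for m in range(n_sub)
    ], dtype=np.uint8)

def pq_decode(codes):
    # Lossy reconstruction: concatenate the chosen centroids
    return np.concatenate([codebooks[m][codes[m]] for m in range(n_sub)])

v = rng.normal(size=dim)
codes = pq_encode(v)
# 96 bytes of codes vs 6144 bytes of float32 -- a 64x reduction
```

The reconstruction `pq_decode(codes)` only approximates `v`, which is exactly where the recall loss comes from.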
Distance Metrics
Distance metrics determine how "closeness" between vectors is calculated. The choice of metric matters and should match how your embedding model was trained.
| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine Similarity | A · B / (‖A‖ ‖B‖) | -1 to 1 | Text embeddings (most common) |
| Euclidean (L2) | sqrt(sum((A - B)^2)) | 0 to ∞ | Image embeddings, spatial data |
| Dot Product | sum(A * B) | -∞ to ∞ | Normalized vectors, recommendation |
All three metrics in NumPy, for a pair of small example vectors:

```python
import numpy as np

a = np.array([0.1, 0.3, 0.5, 0.7])
b = np.array([0.2, 0.4, 0.4, 0.6])

# Cosine similarity
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cosine_sim:.4f}")  # 0.9773

# Euclidean distance
euclidean = np.linalg.norm(a - b)
print(f"Euclidean distance: {euclidean:.4f}")  # 0.2000

# Dot product
dot = np.dot(a, b)
print(f"Dot product: {dot:.4f}")  # 0.7600
```
Trade-offs: Accuracy vs Speed vs Memory
| Algorithm | Query Speed | Recall (Accuracy) | Memory Usage | Build Time |
|---|---|---|---|---|
| Brute Force | Slow | 100% (exact) | Low | None |
| HNSW | Very Fast | 95–99% | High | Slow |
| IVF | Fast | 90–97% | Medium | Medium |
| IVF-PQ | Fast | 85–95% | Very Low | Medium |
| LSH | Very Fast | 80–90% | Medium | Fast |
Filtering with Metadata
Real applications rarely search only by vector similarity. You almost always need to combine similarity search with metadata filters (e.g., "find similar products, but only in the Electronics category, priced under $100").
Vector databases support two filtering strategies:
- Pre-filtering: Apply metadata filters first, then search only the matching vectors. Guarantees all results match the filter, but may reduce search quality if too few vectors remain.
- Post-filtering: Search all vectors first, then remove results that do not match the filter. Better search quality, but may return fewer than k results.
```python
# Pseudo-code for filtered similarity search
results = vector_db.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "category": "electronics",
        "price": {"$lt": 100},
        "in_stock": True
    }
)
# Returns the 10 most similar vectors that also
# match ALL metadata filter conditions
```
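The difference between the two strategies is easy to see with brute-force search over a toy dataset (all names and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8))
prices = rng.integers(10, 200, size=100)   # one metadata field per vector

def pre_filter_search(q, top_k=5, max_price=100):
    # Pre-filtering: restrict to matching ids first, then rank by distance.
    # Every result matches the filter, but the candidate pool shrinks.
    ids = np.where(prices < max_price)[0]
    d = np.linalg.norm(vectors[ids] - q, axis=1)
    return ids[np.argsort(d)[:top_k]]

def post_filter_search(q, top_k=5, max_price=100):
    # Post-filtering: rank everything, then drop non-matching hits --
    # may return fewer than top_k results.
    order = np.argsort(np.linalg.norm(vectors - q, axis=1))[:top_k]
    return order[prices[order] < max_price]

q = rng.normal(size=8)
pre = pre_filter_search(q)
post = post_filter_search(q)
```

With a selective filter, post-filtering can come back nearly empty — which is why most modern vector databases integrate the filter into the index traversal itself.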
💡 Quick Quiz
If you have 100 million vectors and need sub-10ms query latency with 95%+ recall, which indexing algorithm would you choose and why?