Beginner

How Vector Databases Work

Dive into the algorithms and data structures that make fast similarity search possible — indexing strategies, distance metrics, and the trade-offs you need to understand.

The Core Problem: Nearest Neighbor Search

Given a query vector, find the k closest vectors in a dataset of millions or billions. A brute-force approach compares the query to every stored vector — accurate but impossibly slow at scale. Vector databases solve this with Approximate Nearest Neighbor (ANN) algorithms that build index structures for sub-linear search time.
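The brute-force baseline is worth seeing in code before looking at the index structures that replace it. A minimal NumPy sketch on a randomly generated toy dataset (the sizes and names here are illustrative, not from any particular library) — each query costs one distance computation per stored vector, i.e. O(n · d):

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 128))   # toy dataset: 10k vectors, 128 dims
query = rng.normal(size=128)

# Brute force: compare the query against every stored vector
distances = np.linalg.norm(vectors - query, axis=1)

k = 5
nearest_ids = np.argsort(distances)[:k]    # indices of the k closest vectors
print(nearest_ids)
```

This is exact — it always returns the true nearest neighbors — but the cost grows linearly with the dataset, which is what ANN indexes are built to avoid.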

Vector Indexing Algorithms

HNSW (Hierarchical Navigable Small World)

HNSW is the most widely used indexing algorithm, adopted by Pinecone, Weaviate, Qdrant, and pgvector. It builds a multi-layer graph where each node is a vector and edges connect nearby vectors.

  • How it works: The top layer contains a sparse graph of "highway" connections for fast global navigation. Lower layers add more nodes and local connections. Search starts at the top and greedily descends through layers.
  • Pros: Excellent recall (accuracy), fast query times, no training phase needed.
  • Cons: High memory usage (stores the full graph in RAM), slow index build time.
💡 HNSW parameters: M controls the number of connections per node (higher = more accurate but more memory). ef_construction controls build-time accuracy. ef_search controls the query-time accuracy vs. speed trade-off.
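The greedy navigation step at the heart of HNSW can be sketched on a flat proximity graph. This toy version (random data, a single layer, naive all-pairs graph construction) omits the hierarchy and the ef-bounded candidate lists that real HNSW uses, but shows the core idea of hopping to ever-closer neighbors:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 32))

# Crude proximity graph: connect each node to its M nearest neighbors.
# (HNSW builds this incrementally across multiple layers; this flat
# version only illustrates the greedy-search step.)
M = 8
dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
neighbors = np.argsort(dists, axis=1)[:, 1:M + 1]   # skip self at index 0

def greedy_search(query, entry=0):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        cand = neighbors[current]
        cand_dists = np.linalg.norm(vectors[cand] - query, axis=1)
        best = cand_dists.argmin()
        if cand_dists[best] >= current_dist:
            return current          # local minimum: no neighbor is closer
        current, current_dist = cand[best], cand_dists[best]

query = rng.normal(size=32)
found = greedy_search(query)
exact = np.linalg.norm(vectors - query, axis=1).argmin()
print(found, exact)  # greedy often, but not always, matches the exact answer
```

The possibility of getting stuck in a local minimum is exactly why HNSW adds the sparse upper layers: they let the search make long-range hops before refining locally.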

IVF (Inverted File Index)

IVF partitions vectors into clusters using k-means, then searches only the most relevant clusters during query time.

  • How it works: During indexing, vectors are assigned to the nearest centroid. During search, the query is compared to centroids first, then only vectors in the closest nprobe clusters are searched.
  • Pros: Lower memory than HNSW, good for very large datasets, well-suited for GPU acceleration.
  • Cons: Requires training (k-means clustering), lower recall than HNSW at similar speed.
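The two IVF phases — train/assign, then probe — fit in a short NumPy sketch. The hand-rolled k-means below (a few Lloyd iterations on toy data) stands in for the proper training a real library performs; all sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(5_000, 64))

# --- Training: k-means to pick centroids (a few Lloyd iterations) ---
n_clusters = 16
centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
for _ in range(10):
    assign = np.linalg.norm(vectors[:, None] - centroids[None], axis=2).argmin(axis=1)
    for c in range(n_clusters):
        members = vectors[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Final assignment against the trained centroids
assign = np.linalg.norm(vectors[:, None] - centroids[None], axis=2).argmin(axis=1)

# --- Indexing: an inverted list of vector ids per centroid ---
inverted_lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}

# --- Search: probe only the nprobe closest clusters ---
def ivf_search(query, k=5, nprobe=4):
    probe = np.linalg.norm(centroids - query, axis=1).argsort()[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    d = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[d.argsort()[:k]]

query = rng.normal(size=64)
print(ivf_search(query))
```

Raising nprobe trades speed for recall: at nprobe equal to the number of clusters, the search degenerates to brute force and becomes exact again.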

LSH (Locality Sensitive Hashing)

LSH uses hash functions that map similar vectors to the same bucket with high probability.

  • How it works: Multiple hash functions create "signatures" for each vector. Similar vectors collide in the same bucket. Search only compares vectors within matching buckets.
  • Pros: Very fast for high-dimensional data, easy to distribute.
  • Cons: Lower recall than HNSW and IVF, requires careful tuning of hash functions.
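A common concrete instance of LSH for cosine similarity is random hyperplane hashing: each bit of the signature records which side of a random hyperplane the vector falls on. A minimal single-table sketch on toy data (real systems use many tables to recover vectors that fall into the "wrong" bucket):

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.normal(size=(2_000, 32))

# Each hash function is a random hyperplane; the bit records which side
# of the plane the vector lies on. Similar vectors tend to share bits.
n_bits = 12
planes = rng.normal(size=(n_bits, 32))

def signature(v):
    return tuple((planes @ v > 0).astype(int))

# Build buckets: vectors with identical signatures land together.
buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(signature(v), []).append(i)

def lsh_search(query, k=5):
    ids = buckets.get(signature(query), [])
    if not ids:
        return []   # empty bucket: real systems fall back to other hash tables
    d = np.linalg.norm(vectors[ids] - query, axis=1)
    return [ids[j] for j in d.argsort()[:k]]

# Querying with an already-indexed vector is guaranteed to hit its own bucket
print(lsh_search(vectors[0]))
```

The tuning pain mentioned above lives in n_bits and the number of tables: more bits mean smaller, purer buckets (faster, lower recall); more tables mean more chances to find a collision (slower, higher recall).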

Product Quantization (PQ)

PQ compresses vectors by splitting them into sub-vectors and quantizing each independently. This is often combined with IVF (IVF-PQ) to reduce memory usage dramatically.

  • How it works: A 1536-dimension vector is split into, say, 96 sub-vectors of 16 dimensions each. Each sub-vector is replaced by the index of its nearest centroid from a codebook.
  • Pros: 10–100x memory reduction, enables billion-scale search on a single machine.
  • Cons: Lossy compression — reduces recall accuracy. Requires training codebooks.

Distance Metrics

Distance metrics determine how "closeness" between vectors is calculated. The choice of metric matters and should match how your embedding model was trained.

Metric              Formula                    Range      Best For
Cosine Similarity   A · B / (||A|| * ||B||)    -1 to 1    Text embeddings (most common)
Euclidean (L2)      sqrt(sum((A - B)^2))       0 to ∞     Image embeddings, spatial data
Dot Product         sum(A * B)                 -∞ to ∞    Normalized vectors, recommendation
Python - Computing Distance Metrics
import numpy as np

a = np.array([0.1, 0.3, 0.5, 0.7])
b = np.array([0.2, 0.4, 0.4, 0.6])

# Cosine similarity
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cosine_sim:.4f}")  # 0.9773

# Euclidean distance
euclidean = np.linalg.norm(a - b)
print(f"Euclidean distance: {euclidean:.4f}")  # 0.2000

# Dot product
dot = np.dot(a, b)
print(f"Dot product: {dot:.4f}")  # 0.7600
Which metric to use? If your vectors are normalized (unit length), cosine similarity and dot product give the same ranking. Most text embedding models (OpenAI, Cohere, Voyage) produce normalized vectors, so cosine similarity is the standard choice. Check your model's documentation.
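The equivalence claim for normalized vectors is easy to check empirically. A small sketch on toy random vectors (names and sizes are illustrative) that normalizes everything to unit length and compares the rankings the two metrics produce:

```python
import numpy as np

rng = np.random.default_rng(4)
vectors = rng.normal(size=(100, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit length
query = rng.normal(size=8)
query /= np.linalg.norm(query)

dot = vectors @ query
cosine = dot / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

# With unit-length vectors the denominators are all 1, so the two
# metrics produce identical rankings.
print(np.array_equal(cosine.argsort(), dot.argsort()))  # True
```

Euclidean distance is also rank-equivalent here, since for unit vectors ||a - b||^2 = 2 - 2(a · b): as the dot product goes up, the distance goes down. This is why a database can often pick whichever metric is cheapest to compute once vectors are normalized.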

Trade-offs: Accuracy vs Speed vs Memory

Algorithm     Query Speed   Recall (Accuracy)   Memory Usage   Build Time
Brute Force   Slow          100% (exact)        Low            None
HNSW          Very Fast     95–99%              High           Slow
IVF           Fast          90–97%              Medium         Medium
IVF-PQ        Fast          85–95%              Very Low       Medium
LSH           Very Fast     80–90%              Medium         Fast

Filtering with Metadata

Real applications rarely search only by vector similarity. You almost always need to combine similarity search with metadata filters (e.g., "find similar products, but only in the Electronics category, priced under $100").

Vector databases support two filtering strategies:

  • Pre-filtering: Apply metadata filters first, then search only the matching vectors. Guarantees all results match the filter, but may reduce search quality if too few vectors remain.
  • Post-filtering: Search all vectors first, then remove results that do not match the filter. The similarity ranking itself is unaffected by the filter, but if many of the top matches fail it, you may end up with fewer than k results.
Conceptual - Filtered Vector Search
# Pseudo-code for filtered similarity search
results = vector_db.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "category": "electronics",
        "price": {"$lt": 100},
        "in_stock": True
    }
)
# Returns the 10 most similar vectors that also
# match ALL metadata filter conditions
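The difference between the two strategies can be made concrete with a toy NumPy sketch (the `in_stock` flag and all sizes are hypothetical, chosen just to illustrate the trade-off):

```python
import numpy as np

rng = np.random.default_rng(5)
vectors = rng.normal(size=(1_000, 16))
in_stock = rng.random(1_000) < 0.3        # hypothetical metadata flag
query = rng.normal(size=16)
k = 10

# Pre-filtering: restrict to matching vectors, then rank by distance.
ids = np.where(in_stock)[0]
d = np.linalg.norm(vectors[ids] - query, axis=1)
pre = ids[d.argsort()[:k]]                # k results, all matching the filter

# Post-filtering: rank everything, then drop non-matching results.
order = np.linalg.norm(vectors - query, axis=1).argsort()[:k]
post = [i for i in order if in_stock[i]]  # may return fewer than k
print(len(pre), len(post))
```

The more selective the filter, the more post-filtering under-delivers (most of the global top-k gets discarded) and the more attractive pre-filtering becomes; production databases often blend the two or over-fetch candidates before filtering.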

💡 Quick Quiz

If you have 100 million vectors and need sub-10ms query latency with 95%+ recall, which indexing algorithm would you choose and why?

HNSW is generally the best choice here. Its 95–99% recall at very fast query speeds makes it ideal, provided you have enough RAM. Most production vector databases default to HNSW for this reason.