Vector Databases & Retrieval
Vector databases are the backbone of retrieval-augmented generation (RAG) and multimodal AI applications. They store high-dimensional embeddings and enable fast similarity search across millions of documents, images, and other data types.
Why Vector Databases Matter
LLMs have knowledge cutoffs and limited context windows. Vector databases solve this by giving your AI application access to external knowledge through semantic search:
- Semantic search: Find documents by meaning, not just keywords. "How do I fix a memory leak?" matches "debugging OOM errors in Java" even though they share no words.
- RAG pipelines: Retrieve relevant context before generating a response, grounding the LLM in factual data.
- Multimodal search: Search images with text, or find similar audio clips, using shared embedding spaces like CLIP.
- Recommendation: Find similar items (products, articles, users) based on embedding similarity.
- Deduplication: Detect near-duplicate content by comparing embedding distances.
How Vector Search Works
Vector search finds the nearest neighbors to a query vector in high-dimensional space. Since exact nearest neighbor search is too slow for large datasets, vector databases use Approximate Nearest Neighbor (ANN) algorithms:
ANN Algorithms
- HNSW (Hierarchical Navigable Small World): The most popular algorithm. Builds a multi-layer graph where each layer is a "small world" network. Offers excellent recall (accuracy) and query speed. Used by Qdrant, Weaviate, pgvector, and most others.
- IVF (Inverted File Index): Partitions vectors into clusters using k-means. Queries only search the nearest clusters. Faster indexing than HNSW but slightly lower recall. Used by Milvus and FAISS.
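- PQ (Product Quantization): Compresses vectors by splitting them into sub-vectors and quantizing each against a small codebook. Reduces memory usage 10-50x at the cost of some accuracy. For example, a 1536-dimensional float32 vector takes 6,144 bytes; encoded as 192 one-byte codes it takes 192 bytes, a 32x reduction (ignoring the shared codebooks). Often combined with IVF (IVF-PQ).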
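To make the IVF idea concrete, here is a minimal pure-Python sketch, not how Milvus or FAISS implement it. It clusters vectors with a few rounds of k-means, keeps one inverted list per centroid, and at query time scans only the `nprobe` closest lists. The function names (`build_ivf`, `ivf_search`, `exact_search`), the toy data, and all parameter values are illustrative assumptions.

```python
import math
import random

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_search(vectors, query, k):
    """Brute-force baseline: score every vector and return the k nearest ids."""
    return sorted(range(len(vectors)), key=lambda i: l2(vectors[i], query))[:k]

def assign(vectors, centroids):
    """Put each vector id into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def build_ivf(vectors, n_clusters, n_iters=10, seed=0):
    """Toy IVF index: k-means centroids plus one inverted list per centroid."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(n_iters):
        lists = assign(vectors, centroids)
        for c, members in enumerate(lists):
            if members:  # an empty cluster keeps its previous centroid
                columns = zip(*(vectors[i] for i in members))
                centroids[c] = [sum(col) / len(members) for col in columns]
    return centroids, assign(vectors, centroids)

def ivf_search(vectors, centroids, lists, query, k, nprobe=2):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: l2(vectors[i], query))
    return candidates[:k]

# Toy data: 500 random 8-dimensional vectors (illustrative only)
rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(500)]
query = [rng.random() for _ in range(8)]

centroids, lists = build_ivf(data, n_clusters=10)
approx = ivf_search(data, centroids, lists, query, k=5, nprobe=3)
exact = exact_search(data, query, k=5)
recall = len(set(approx) & set(exact)) / len(exact)
print(f"recall@5 with nprobe=3: {recall:.2f}")
```

Raising `nprobe` scans more lists, which trades speed for recall; with `nprobe` equal to the number of clusters the search degenerates into the exact scan.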
Distance Metrics
| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine Similarity | cos(A, B) = A·B / (‖A‖ × ‖B‖) | -1 to 1 | Text embeddings (most common) |
| Euclidean (L2) | d = sqrt(sum((a-b)²)) | 0 to ∞ | Image embeddings, spatial data |
| Dot Product | A·B = sum(a×b) | -∞ to ∞ | Normalized embeddings (equivalent to cosine) |
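The three metrics in the table can be written out in a few lines of plain Python. This is a sketch for illustration; the helper names are my own, and production code would normally use NumPy or the database's built-in operators. It also checks the claim that dot product and cosine similarity agree once vectors are normalized.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    """L2 distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a to unit length."""
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [4.0, 3.0]

print(cosine_similarity(a, b))           # 24 / (5 * 5) = 0.96
print(euclidean(a, b))                   # sqrt(1 + 1), about 1.414
print(dot(normalize(a), normalize(b)))   # about 0.96, same as the cosine
```

For unit-length vectors the squared L2 distance equals 2 − 2·cos(A, B), so all three metrics produce the same ranking on normalized embeddings.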
Vector Database Comparison
| Database | Type | Scale | Filtering | Hosting | Pricing |
|---|---|---|---|---|---|
| Pinecone | Managed cloud | Billions | Rich metadata filters | Fully managed | Free tier + pay-per-use |
| Weaviate | Open source / Cloud | Billions | GraphQL + filters | Self-hosted or Weaviate Cloud | Open source / Cloud pricing |
| ChromaDB | Open source | Millions | Metadata filters | Self-hosted (embedded or server) | Free (open source) |
| Qdrant | Open source / Cloud | Billions | Advanced payload filters | Self-hosted or Qdrant Cloud | Open source / Cloud pricing |
| Milvus | Open source / Cloud | Billions+ | Scalar + vector filters | Self-hosted or Zilliz Cloud | Open source / Cloud pricing |
| pgvector | PostgreSQL extension | Millions | Full SQL | Any PostgreSQL host | Free (extension) |
Embedding Pipelines
Before storing data in a vector database, you must convert it to embeddings using a model. The pipeline is: raw data → chunking → embedding model → vector database.

    import chromadb
    from chromadb.utils import embedding_functions

    # Initialize ChromaDB (persistent storage)
    client = chromadb.PersistentClient(path="./chroma_data")

    # Use OpenAI embeddings (or any supported model)
    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key="your-openai-api-key",
        model_name="text-embedding-3-small",
    )

    # Create a collection
    collection = client.get_or_create_collection(
        name="technical_docs",
        embedding_function=openai_ef,
        metadata={"hnsw:space": "cosine"},  # Use cosine similarity
    )

    # Add documents with metadata
    collection.add(
        documents=[
            "vLLM uses PagedAttention to manage GPU memory efficiently for LLM inference.",
            "ChromaDB is an open-source embedding database designed for AI applications.",
            "HNSW is a graph-based algorithm for approximate nearest neighbor search.",
            "Quantization reduces model size by lowering numerical precision from FP16 to INT4.",
            "RAG combines retrieval with generation to ground LLM outputs in factual data.",
            "BentoML packages ML models as production-ready API services.",
            "Kubernetes autoscaling adjusts pod count based on CPU, memory, or custom metrics.",
            "Cosine similarity measures the angle between two vectors, ignoring magnitude.",
        ],
        ids=[f"doc_{i}" for i in range(8)],
        metadatas=[
            {"category": "inference", "difficulty": "advanced"},
            {"category": "database", "difficulty": "beginner"},
            {"category": "algorithms", "difficulty": "intermediate"},
            {"category": "optimization", "difficulty": "advanced"},
            {"category": "rag", "difficulty": "intermediate"},
            {"category": "serving", "difficulty": "intermediate"},
            {"category": "infrastructure", "difficulty": "advanced"},
            {"category": "algorithms", "difficulty": "beginner"},
        ],
    )

    # Query: semantic search
    results = collection.query(
        query_texts=["How do I serve large language models efficiently?"],
        n_results=3,
    )
    print("Top 3 results:")
    # Distances are cosine distances here, so lower means more similar
    for doc, dist in zip(results["documents"][0], results["distances"][0]):
        print(f"  [{dist:.3f}] {doc}")

    # Query with metadata filtering
    filtered_results = collection.query(
        query_texts=["search algorithms"],
        n_results=3,
        where={"difficulty": "beginner"},  # Only beginner-level docs
    )

    # Query with combined filters
    advanced_results = collection.query(
        query_texts=["GPU optimization"],
        n_results=5,
        where={
            "$and": [
                {"category": {"$in": ["inference", "optimization"]}},
                {"difficulty": "advanced"},
            ]
        },
    )

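The pipeline described above includes a chunking step that the ChromaDB example does not need, because its documents are already single sentences. For longer documents, one common approach is fixed-size chunks with some overlap so that sentences spanning a boundary still appear intact in one chunk. The sketch below is one possible version; `chunk_text`, the word-based splitting, and the default sizes are assumptions for illustration, not values recommended by any particular library.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to chunk_size words, each sharing `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Tiny demonstration with 10 "words", 4 words per chunk, 1 word of overlap
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Each chunk would then be passed to the embedding model and stored with metadata pointing back to its source document, so that retrieved chunks can be traced to their origin.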
Hybrid Search
Pure vector search can miss keyword-specific matches (e.g., exact product names, error codes). Hybrid search combines vector similarity with keyword/BM25 search for better recall:

    import weaviate
    from weaviate.classes.config import Configure, DataType, Property
    from weaviate.classes.query import HybridFusion, MetadataQuery

    # Connect to Weaviate
    client = weaviate.connect_to_local()  # or weaviate.connect_to_weaviate_cloud(...)

    # Create a collection with both vector and keyword indexing
    client.collections.create(
        name="TechArticle",
        vectorizer_config=Configure.Vectorizer.text2vec_openai(
            model="text-embedding-3-small",
        ),
        properties=[
            Property(name="title", data_type=DataType.TEXT),
            Property(name="content", data_type=DataType.TEXT),
            Property(name="category", data_type=DataType.TEXT),
            Property(name="date", data_type=DataType.DATE),
        ],
    )

    # Add documents
    articles = client.collections.get("TechArticle")
    articles.data.insert_many([
        {"title": "PagedAttention in vLLM",
         "content": "vLLM implements PagedAttention to manage KV cache memory...",
         "category": "inference"},
        {"title": "CUDA Error OOM Debugging",
         "content": "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB...",
         "category": "debugging"},
        {"title": "Embedding Models Comparison",
         "content": "text-embedding-3-small vs voyage-3 vs cohere-embed-v3...",
         "category": "embeddings"},
    ])

    # Hybrid search: combines BM25 keyword + vector similarity
    results = articles.query.hybrid(
        query="CUDA out of memory error",
        alpha=0.5,  # 0 = pure keyword, 1 = pure vector, 0.5 = balanced
        fusion_type=HybridFusion.RELATIVE_SCORE,
        limit=5,
        return_metadata=MetadataQuery(score=True, explain_score=True),
    )
    for obj in results.objects:
        print(f"Score: {obj.metadata.score:.3f} | {obj.properties['title']}")
        print(f"  {obj.properties['content'][:100]}...")

    # Pure vector search (semantic only)
    vector_results = articles.query.near_text(
        query="How to fix GPU memory issues",
        limit=5,
    )

    # Pure keyword search (BM25 only)
    keyword_results = articles.query.bm25(
        query="CUDA OOM RuntimeError",
        limit=5,
    )

    client.close()

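To see what the `alpha` weighting does, it can help to fuse two score lists by hand. The sketch below is a simplified illustration of score-based fusion, not Weaviate's actual implementation: it rescales each result set to the 0-1 range and then blends them with `alpha`. The function names and the example scores are made up for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_fuse(keyword_scores, vector_scores, alpha=0.5):
    """Blend normalized scores: alpha weights the vector side, (1 - alpha) the keyword side."""
    kw = min_max(keyword_scores)
    vec = min_max(vector_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw scores: BM25 scores are unbounded, vector scores are similarities
keyword_scores = {"oom_debugging": 12.0, "paged_attention": 3.0}
vector_scores = {"oom_debugging": 0.82, "paged_attention": 0.80, "embedding_models": 0.40}

for doc_id, score in hybrid_fuse(keyword_scores, vector_scores, alpha=0.5):
    print(f"{score:.3f}  {doc_id}")
```

In this toy data the two documents are nearly tied on vector similarity, and the keyword signal is what separates the exact error-message match from the rest.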
pgvector: Vector Search in PostgreSQL
If you already use PostgreSQL, pgvector adds vector search without introducing a new database. This is ideal for applications where vector search is one feature among many relational data needs:

    -- Enable the extension
    CREATE EXTENSION IF NOT EXISTS vector;

    -- Create a table with a vector column
    CREATE TABLE documents (
        id SERIAL PRIMARY KEY,
        title TEXT NOT NULL,
        content TEXT NOT NULL,
        category TEXT,
        created_at TIMESTAMPTZ DEFAULT NOW(),
        embedding vector(1536)  -- OpenAI text-embedding-3-small dimension
    );

    -- Create an HNSW index for fast similarity search
    CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

    -- Insert a document with its embedding
    INSERT INTO documents (title, content, category, embedding)
    VALUES (
        'vLLM Performance Guide',
        'vLLM achieves high throughput using PagedAttention...',
        'inference',
        '[0.023, -0.041, 0.078, ...]'::vector  -- 1536-dim embedding
    );

    -- Semantic search: find 5 most similar documents
    -- (<=> is cosine distance, so 1 - distance gives cosine similarity)
    SELECT id, title, content,
           1 - (embedding <=> '[query_embedding_here]'::vector) AS similarity
    FROM documents
    ORDER BY embedding <=> '[query_embedding_here]'::vector
    LIMIT 5;

    -- Combined: vector search + SQL filtering
    SELECT id, title, content,
           1 - (embedding <=> '[query_embedding]'::vector) AS similarity
    FROM documents
    WHERE category = 'inference'
      AND created_at > NOW() - INTERVAL '30 days'
    ORDER BY embedding <=> '[query_embedding]'::vector
    LIMIT 10;

The same operations from Python, using asyncpg:

    import asyncpg
    import numpy as np
    from typing import Optional
    from openai import AsyncOpenAI
    from pgvector.asyncpg import register_vector

    # Async client, so embedding calls do not block the event loop
    openai_client = AsyncOpenAI()

    async def setup_db():
        conn = await asyncpg.connect(
            "postgresql://user:pass@localhost:5432/mydb"
        )
        await register_vector(conn)  # teach asyncpg to encode/decode the vector type
        return conn

    async def embed(text: str) -> np.ndarray:
        # Generate an embedding for a single piece of text
        response = await openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return np.array(response.data[0].embedding)

    async def embed_and_store(conn, title: str, content: str, category: str):
        embedding = await embed(content)
        # Store in PostgreSQL
        await conn.execute(
            "INSERT INTO documents (title, content, category, embedding) VALUES ($1, $2, $3, $4)",
            title, content, category, embedding,
        )

    async def search(conn, query: str, category: Optional[str] = None, limit: int = 5):
        # Embed the query
        query_embedding = await embed(query)

        # Search with optional category filter
        if category:
            rows = await conn.fetch(
                """SELECT id, title, content,
                          1 - (embedding <=> $1) AS similarity
                   FROM documents
                   WHERE category = $2
                   ORDER BY embedding <=> $1
                   LIMIT $3""",
                query_embedding, category, limit,
            )
        else:
            rows = await conn.fetch(
                """SELECT id, title, content,
                          1 - (embedding <=> $1) AS similarity
                   FROM documents
                   ORDER BY embedding <=> $1
                   LIMIT $2""",
                query_embedding, limit,
            )
        return rows

Multimodal Search with CLIP
CLIP (Contrastive Language-Image Pre-training) creates a shared embedding space for text and images. This enables searching images with text queries and vice versa:

    import glob
    import os

    import torch
    from PIL import Image
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams
    from transformers import CLIPModel, CLIPProcessor

    # Load CLIP model
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Initialize Qdrant
    qdrant = QdrantClient(url="http://localhost:6333")
    qdrant.create_collection(
        collection_name="images",
        vectors_config=VectorParams(size=512, distance=Distance.COSINE),
    )

    # Index images
    image_files = glob.glob("./images/*.jpg")
    points = []
    for idx, img_path in enumerate(image_files):
        image = Image.open(img_path).convert("RGB")  # CLIP expects 3-channel input
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            embedding = model.get_image_features(**inputs)
        embedding = embedding / embedding.norm(dim=-1, keepdim=True)  # Normalize
        points.append(PointStruct(
            id=idx,
            vector=embedding[0].tolist(),
            payload={"path": img_path, "filename": os.path.basename(img_path)},
        ))
    qdrant.upsert(collection_name="images", points=points)

    # Search images with a text query
    def search_images(text_query: str, limit: int = 5):
        inputs = processor(text=[text_query], return_tensors="pt", padding=True)
        with torch.no_grad():
            text_embedding = model.get_text_features(**inputs)
        text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
        # query_points is available in qdrant-client 1.10+; on older versions use
        # qdrant.search(collection_name=..., query_vector=..., limit=...) instead
        response = qdrant.query_points(
            collection_name="images",
            query=text_embedding[0].tolist(),
            limit=limit,
        )
        return [(hit.payload["filename"], hit.score) for hit in response.points]

    # Find images of "a dog playing in the snow"
    matches = search_images("a dog playing in the snow")
    for filename, score in matches:
        print(f"  {score:.3f} - {filename}")

Reranking
Vector search retrieves candidates quickly but imperfectly. Reranking uses a more expensive cross-encoder model to re-score and re-order the initial results for higher precision:

    import cohere

    co = cohere.Client("your-cohere-api-key")

    def search_with_reranking(query: str, collection, top_k: int = 20, final_k: int = 5):
        # Step 1: Broad vector search (retrieve more than needed)
        results = collection.query(
            query_texts=[query],
            n_results=top_k,  # Retrieve 20 candidates
        )
        candidates = results["documents"][0]

        # Step 2: Rerank with Cohere
        reranked = co.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=candidates,
            top_n=final_k,  # Return top 5 after reranking
        )

        # Step 3: Return reranked results
        # result.index points back into the candidate list; the response only
        # includes document text when return_documents=True is requested
        final_docs = []
        for result in reranked.results:
            final_docs.append({
                "text": candidates[result.index],
                "relevance_score": result.relevance_score,
                "original_index": result.index,
            })
        return final_docs

    # Without reranking: vector search returns decent results
    # With reranking: significantly better precision, especially for nuanced queries
    # Typical improvement: +10-30% relevance on standard benchmarks

Performance Tuning
Optimizing vector database performance for production workloads:
Index Parameters
- HNSW m (connections per node): Higher values (32-64) improve recall but increase memory and build time. The default of 16 is good for most cases.
- HNSW ef_construction: Higher values (200-500) build a better index but take longer. The cost is paid at build time; query latency is governed by ef_search.
- HNSW ef_search: Higher values improve recall at query time but increase latency. Tune this based on your recall vs. speed requirements.
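Tuning these parameters requires a way to measure recall. A common approach is to compare the index's results against exact (brute-force) results for a sample of queries, then pick the smallest `ef_search` that meets the target. The sketch below assumes the ANN and exact result lists have already been collected; the function names and the measurement values are hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the ANN index also returned in its top k."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)

def mean_recall(approx_results, exact_results, k):
    """Average recall@k over a batch of queries (two parallel lists of id lists)."""
    pairs = list(zip(approx_results, exact_results))
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(pairs)

def pick_ef_search(measured_recall, target_recall):
    """Return the smallest ef_search whose measured recall meets the target, or None."""
    for ef in sorted(measured_recall):
        if measured_recall[ef] >= target_recall:
            return ef
    return None

# One query: the index found 3 of the 4 true nearest neighbors
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))  # 0.75

# Hypothetical recall measurements for three ef_search settings
measured = {16: 0.85, 64: 0.95, 256: 0.99}
print(pick_ef_search(measured, target_recall=0.95))  # 64
```

Latency should be recorded alongside recall for each setting, since the point of the exercise is to find the cheapest configuration that is accurate enough.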
Batch Operations
- Always insert vectors in batches (100-1000 at a time), not one by one.
- Use async/parallel embedding generation to saturate your embedding model.
- Pre-compute embeddings offline for large datasets rather than computing at insert time.
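A small batching helper covers the first point. This is a generic sketch: `upsert_fn` stands in for whatever bulk-insert call your database client provides, and the batch size of 500 is only an example within the range mentioned above.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def insert_in_batches(records, upsert_fn, batch_size=500):
    """Call upsert_fn once per batch and return the number of calls made."""
    calls = 0
    for batch in batched(records, batch_size):
        upsert_fn(batch)
        calls += 1
    return calls

print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]

# Stand-in for a real client call: just collect the batches
received = []
n_calls = insert_in_batches(list(range(1050)), received.append, batch_size=500)
print(n_calls, [len(b) for b in received])  # 3 [500, 500, 50]
```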
Connection Pooling
- Use connection pools for database clients (especially pgvector with asyncpg).
- Keep persistent connections to managed services (Pinecone, Weaviate Cloud).
- Set appropriate timeouts and retry logic for network-based vector databases.
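For the retry logic in the last point, exponential backoff with jitter is a common pattern. The sketch below is library-agnostic; `with_retries`, its defaults, and the choice of retryable exceptions are assumptions, and a real client may raise its own exception types that should be listed in `retry_on`.

```python
import random
import time

def with_retries(operation, max_attempts=4, base_delay=0.2, max_delay=5.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run operation(); on a retryable error, wait with exponential backoff plus jitter and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))

# Simulated flaky call: fails twice, then succeeds
state = {"calls": 0}

def flaky_query():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("vector database unreachable")
    return ["doc_1", "doc_7"]

waits = []
result = with_retries(flaky_query, sleep=waits.append)  # record waits instead of sleeping
print(result, state["calls"], len(waits))  # ['doc_1', 'doc_7'] 3 2
```

Passing `sleep` as a parameter keeps the helper easy to test; in production the default `time.sleep` applies, or an async variant would use `asyncio.sleep`.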
Data Lifecycle Management
Vector databases need ongoing maintenance as your data evolves:
- Updates: When source documents change, re-embed and upsert the new vectors. Most vector databases support upsert (insert or update by ID).
- Deletions: Remove vectors when source documents are deleted. Use metadata filters to find and delete related vectors.
- Re-indexing: When you change embedding models (e.g., upgrading from text-embedding-ada-002 to text-embedding-3-small), you must re-embed all documents. Plan for zero-downtime migration using a blue-green approach: build a new collection, switch traffic, then delete the old one.
- Namespace management: Use namespaces or collections to separate different data sources, environments (dev/staging/prod), or tenants in multi-tenant applications.
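One way to keep updates and deletions cheap is to store a content hash with each vector and re-embed only what changed. The sketch below assumes each document has a stable ID and that the hash of its text was saved as metadata at indexing time; `plan_sync` and the data shapes are illustrative, not part of any database's API.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(source_docs, indexed_hashes):
    """Compare current source documents with what is already indexed.

    source_docs:    {doc_id: text} as it exists now
    indexed_hashes: {doc_id: hash} stored as metadata when each vector was written
    Returns (ids to re-embed and upsert, ids to delete).
    """
    to_upsert = sorted(
        doc_id for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(indexed_hashes) - set(source_docs))
    return to_upsert, to_delete

source = {"a": "hello", "b": "changed text", "c": "new doc"}
indexed = {
    "a": content_hash("hello"),      # unchanged
    "b": content_hash("old text"),   # modified since indexing
    "d": content_hash("gone"),       # deleted from the source
}

to_upsert, to_delete = plan_sync(source, indexed)
print(to_upsert)  # ['b', 'c']
print(to_delete)  # ['d']
```

This only helps while the embedding model stays the same; a model change still requires re-embedding everything, as described in the re-indexing point.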