Intermediate

Vector Databases & Retrieval

Vector databases are the backbone of retrieval-augmented generation (RAG) and multimodal AI applications. They store high-dimensional embeddings and enable fast similarity search across millions of documents, images, and other data types.

Why Vector Databases Matter

LLMs have knowledge cutoffs and limited context windows. Vector databases solve this by giving your AI application access to external knowledge through semantic search:

Semantic search: Find documents by meaning, not just keywords. "How do I fix a memory leak?" matches "debugging OOM errors in Java" even though they share no words.
RAG pipelines: Retrieve relevant context before generating a response, grounding the LLM in factual data.
Multimodal search: Search images with text, or find similar audio clips, using shared embedding spaces like CLIP.
Recommendation: Find similar items (products, articles, users) based on embedding similarity.
Deduplication: Detect near-duplicate content by comparing embedding distances.

How Vector Search Works

Vector search finds the nearest neighbors to a query vector in high-dimensional space. Since exact nearest neighbor search is too slow for large datasets, vector databases use Approximate Nearest Neighbor (ANN) algorithms:

ANN Algorithms

HNSW (Hierarchical Navigable Small World): The most popular algorithm. Builds a multi-layer graph where each layer is a "small world" network. Offers excellent recall (accuracy) and query speed. Used by Qdrant, Weaviate, pgvector, and most others.
IVF (Inverted File Index): Partitions vectors into clusters using k-means. Queries only search the nearest clusters. Faster indexing than HNSW but slightly lower recall. Used by Milvus and FAISS.
PQ (Product Quantization): Compresses vectors by splitting them into sub-vectors and quantizing each. Reduces memory usage 10-50x at the cost of some accuracy. Often combined with IVF (IVF-PQ).
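The IVF idea above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not how Milvus or FAISS implement it: the helper names (`build_ivf`, `ivf_search`, `exact_search`) and the parameter `nprobe` are chosen for this example, and the k-means loop is deliberately naive.

```python
import numpy as np


def assign_to_centroids(vectors, centroids):
    """Return the index of the nearest centroid for every vector."""
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def build_ivf(vectors, n_clusters, n_iters=10, seed=0):
    """Toy IVF index: k-means centroids plus one inverted list per cluster."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(vectors), size=n_clusters, replace=False)
    centroids = vectors[start].copy()
    for _ in range(n_iters):
        assignment = assign_to_centroids(vectors, centroids)
        for c in range(n_clusters):
            members = vectors[assignment == c]
            if len(members) > 0:  # leave a centroid unchanged if its cluster is empty
                centroids[c] = members.mean(axis=0)
    assignment = assign_to_centroids(vectors, centroids)
    inverted_lists = [np.flatnonzero(assignment == c) for c in range(n_clusters)]
    return centroids, inverted_lists


def ivf_search(query, vectors, centroids, inverted_lists, k=5, nprobe=2):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]


def exact_search(query, vectors, k=5):
    """Brute-force baseline: compare the query against every vector."""
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k]
```

Raising `nprobe` scans more clusters, which trades speed for recall; probing every cluster reduces to the exact search.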

Distance Metrics

Metric	Formula	Range	Best For
Cosine Similarity	cos(A, B) = A·B / (‖A‖ × ‖B‖)	-1 to 1	Text embeddings (most common)
Euclidean (L2)	d = sqrt(sum((a-b)²))	0 to ∞	Image embeddings, spatial data
Dot Product	A·B = sum(a×b)	-∞ to ∞	Normalized embeddings (equivalent to cosine)

💡

Which metric to use? For text embeddings from OpenAI, Cohere, or Voyage AI, use cosine similarity. Many embedding models return unit-length vectors, in which case cosine similarity and dot product give identical scores and Euclidean distance produces the same ranking. When in doubt, use the metric the embedding model was trained with; Euclidean distance is common for spatial data and some image embeddings.
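The three metrics from the table can be written directly in NumPy. The sketch below is only meant to make the formulas concrete; the function names are chosen for this example and are not part of any vector database API.

```python
import numpy as np


def cosine_similarity(a, b):
    """cos(A, B) = A·B / (‖A‖ × ‖B‖)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a, b):
    """d = sqrt(sum((a - b)²))"""
    return float(np.linalg.norm(a - b))


def dot_product(a, b):
    """A·B = sum(a × b)"""
    return float(np.dot(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# On unit-length vectors, dot product and cosine similarity coincide,
# and squared Euclidean distance equals 2 - 2 * cosine similarity.
a_unit, b_unit = normalize(a), normalize(b)
print(cosine_similarity(a, b), dot_product(a_unit, b_unit))
print(euclidean_distance(a_unit, b_unit) ** 2, 2 - 2 * cosine_similarity(a, b))
```

This is why the choice between cosine and dot product stops mattering once vectors are normalized.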

Vector Database Comparison

Database	Type	Scale	Filtering	Hosting	Pricing
Pinecone	Managed cloud	Billions	Rich metadata filters	Fully managed	Free tier + pay-per-use
Weaviate	Open source / Cloud	Billions	GraphQL + filters	Self-hosted or Weaviate Cloud	Open source / Cloud pricing
ChromaDB	Open source	Millions	Metadata filters	Self-hosted (embedded or server)	Free (open source)
Qdrant	Open source / Cloud	Billions	Advanced payload filters	Self-hosted or Qdrant Cloud	Open source / Cloud pricing
Milvus	Open source / Cloud	Billions+	Scalar + vector filters	Self-hosted or Zilliz Cloud	Open source / Cloud pricing
pgvector	PostgreSQL extension	Millions	Full SQL	Any PostgreSQL host	Free (extension)

Embedding Pipelines

Before storing data in a vector database, you must convert it to embeddings using a model. The pipeline is: raw data → chunking → embedding model → vector database.
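The chunking step of that pipeline can be sketched as a word-based splitter with overlap. This is a minimal illustration, assuming whitespace-separated words are an acceptable unit; the function name `chunk_text` and the default sizes are assumptions made for this example, and production pipelines often split on tokens, sentences, or document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size words, sharing overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each returned chunk would then be passed to the embedding model and stored with an ID and metadata that point back to the source document.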

Python — Full Pipeline with ChromaDB

import chromadb
from chromadb.utils import embedding_functions
from typing import Optional

# Initialize ChromaDB (persistent storage)
client = chromadb.PersistentClient(path="./chroma_data")

# Use OpenAI embeddings (or any supported model)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-openai-api-key",
    model_name="text-embedding-3-small",
)

# Create a collection
collection = client.get_or_create_collection(
    name="technical_docs",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"},  # Use cosine similarity
)

# Add documents with metadata
collection.add(
    documents=[
        "vLLM uses PagedAttention to manage GPU memory efficiently for LLM inference.",
        "ChromaDB is an open-source embedding database designed for AI applications.",
        "HNSW is a graph-based algorithm for approximate nearest neighbor search.",
        "Quantization reduces model size by lowering numerical precision from FP16 to INT4.",
        "RAG combines retrieval with generation to ground LLM outputs in factual data.",
        "BentoML packages ML models as production-ready API services.",
        "Kubernetes autoscaling adjusts pod count based on CPU, memory, or custom metrics.",
        "Cosine similarity measures the angle between two vectors, ignoring magnitude.",
    ],
    ids=[f"doc_{i}" for i in range(8)],
    metadatas=[
        {"category": "inference", "difficulty": "advanced"},
        {"category": "database", "difficulty": "beginner"},
        {"category": "algorithms", "difficulty": "intermediate"},
        {"category": "optimization", "difficulty": "advanced"},
        {"category": "rag", "difficulty": "intermediate"},
        {"category": "serving", "difficulty": "intermediate"},
        {"category": "infrastructure", "difficulty": "advanced"},
        {"category": "algorithms", "difficulty": "beginner"},
    ],
)

# Query: semantic search
results = collection.query(
    query_texts=["How do I serve large language models efficiently?"],
    n_results=3,
)
print("Top 3 results:")
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"  [{dist:.3f}] {doc}")

# Query with metadata filtering
filtered_results = collection.query(
    query_texts=["search algorithms"],
    n_results=3,
    where={"difficulty": "beginner"},  # Only beginner-level docs
)

# Query with combined filters
advanced_results = collection.query(
    query_texts=["GPU optimization"],
    n_results=5,
    where={
        "$and": [
            {"category": {"$in": ["inference", "optimization"]}},
            {"difficulty": "advanced"},
        ]
    },
)

Hybrid Search

Pure vector search can miss keyword-specific matches (e.g., exact product names, error codes). Hybrid search combines vector similarity with keyword/BM25 search for better recall:

Python — Hybrid Search with Weaviate

import weaviate
from weaviate.classes.query import MetadataQuery, HybridFusion

# Connect to Weaviate
client = weaviate.connect_to_local()  # or weaviate.connect_to_weaviate_cloud(...) for Weaviate Cloud

# Create a collection with both vector and keyword indexing
from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    name="TechArticle",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small",
    ),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
        Property(name="date", data_type=DataType.DATE),
    ],
)

# Add documents
articles = client.collections.get("TechArticle")
articles.data.insert_many([
    {"title": "PagedAttention in vLLM", "content": "vLLM implements PagedAttention to manage KV cache memory...", "category": "inference"},
    {"title": "CUDA Error OOM Debugging", "content": "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB...", "category": "debugging"},
    {"title": "Embedding Models Comparison", "content": "text-embedding-3-small vs voyage-3 vs cohere-embed-v3...", "category": "embeddings"},
])

# Hybrid search: combines BM25 keyword + vector similarity
results = articles.query.hybrid(
    query="CUDA out of memory error",
    alpha=0.5,  # 0 = pure keyword, 1 = pure vector, 0.5 = balanced
    fusion_type=HybridFusion.RELATIVE_SCORE,
    limit=5,
    return_metadata=MetadataQuery(score=True, explain_score=True),
)

for obj in results.objects:
    print(f"Score: {obj.metadata.score:.3f} | {obj.properties['title']}")
    print(f"  {obj.properties['content'][:100]}...")

# Pure vector search (semantic only)
vector_results = articles.query.near_text(
    query="How to fix GPU memory issues",
    limit=5,
)

# Pure keyword search (BM25 only)
keyword_results = articles.query.bm25(
    query="CUDA OOM RuntimeError",
    limit=5,
)

client.close()
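The role of `alpha` in the hybrid query above can be illustrated with a small score-fusion sketch. It assumes each retriever returns a mapping from document ID to raw score, min-max normalizes each set, and blends them. The helper names `min_max_normalize` and `fuse_scores` are hypothetical, and this is a simplified picture of relative-score fusion rather than Weaviate's actual implementation.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    keyword_scores: dict[str, float],
    vector_scores: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Blend scores: alpha = 0 is pure keyword, alpha = 1 is pure vector."""
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    fused = {}
    for doc_id in set(kw) | set(vec):
        # A document missing from one result list contributes 0 from that side
        fused[doc_id] = alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Normalizing first matters because BM25 scores and vector similarities live on different scales; without it, one side would dominate regardless of `alpha`.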

pgvector: Vector Search in PostgreSQL

If you already use PostgreSQL, pgvector adds vector search without introducing a new database. This is ideal for applications where vector search is one feature alongside ordinary relational data:

SQL + Python — pgvector with PostgreSQL

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    category TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    embedding vector(1536)  -- OpenAI text-embedding-3-small dimension
);

-- Create an HNSW index for fast similarity search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Insert a document with its embedding
INSERT INTO documents (title, content, category, embedding)
VALUES (
    'vLLM Performance Guide',
    'vLLM achieves high throughput using PagedAttention...',
    'inference',
    '[0.023, -0.041, 0.078, ...]'::vector  -- 1536-dim embedding
);

-- Semantic search: find 5 most similar documents
SELECT id, title, content,
       1 - (embedding <=> '[query_embedding_here]'::vector) AS similarity
FROM documents
ORDER BY embedding <=> '[query_embedding_here]'::vector
LIMIT 5;

-- Combined: vector search + SQL filtering
SELECT id, title, content,
       1 - (embedding <=> '[query_embedding]'::vector) AS similarity
FROM documents
WHERE category = 'inference'
  AND created_at > NOW() - INTERVAL '30 days'
ORDER BY embedding <=> '[query_embedding]'::vector
LIMIT 10;

Python — pgvector with asyncpg

import asyncpg
import numpy as np
from typing import Optional
from openai import AsyncOpenAI
from pgvector.asyncpg import register_vector

# Async client so embedding calls do not block the event loop
openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def setup_db():
    conn = await asyncpg.connect(
        "postgresql://user:pass@localhost:5432/mydb"
    )
    # Teach asyncpg how to encode and decode the vector type
    await register_vector(conn)
    return conn

async def embed_and_store(conn, title: str, content: str, category: str):
    # Generate embedding
    response = await openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=content,
    )
    embedding = response.data[0].embedding

    # Store in PostgreSQL
    await conn.execute(
        "INSERT INTO documents (title, content, category, embedding) VALUES ($1, $2, $3, $4)",
        title, content, category, np.array(embedding),
    )

async def search(conn, query: str, category: Optional[str] = None, limit: int = 5):
    # Embed the query
    response = await openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    )
    query_embedding = np.array(response.data[0].embedding)

    # Search with optional category filter
    if category:
        rows = await conn.fetch(
            """SELECT id, title, content,
                      1 - (embedding <=> $1) AS similarity
               FROM documents
               WHERE category = $2
               ORDER BY embedding <=> $1
               LIMIT $3""",
            query_embedding, category, limit,
        )
    else:
        rows = await conn.fetch(
            """SELECT id, title, content,
                      1 - (embedding <=> $1) AS similarity
               FROM documents
               ORDER BY embedding <=> $1
               LIMIT $2""",
            query_embedding, limit,
        )
    return rows

Multimodal Search with CLIP

CLIP (Contrastive Language-Image Pre-training) creates a shared embedding space for text and images. This enables searching images with text queries and vice versa:

Python — Multimodal Search: Text-to-Image with CLIP + Qdrant

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import glob

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Initialize Qdrant
qdrant = QdrantClient(url="http://localhost:6333")
qdrant.create_collection(
    collection_name="images",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

# Index images
image_files = glob.glob("./images/*.jpg")
points = []
for idx, img_path in enumerate(image_files):
    image = Image.open(img_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_image_features(**inputs)
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)  # Normalize

    points.append(PointStruct(
        id=idx,
        vector=embedding[0].tolist(),
        payload={"path": img_path, "filename": img_path.split("/")[-1]},
    ))

qdrant.upsert(collection_name="images", points=points)

# Search images with a text query
def search_images(text_query: str, limit: int = 5):
    inputs = processor(text=[text_query], return_tensors="pt")
    with torch.no_grad():
        text_embedding = model.get_text_features(**inputs)
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

    results = qdrant.search(
        collection_name="images",
        query_vector=text_embedding[0].tolist(),
        limit=limit,
    )
    return [(hit.payload["filename"], hit.score) for hit in results]

# Find images of "a dog playing in the snow"
matches = search_images("a dog playing in the snow")
for filename, score in matches:
    print(f"  {score:.3f} - {filename}")

Reranking

Vector search retrieves candidates quickly but imperfectly. Reranking uses a more expensive cross-encoder model to re-score and re-order the initial results for higher precision:

Python — Reranking with Cohere

import cohere

co = cohere.Client("your-cohere-api-key")

def search_with_reranking(query: str, collection, top_k: int = 20, final_k: int = 5):
    # Step 1: Broad vector search (retrieve more than needed)
    results = collection.query(
        query_texts=[query],
        n_results=top_k,  # Retrieve 20 candidates
    )

    # Step 2: Rerank with Cohere
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=results["documents"][0],
        top_n=final_k,  # Return top 5 after reranking
    )

    # Step 3: Return reranked results
    final_docs = []
    for result in reranked.results:
        final_docs.append({
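            # Look up the text by index; result.document is only set when return_documents=True
            "text": results["documents"][0][result.index],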
            "relevance_score": result.relevance_score,
            "original_index": result.index,
        })
    return final_docs

# Without reranking: vector search alone often returns decent results
# With reranking: precision usually improves, especially for nuanced queries
# The size of the gain depends on the data and models, so measure it on your own evaluation set

Performance Tuning

Three areas account for most of the tuning work on production vector database workloads: index parameters, batch operations, and connection pooling.

Index Parameters

HNSW m (connections per node): Higher values (32-64) improve recall but increase memory and build time. Default 16 is good for most cases.
HNSW ef_construction: Higher values (200-500) build a higher-quality index, which improves recall, but take longer to build. This is a build-time setting; query latency is tuned separately with ef_search.
HNSW ef_search: Higher values improve recall at query time but increase latency. Tune this based on your recall vs. speed requirements.
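Tuning these parameters requires a way to measure recall. One common approach, sketched below, is to compare the IDs returned by the approximate index against the IDs from an exact (brute-force) search for the same query. The function name `recall_at_k` is chosen for this example.

```python
def recall_at_k(approx_ids: list, exact_ids: list, k: int) -> float:
    """Fraction of the true top-k neighbors that the approximate search returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth & set(approx_ids[:k])
    return len(found) / len(truth)


# The approximate search missed one of the four true neighbors (ID 4)
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))
```

Averaging this value over a sample of representative queries at several ef_search settings shows where additional latency stops buying meaningful recall.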

Batch Operations

Always insert vectors in batches (100-1000 at a time), not one by one.
Use async/parallel embedding generation to saturate your embedding model.
Pre-compute embeddings offline for large datasets rather than computing at insert time.
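Batching can be handled by a small generator that groups any iterable into fixed-size lists. This is a minimal sketch using only the standard library; the name `batched` and the batch size are choices for this example.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of up to batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


for batch in batched(range(5), 2):
    print(batch)
```

Each batch would then be sent in a single insert or upsert call to the vector database client instead of one call per vector.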

Connection Pooling

Use connection pools for database clients (especially pgvector with asyncpg).
Keep persistent connections to managed services (Pinecone, Weaviate Cloud).
Set appropriate timeouts and retry logic for network-based vector databases.
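Retry logic for a network-based vector database can be sketched as a wrapper with exponential backoff and jitter. The helper `with_retries`, its defaults, and the choice of retryable exception types are assumptions for this example; real clients raise their own exception classes, which would be passed in through `retry_on`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt, capped at max_delay, plus random jitter
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

A call such as a query or upsert would be wrapped in a zero-argument function or lambda and passed as `operation`. The jitter spreads out retries so that many clients do not hit a recovering service at the same moment.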

Data Lifecycle Management

Vector databases need ongoing maintenance as your data evolves:

Updates: When source documents change, re-embed and upsert the new vectors. Most vector databases support upsert (insert or update by ID).
Deletions: Remove vectors when source documents are deleted. Use metadata filters to find and delete related vectors.
Re-indexing: When you change embedding models (e.g., upgrading from text-embedding-ada-002 to text-embedding-3-small), you must re-embed all documents. Plan for zero-downtime migration using a blue-green approach: build a new collection, switch traffic, then delete the old one.
Namespace management: Use namespaces or collections to separate different data sources, environments (dev/staging/prod), or tenants in multi-tenant applications.
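Deciding which vectors to update or delete can be sketched with content hashes. The example below assumes a hash of each document's text was stored alongside its vector (for instance in metadata) at indexing time; the function names `content_hash` and `plan_sync` are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(
    source_docs: dict[str, str],
    stored_hashes: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Compare current source documents with what the vector store holds.

    Returns (ids_to_upsert, ids_to_delete).
    """
    # New or changed documents need to be re-embedded and upserted
    to_upsert = [
        doc_id
        for doc_id, text in source_docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]
    # Vectors whose source document no longer exists should be removed
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in source_docs]
    return sorted(to_upsert), sorted(to_delete)
```

Unchanged documents are skipped entirely, which avoids paying for embeddings that would come out identical.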

✅

Practical advice: Start with ChromaDB or pgvector for prototyping. Move to Qdrant or Weaviate when you need advanced features (hybrid search, multimodal). Use Pinecone or Milvus (Zilliz) when you need fully managed, billion-scale deployment. The embedding model often matters more than the database choice: a better embedding model improves results across any vector store.
