Intermediate

Embedding & Vector Indexing

The embedding model and vector database you choose determine how well your RAG system can find relevant documents. This lesson covers how to select the right embedding model, compare vector databases for production use, choose indexing algorithms, and implement hybrid search.

Embedding Model Selection

An embedding model converts text into a dense vector (array of numbers) that captures semantic meaning. Similar texts produce similar vectors, enabling similarity-based retrieval.
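To make "similar texts produce similar vectors" concrete, the sketch below computes cosine similarity, a common way to compare embeddings. The three-dimensional vectors and their names are hypothetical stand-ins; real models emit hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" chosen by hand for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(cat, kitten)   # related concepts: close to 1
sim_far = cosine_similarity(cat, invoice)    # unrelated concepts: close to 0
```

Retrieval then amounts to ranking stored chunks by this score against the query vector.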

Embedding Model Comparison

Model	Dimensions	Max Tokens	MTEB Score	Cost	Notes
text-embedding-3-large (OpenAI)	3072	8191	64.6	$0.13/1M tokens	Best commercial option, supports dimension reduction
text-embedding-3-small (OpenAI)	1536	8191	62.3	$0.02/1M tokens	Great cost/quality ratio, good default
embed-v4 (Cohere)	1024	512	64.2	$0.10/1M tokens	Excellent multilingual support
nomic-embed-text	768	8192	62.4	Free (self-host)	Best open-source option, long context
bge-large-en-v1.5 (BAAI)	1024	512	63.6	Free (self-host)	Strong English-only performance
all-MiniLM-L6-v2	384	256	56.3	Free (self-host)	Fastest, lowest memory, good for prototyping

💡

Apply at work: Start with text-embedding-3-small for commercial projects (best cost/quality). Use nomic-embed-text if you need to self-host or avoid API dependencies. Only upgrade to text-embedding-3-large if evaluation metrics show the small model is limiting retrieval quality.
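A rough way to compare the two OpenAI models before committing is to estimate embedding cost and raw vector storage from the table above. The sketch below assumes a hypothetical corpus of 200,000 chunks averaging 500 tokens each, and assumes vectors are stored as 4-byte floats with no index overhead; actual storage depends on the database.

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    """One-time cost to embed a corpus at a per-1M-token price."""
    return total_tokens / 1_000_000 * price_per_million

def raw_vector_storage_bytes(num_vectors: int, dimensions: int,
                             bytes_per_value: int = 4) -> int:
    """Raw vector size only; assumes float32 values and ignores index overhead."""
    return num_vectors * dimensions * bytes_per_value

# Hypothetical corpus size, not taken from any real project.
num_chunks = 200_000
total_tokens = num_chunks * 500  # 100M tokens

small_cost = embedding_cost_usd(total_tokens, 0.02)   # text-embedding-3-small
large_cost = embedding_cost_usd(total_tokens, 0.13)   # text-embedding-3-large

small_bytes = raw_vector_storage_bytes(num_chunks, 1536)
large_bytes = raw_vector_storage_bytes(num_chunks, 3072)
```

Under these assumptions the small model costs about $2 versus about $13 for the large one, and needs roughly 1.2 GB of raw vector storage versus roughly 2.5 GB.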

Embedding Code Example

from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], model="text-embedding-3-small") -> list[list[float]]:
    """Batch-embed texts using OpenAI API."""
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    return [item.embedding for item in response.data]

# Embed all chunks in a single API call (batching saves latency)
chunks = ["chunk 1 text...", "chunk 2 text..."]  # placeholder texts; up to 2048 inputs per call
vectors = embed_texts(chunks)

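# For self-hosted models (requires the sentence-transformers package):
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
# nomic-embed-text expects a task prefix on every input: "search_document: " for
# indexed chunks and "search_query: " for queries (see the model card).
# The result is 768-dimensional, so size the vector collection accordingly.
vectors = model.encode([f"search_document: {c}" for c in chunks], normalize_embeddings=True)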
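Because a single call accepts at most 2048 inputs, a larger corpus has to be split into batches. The helper below is a minimal sketch of that loop; `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in so the example runs without an API key. In practice you would pass a function such as `embed_texts`. Providers may also cap total tokens per request, so a smaller batch size can be necessary.

```python
from typing import Callable, Iterator

def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 2048,
) -> list[list[float]]:
    """Embed texts batch by batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch: list[str]) -> list[list[float]]:
    """Stand-in embedder (hypothetical): one number per text, for demonstration only."""
    return [[float(len(text))] for text in batch]

all_chunks = [f"chunk {i}" for i in range(5000)]
batch_sizes = [len(b) for b in batched(all_chunks, 2048)]  # three calls instead of 5000
all_vectors = embed_in_batches(all_chunks, fake_embed)
```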

Vector Database Comparison

A vector database stores your embeddings and provides fast approximate nearest neighbor (ANN) search. Here is how the main options compare for production RAG:

Database	Type	Hosting	Filtering	Hybrid Search	Best For
Pinecone	Managed SaaS	Cloud only	Excellent	Yes (sparse+dense)	Teams wanting zero ops, fast setup
Weaviate	Open-source	Self-host or cloud	Excellent	Yes (BM25+vector)	Teams needing flexibility + hybrid search
Qdrant	Open-source	Self-host or cloud	Excellent	Yes (sparse vectors)	High performance, Rust-based, great filtering
pgvector	PostgreSQL extension	Self-host	Full SQL	Manual (combine with tsvector)	Teams already using PostgreSQL, <1M vectors
Chroma	Open-source	Self-host	Basic	No	Prototyping, local development
Milvus	Open-source	Self-host or Zilliz cloud	Good	Yes	Very large scale (>100M vectors)

📝

Production recommendation: Use pgvector if you already have PostgreSQL and fewer than 1M vectors — it avoids adding a new database to your stack. Use Qdrant or Weaviate for dedicated vector workloads with advanced filtering and hybrid search. Use Pinecone if you want fully managed with zero operational overhead.

Setup Examples

# --- Qdrant Setup ---
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

# Insert vectors with metadata
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=i,
            vector=vectors[i],
            payload={"text": chunks[i], "source": "manual.pdf", "page": 5}
        )
        for i in range(len(chunks))
    ]
)

# --- pgvector Setup ---
# CREATE EXTENSION vector;
# CREATE TABLE documents (
#     id SERIAL PRIMARY KEY,
#     content TEXT,
#     embedding vector(1536),
#     metadata JSONB
# );
# CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost/ragdb")
cur = conn.cursor()
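cur.execute(
    "INSERT INTO documents (content, embedding, metadata) VALUES (%s, %s, %s)",
    # Assumes vectors[0] is a plain Python list: str() then yields "[0.1, 0.2, ...]",
    # which pgvector accepts as vector input.
    (chunks[0], str(vectors[0]), '{"source": "manual.pdf", "page": 5}')
)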
conn.commit()
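The insert above stores vectors; retrieving them is a matter of ordering by distance. The sketch below builds a cosine-distance query against the `documents` table using pgvector's `<=>` operator. The helper names are hypothetical, and the block only constructs the SQL and parameters so it runs without a database.

```python
def to_pgvector_literal(vector: list[float]) -> str:
    """Format a list of floats as pgvector's text input, e.g. "[0.1,0.2,0.3]"."""
    return "[" + ",".join(str(x) for x in vector) + "]"

# <=> is pgvector's cosine distance operator; 1 - distance gives cosine similarity.
SEARCH_SQL = """
SELECT id, content, 1 - (embedding <=> %s::vector) AS cosine_similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
"""

def build_search_query(query_vector: list[float], limit: int = 5):
    """Return the SQL and parameters for a nearest-neighbor lookup."""
    literal = to_pgvector_literal(query_vector)
    return SEARCH_SQL, (literal, literal, limit)

sql, params = build_search_query([0.1, 0.2, 0.3], limit=3)
# With a live connection: cur.execute(sql, params); rows = cur.fetchall()
```

Ordering by the same operator class the index was built with (`vector_cosine_ops` here) is what lets PostgreSQL use the HNSW index for the query.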

Indexing Algorithms: HNSW vs IVF

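Vector databases use approximate nearest neighbor (ANN) algorithms to search large collections in milliseconds, trading a small amount of recall for speed instead of comparing the query against every stored vector. The two most common algorithms are: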

Algorithm	How It Works	Query Speed	Index Build Time	Memory	Best For
HNSW	Builds a multi-layer graph of vectors; navigates from coarse to fine layers	Very fast (sub-ms at 1M vectors)	Slower to build	High (stores graph)	Low-latency production (<10M vectors)
IVF	Clusters vectors into buckets; searches only nearby clusters	Fast (configurable via nprobe)	Faster to build	Lower	Very large datasets (>10M vectors)

💡

Apply at work: Use HNSW for most RAG systems. It is the default index in Qdrant and Weaviate, and pgvector supports it through CREATE INDEX ... USING hnsw (pgvector falls back to exact search when no index exists). HNSW provides the best latency-accuracy tradeoff for datasets under 10M vectors. Only consider IVF when you have 10M+ vectors and need to optimize memory usage.
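To show what "searches only nearby clusters" means for IVF, here is a toy sketch in plain Python. It is not how any particular database implements IVF: the centroids are fixed by hand (real systems learn them, typically with k-means), and all names and data points are hypothetical.

```python
import math

Vector = tuple[float, float]

def build_ivf(vectors: list[Vector], centroids: list[Vector]) -> dict[int, list[int]]:
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets: dict[int, list[int]] = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        buckets[nearest].append(vec_id)
    return buckets

def ivf_search(query: Vector, vectors: list[Vector], centroids: list[Vector],
               buckets: dict[int, list[int]], nprobe: int = 1, top_k: int = 1) -> list[int]:
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in buckets[c]]
    candidates.sort(key=lambda vec_id: math.dist(query, vectors[vec_id]))
    return candidates[:top_k]

# Hand-picked 2-D points; vector 4 sits near the boundary between the two clusters.
vectors = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.2, 4.9), (2.4, 2.4)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
buckets = build_ivf(vectors, centroids)

query = (2.6, 2.6)
approx = ivf_search(query, vectors, centroids, buckets, nprobe=1)  # misses vector 4
exact = ivf_search(query, vectors, centroids, buckets, nprobe=2)   # scans every bucket
```

With `nprobe=1` the search skips the bucket holding the true nearest neighbor, which is the recall cost of IVF; raising `nprobe` recovers it at the price of scanning more vectors.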

Hybrid Search: Vector + BM25

Pure vector search misses exact keyword matches (e.g., product SKUs, error codes, proper nouns). Hybrid search combines vector similarity with BM25 keyword scoring for the best of both worlds.
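One simple way to picture the combination is a weighted blend of normalized scores, where a weight alpha controls how much the vector score counts relative to the keyword score. The sketch below is a generic illustration under that assumption, not the fusion algorithm of any specific database; the document names and scores are made up.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and BM25 scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def blend_scores(vector_scores: dict[str, float],
                 keyword_scores: dict[str, float],
                 alpha: float = 0.7) -> list[str]:
    """Rank documents by alpha * vector score + (1 - alpha) * keyword score."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    docs = set(v) | set(k)
    blended = {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0) for d in docs}
    return sorted(blended, key=blended.get, reverse=True)

# Hypothetical scores for the query "error code E-4021 in production".
vector_scores = {"faq": 0.82, "runbook": 0.80, "changelog": 0.40}   # cosine similarity
keyword_scores = {"runbook": 12.0, "changelog": 9.0, "faq": 1.0}    # BM25

hybrid_ranking = blend_scores(vector_scores, keyword_scores, alpha=0.7)
vector_only = blend_scores(vector_scores, keyword_scores, alpha=1.0)
```

Here the runbook, which contains the exact error code, overtakes the semantically similar FAQ once keyword evidence is given 30% of the weight.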

# Hybrid search with Weaviate (v3 Python client syntax; the v4 client uses a different API)
import weaviate

weaviate_client = weaviate.Client("http://localhost:8080")

result = weaviate_client.query.get(
    "Document", ["content", "source", "page"]
).with_hybrid(
    query="error code E-4021 in production",
    alpha=0.7  # 0.7 = 70% vector, 30% keyword
).with_limit(5).do()

# alpha=1.0 = pure vector search
# alpha=0.0 = pure keyword (BM25) search
# alpha=0.5-0.7 = recommended hybrid blend

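# Hybrid search with Qdrant (sparse + dense vectors)
from qdrant_client import QdrantClient
from qdrant_client.models import NamedSparseVector, SparseVector

qdrant = QdrantClient(url="http://localhost:6333")

# Dense (semantic) search; the query is embedded with the same model as the chunks
query_embedding = embed_texts(["error code E-4021 in production"])[0]
dense_results = qdrant.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=10
)

# Sparse (BM25-like) search using SPLADE or BM25 sparse vectors.
# Assumes the collection was created with a sparse vector named "bm25";
# the indices and values below are placeholders for a real sparse encoding.
sparse_results = qdrant.search(
    collection_name="documents",
    query_vector=NamedSparseVector(
        name="bm25",
        vector=SparseVector(indices=[1, 5, 100], values=[0.5, 0.8, 0.3])
    ),
    limit=10
)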

# Combine with reciprocal rank fusion (RRF)
def reciprocal_rank_fusion(dense_ids, sparse_ids, k=60):
    scores = {}
    for rank, doc_id in enumerate(dense_ids):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(sparse_ids):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
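A worked example helps show why RRF favors documents that rank well in both lists. The sketch below is a variant of the function above that returns the scores themselves; `rrf_scores` and the document ids are hypothetical.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranking a document appears in (rank starts at 1)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    return scores

dense_ids = ["a", "b", "c"]    # ranking from dense (semantic) search
sparse_ids = ["c", "a", "d"]   # ranking from sparse (keyword) search

scores = rrf_scores([dense_ids, sparse_ids])
fused = sorted(scores, key=scores.get, reverse=True)
# a: 1/61 + 1/62, c: 1/63 + 1/61, b: 1/62, d: 1/63
# fused == ["a", "c", "b", "d"]
```

Documents "a" and "c" appear in both lists and so outrank "b" and "d", which each appear in only one. Because only ranks are used, the raw dense and sparse scores never need to share a scale.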

Key Takeaways

Choose text-embedding-3-small as your default embedding model; upgrade only when evaluation proves it necessary.
Use pgvector if you already have PostgreSQL and <1M vectors. Use Qdrant or Weaviate for dedicated vector workloads.
HNSW indexing is the right default for most RAG systems — fast queries with excellent recall.
Implement hybrid search (vector + BM25) to catch both semantic matches and exact keyword matches.
Batch your embedding API calls (up to 2048 texts per call) to cut round trips and latency; pricing is per token, so batching does not change the embedding cost itself.
