Intermediate

Embedding & Vector Indexing

The embedding model and vector database you choose determine how well your RAG system can find relevant documents. This lesson covers how to select the right embedding model, compare vector databases for production use, choose indexing algorithms, and implement hybrid search.

Embedding Model Selection

An embedding model converts text into a dense vector (array of numbers) that captures semantic meaning. Similar texts produce similar vectors, enabling similarity-based retrieval.
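To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are made-up toy values, not real model output, and the variable names are hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values for illustration only)
refund_policy = [0.9, 0.1, 0.0]
return_policy = [0.8, 0.2, 0.1]
gpu_drivers = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(refund_policy, return_policy)
sim_unrelated = cosine_similarity(refund_policy, gpu_drivers)
print(sim_related > sim_unrelated)
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.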

Embedding Model Comparison

| Model | Dimensions | Max Tokens | MTEB Score | Cost | Notes |
| --- | --- | --- | --- | --- | --- |
| text-embedding-3-large (OpenAI) | 3072 | 8191 | 64.6 | $0.13/1M tokens | Best commercial option, supports dimension reduction |
| text-embedding-3-small (OpenAI) | 1536 | 8191 | 62.3 | $0.02/1M tokens | Great cost/quality ratio, good default |
| embed-v4 (Cohere) | 1024 | 512 | 64.2 | $0.10/1M tokens | Excellent multilingual support |
| nomic-embed-text | 768 | 8192 | 62.4 | Free (self-host) | Best open-source option, long context |
| bge-large-en-v1.5 (BAAI) | 1024 | 512 | 63.6 | Free (self-host) | Strong English-only performance |
| all-MiniLM-L6-v2 | 384 | 256 | 56.3 | Free (self-host) | Fastest, lowest memory, good for prototyping |
💡 Apply at work: Start with text-embedding-3-small for commercial projects (best cost/quality). Use nomic-embed-text if you need to self-host or avoid API dependencies. Only upgrade to text-embedding-3-large if evaluation metrics show the small model is limiting retrieval quality.
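One way to check whether a model is limiting retrieval quality is to measure recall@k on a labeled query set for each candidate model. The sketch below assumes you already have, for each query, the ranked document IDs returned by retrieval and the set of IDs judged relevant. The function name and the sample data are hypothetical.

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Mean fraction of relevant documents found in the top-k results per query."""
    per_query = []
    for ranked_ids, relevant_ids in zip(retrieved, relevant):
        if not relevant_ids:
            continue  # skip queries with no labeled relevant documents
        hits = len(set(ranked_ids[:k]) & relevant_ids)
        per_query.append(hits / len(relevant_ids))
    return sum(per_query) / len(per_query) if per_query else 0.0

# Invented results for two queries from two embedding models
relevant = [{"doc1", "doc4"}, {"doc7"}]
small_model = [["doc1", "doc2", "doc3"], ["doc9", "doc8", "doc7"]]
large_model = [["doc1", "doc4", "doc3"], ["doc7", "doc8", "doc9"]]

small_recall = recall_at_k(small_model, relevant, k=2)
large_recall = recall_at_k(large_model, relevant, k=2)
print(small_recall, large_recall)
```

If the gap between models is small on your own queries, the cheaper model is probably sufficient.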

Embedding Code Example

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Batch-embed texts using the OpenAI API."""
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    return [item.embedding for item in response.data]

# Embed all chunks in a single API call (batching saves latency)
chunks = ["chunk 1 text...", "chunk 2 text..."]  # ...and so on, up to 2048 inputs per call
vectors = embed_texts(chunks)

# For self-hosted models:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
# nomic-embed-text expects a task prefix: "search_document: " for indexed text,
# "search_query: " for queries.
# Stored under a separate name: these are 768-dim, unlike the 1536-dim OpenAI vectors.
local_vectors = model.encode(
    [f"search_document: {chunk}" for chunk in chunks],
    normalize_embeddings=True
)
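If you have more chunks than fit in one request, a small helper can split them into batches. This sketch assumes the 2048-inputs-per-call limit noted above and takes the embedding function as a parameter so it works with either backend. `embed_in_batches` and the stand-in embedder in the usage lines are hypothetical names.

```python
from typing import Callable

def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 2048,
) -> list[list[float]]:
    """Call embed_fn on consecutive slices of texts and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    all_vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        all_vectors.extend(embed_fn(texts[start:start + batch_size]))
    return all_vectors

# Stand-in embedder for illustration: one number per text (its length)
calls = []

def fake_embed(batch: list[str]) -> list[list[float]]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]

result = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(calls)
```

In practice you would pass the real embedding function in place of `fake_embed`.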

Vector Database Comparison

A vector database stores your embeddings and provides fast approximate nearest neighbor (ANN) search. Here is how the main options compare for production RAG:

| Database | Type | Hosting | Filtering | Hybrid Search | Best For |
| --- | --- | --- | --- | --- | --- |
| Pinecone | Managed SaaS | Cloud only | Excellent | Yes (sparse+dense) | Teams wanting zero ops, fast setup |
| Weaviate | Open-source | Self-host or cloud | Excellent | Yes (BM25+vector) | Teams needing flexibility + hybrid search |
| Qdrant | Open-source | Self-host or cloud | Excellent | Yes (sparse vectors) | High performance, Rust-based, great filtering |
| pgvector | PostgreSQL extension | Self-host | Full SQL | Manual (combine with tsvector) | Teams already using PostgreSQL, <1M vectors |
| Chroma | Open-source | Self-host | Basic | No | Prototyping, local development |
| Milvus | Open-source | Self-host or Zilliz cloud | Good | Yes | Very large scale (>100M vectors) |
📝 Production recommendation: Use pgvector if you already have PostgreSQL and fewer than 1M vectors, since it avoids adding a new database to your stack. Use Qdrant or Weaviate for dedicated vector workloads with advanced filtering and hybrid search. Use Pinecone if you want a fully managed service with zero operational overhead.

Setup Examples

# --- Qdrant Setup ---
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Named `qdrant` so it does not shadow the OpenAI `client` used by embed_texts()
qdrant = QdrantClient(url="http://localhost:6333")

# Create collection; size must match the embedding model
# (1536 for text-embedding-3-small)
qdrant.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

# Insert vectors with metadata
qdrant.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=i,
            vector=vectors[i],
            # Example metadata; use each chunk's real source and page
            payload={"text": chunks[i], "source": "manual.pdf", "page": 5}
        )
        for i in range(len(chunks))
    ]
)

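# --- pgvector Setup ---
# CREATE EXTENSION IF NOT EXISTS vector;
# CREATE TABLE documents (
#     id SERIAL PRIMARY KEY,
#     content TEXT,
#     embedding vector(1536),
#     metadata JSONB
# );
# CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

import json
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost/ragdb")
cur = conn.cursor()

# One chunk and its embedding (a plain Python list of floats)
chunk_text, vector = chunks[0], vectors[0]
cur.execute(
    "INSERT INTO documents (content, embedding, metadata) VALUES (%s, %s, %s)",
    # str() of a Python list gives '[0.1, 0.2, ...]', which pgvector accepts;
    # a NumPy array would need .tolist() first
    (chunk_text, str(vector), json.dumps({"source": "manual.pdf", "page": 5}))
)
conn.commit()
cur.close()
conn.close()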
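Querying the table is the other half of the setup. The sketch below assumes the `documents` table defined above and pgvector's `<=>` cosine-distance operator. `to_pgvector_literal`, `TOP_K_SQL`, and `search_documents` are hypothetical helper names, and the cursor is expected to be an open psycopg2 cursor.

```python
def to_pgvector_literal(vector) -> str:
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"

# <=> is pgvector's cosine distance, so similarity = 1 - distance
TOP_K_SQL = """
    SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity
    FROM documents
    ORDER BY embedding <=> %s::vector
    LIMIT %s
"""

def search_documents(cur, query_vector, k: int = 5):
    """Run a top-k cosine search using an open database cursor."""
    literal = to_pgvector_literal(query_vector)
    cur.execute(TOP_K_SQL, (literal, literal, k))
    return cur.fetchall()

print(to_pgvector_literal([0.5, 1, -2.25]))
```

Ordering by the same distance expression that the HNSW index was built on (`vector_cosine_ops`) is what lets PostgreSQL use that index.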

Indexing Algorithms: HNSW vs IVF

Vector databases use approximate nearest neighbor (ANN) algorithms to search billions of vectors in milliseconds. The two most common algorithms are:

| Algorithm | How It Works | Query Speed | Index Build Time | Memory | Best For |
| --- | --- | --- | --- | --- | --- |
| HNSW | Builds a multi-layer graph of vectors; navigates from coarse to fine layers | Very fast (sub-ms at 1M vectors) | Slower to build | High (stores graph) | Low-latency production (<10M vectors) |
| IVF | Clusters vectors into buckets; searches only nearby clusters | Fast (configurable via nprobe) | Faster to build | Lower | Very large datasets (>10M vectors) |
💡 Apply at work: Use HNSW for most RAG systems. It is the default index in Qdrant and Weaviate, and pgvector supports it alongside IVFFlat (pgvector runs exact search until you create an index). HNSW provides the best latency-accuracy tradeoff for datasets under 10M vectors. Only consider IVF when you have 10M+ vectors and need to optimize memory usage.
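To show what "searches only nearby clusters" means, here is a toy IVF index in pure Python. It is a simplified sketch: real IVF implementations train centroids with k-means, whereas this one just picks random vectors as centroids. All function names and the random data are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_search(vectors, query, k):
    """Brute-force baseline: score every vector."""
    ranked = sorted(range(len(vectors)), key=lambda i: squared_distance(vectors[i], query))
    return ranked[:k]

def build_ivf(vectors, n_clusters, seed=0):
    """Toy index: random vectors act as centroids; each vector joins its nearest bucket."""
    rng = random.Random(seed)
    centroids = [vectors[i] for i in rng.sample(range(len(vectors)), n_clusters)]
    buckets = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(n_clusters), key=lambda c: squared_distance(vec, centroids[c]))
        buckets[nearest].append(idx)
    return centroids, buckets

def ivf_search(vectors, index, query, k, nprobe):
    """Search only the nprobe buckets whose centroids are closest to the query."""
    centroids, buckets = index
    probe_order = sorted(range(len(centroids)), key=lambda c: squared_distance(query, centroids[c]))
    candidates = [idx for c in probe_order[:nprobe] for idx in buckets[c]]
    candidates.sort(key=lambda i: squared_distance(vectors[i], query))
    return candidates[:k]

rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(500)]
query = [rng.random() for _ in range(8)]
index = build_ivf(data, n_clusters=10)

exact = exact_search(data, query, k=5)
approx = ivf_search(data, index, query, k=5, nprobe=2)   # scans 2 of 10 buckets
full = ivf_search(data, index, query, k=5, nprobe=10)    # scans every bucket
print(len(set(exact) & set(approx)), "of 5 exact neighbors found with nprobe=2")
```

Raising `nprobe` scans more buckets, trading speed for recall; probing every bucket reduces to exact search.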

Hybrid Search: Vector + BM25

Pure vector search can miss exact keyword matches (e.g., product SKUs, error codes, proper nouns). Hybrid search combines vector similarity with BM25 keyword scoring, so a query can match on meaning and on exact terms.

# Hybrid search with Weaviate
# Note: this is the v3 Python client syntax; the v4 client uses a different API.
import weaviate

weaviate_client = weaviate.Client("http://localhost:8080")

result = weaviate_client.query.get(
    "Document", ["content", "source", "page"]
).with_hybrid(
    query="error code E-4021 in production",
    alpha=0.7  # 0.7 = 70% vector, 30% keyword
).with_limit(5).do()

# alpha=1.0 = pure vector search
# alpha=0.0 = pure keyword (BM25) search
# alpha=0.5-0.7 = recommended hybrid blend
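
# Hybrid search with Qdrant (sparse + dense vectors)
from qdrant_client import QdrantClient
from qdrant_client.models import NamedSparseVector, SparseVector

qdrant = QdrantClient(url="http://localhost:6333")

# Dense vector for the query, from the same model used at indexing time
query_embedding = embed_texts(["error code E-4021 in production"])[0]

# Dense (semantic) search
dense_results = qdrant.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=10
)

# Sparse (BM25-like) search using SPLADE or BM25 sparse vectors.
# Assumes the collection was created with a sparse vector named "bm25"
# (via sparse_vectors_config); the indices and values below are placeholders.
sparse_results = qdrant.search(
    collection_name="documents",
    query_vector=NamedSparseVector(
        name="bm25",
        vector=SparseVector(indices=[1, 5, 100], values=[0.5, 0.8, 0.3])
    ),
    limit=10
)

# Combine with reciprocal rank fusion (RRF)
def reciprocal_rank_fusion(dense_ids, sparse_ids, k=60):
    """Score each ID by the sum of 1 / (k + rank) over both lists; ranks start at 1."""
    scores = {}
    for rank, doc_id in enumerate(dense_ids):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(sparse_ids):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused_ids = reciprocal_rank_fusion(
    [hit.id for hit in dense_results],
    [hit.id for hit in sparse_results]
)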
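Rank fusion ignores the raw scores. For comparison, the sketch below shows a score-based blend controlled by an `alpha` weight like the one in the Weaviate example. It min-max normalizes each result list and then mixes them. This is an illustrative approximation, not Weaviate's exact internal algorithm, and the document IDs and scores are invented.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def weighted_fusion(vector_scores, keyword_scores, alpha=0.7):
    """Blend normalized scores: alpha weights the vector side, 1 - alpha the keyword side."""
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(keyword_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    return sorted(fused, key=fused.get, reverse=True)

# Invented scores: cosine similarities and BM25 scores live on different scales
vector_scores = {"faq": 0.82, "runbook": 0.80, "changelog": 0.60}
keyword_scores = {"runbook": 12.0, "changelog": 3.0}

vector_only = weighted_fusion(vector_scores, keyword_scores, alpha=1.0)
blended = weighted_fusion(vector_scores, keyword_scores, alpha=0.5)
print(vector_only)
print(blended)
```

With `alpha=0.5` the document that contains the exact keyword moves ahead of the one that is only semantically close.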

Key Takeaways

  • Choose text-embedding-3-small as your default embedding model; upgrade only when evaluation proves it necessary.
  • Use pgvector if you already have PostgreSQL and <1M vectors. Use Qdrant or Weaviate for dedicated vector workloads.
  • HNSW indexing is the right default for most RAG systems: it gives fast queries with excellent recall.
  • Implement hybrid search (vector + BM25) to catch both semantic matches and exact keyword matches.
  • Batch your embedding API calls (up to 2048 texts per call) to cut round trips and latency.