Embedding & Vector Indexing
The embedding model and vector database you choose determine how well your RAG system can find relevant documents. This lesson covers how to select the right embedding model, compare vector databases for production use, choose indexing algorithms, and implement hybrid search.
Embedding Model Selection
An embedding model converts text into a dense vector (array of numbers) that captures semantic meaning. Similar texts produce similar vectors, enabling similarity-based retrieval.
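To make "similar texts produce similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure most retrieval systems use to compare embeddings. The three-dimensional vectors and their values are invented for illustration; real models emit hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (made-up values, not model output).
toy_vectors = {
    "reset my password": [0.9, 0.1, 0.0],
    "forgot login credentials": [0.8, 0.2, 0.1],
    "quarterly sales report": [0.0, 0.2, 0.9],
}

# Rank every text by similarity to the first one.
query = toy_vectors["reset my password"]
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query, toy_vectors[text]),
    reverse=True,
)
print(ranked)
```

The two account-related texts rank ahead of the unrelated sales text, which is the behavior similarity-based retrieval relies on.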
Embedding Model Comparison
| Model | Dimensions | Max Tokens | MTEB Score | Cost | Notes |
|---|---|---|---|---|---|
| text-embedding-3-large (OpenAI) | 3072 | 8191 | 64.6 | $0.13/1M tokens | Best commercial option, supports dimension reduction |
| text-embedding-3-small (OpenAI) | 1536 | 8191 | 62.3 | $0.02/1M tokens | Great cost/quality ratio, good default |
| embed-v4 (Cohere) | 1024 | 512 | 64.2 | $0.10/1M tokens | Excellent multilingual support |
| nomic-embed-text | 768 | 8192 | 62.4 | Free (self-host) | Best open-source option, long context |
| bge-large-en-v1.5 (BAAI) | 1024 | 512 | 63.6 | Free (self-host) | Strong English-only performance |
| all-MiniLM-L6-v2 | 384 | 256 | 56.3 | Free (self-host) | Fastest, lowest memory, good for prototyping |
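The dimension and price columns translate directly into storage and embedding cost. The sketch below runs that arithmetic for an assumed corpus of 1M chunks at roughly 500 tokens each (hypothetical numbers) and assumes float32 storage with no index overhead.

```python
def raw_vector_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone (float32), excluding index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9

def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    """One-off cost of embedding a corpus at a per-1M-token price."""
    return total_tokens / 1_000_000 * price_per_million

# Assumed corpus size; adjust to your own data.
num_chunks, tokens_per_chunk = 1_000_000, 500

for name, dims, price in [
    ("text-embedding-3-large", 3072, 0.13),
    ("text-embedding-3-small", 1536, 0.02),
]:
    size = raw_vector_size_gb(num_chunks, dims)
    cost = embedding_cost_usd(num_chunks * tokens_per_chunk, price)
    print(f"{name}: {size:.2f} GB raw vectors, ${cost:.2f} to embed")
```

Under these assumptions the large model needs about 12.3 GB and $65 versus about 6.1 GB and $10 for the small model, so the storage difference is often the larger ongoing concern.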
Start with text-embedding-3-small for commercial projects (best cost/quality ratio). Use nomic-embed-text if you need to self-host or avoid API dependencies. Upgrade to text-embedding-3-large only if evaluation metrics show the small model is limiting retrieval quality.
Embedding Code Example
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Batch-embed texts using the OpenAI API."""
    response = client.embeddings.create(
        input=texts,
        model=model,
    )
    return [item.embedding for item in response.data]

# Embed many chunks in a single API call (batching saves latency)
chunks = ["chunk 1 text...", "chunk 2 text..."]  # ...and so on, up to 2048 per call
vectors = embed_texts(chunks)

# For self-hosted models:
from sentence_transformers import SentenceTransformer

local_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
# nomic-embed-text expects a task prefix such as "search_document: " on each input
local_vectors = local_model.encode(
    [f"search_document: {chunk}" for chunk in chunks],
    normalize_embeddings=True,
)
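When a corpus exceeds the 2048-inputs-per-call limit noted above, the chunks have to be split across several requests. A minimal sketch of that batching loop follows; `embed_all` and `fake_embed` are hypothetical helper names, and the stand-in embedder exists only so the example runs without an API key.

```python
from typing import Callable, Iterator

MAX_INPUTS_PER_CALL = 2048  # per-request input limit mentioned above

def batched(items: list[str], batch_size: int = MAX_INPUTS_PER_CALL) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = MAX_INPUTS_PER_CALL,
) -> list[list[float]]:
    """Call embed_fn once per batch and concatenate results in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch: list[str]) -> list[list[float]]:
    """Stand-in embedder (hypothetical): one-dimensional 'vector' per text."""
    return [[float(len(text))] for text in batch]

all_chunks = [f"chunk {i}" for i in range(5000)]
batch_sizes = [len(b) for b in batched(all_chunks)]
all_vectors = embed_all(all_chunks, fake_embed)
print(batch_sizes, len(all_vectors))
```

In real use, `embed_fn` would be a function such as `embed_texts`. Providers may also cap the total tokens per request, so a production version would likely need a token-aware batch size as well.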
Vector Database Comparison
A vector database stores your embeddings and provides fast approximate nearest neighbor (ANN) search. Here is how the main options compare for production RAG:
| Database | Type | Hosting | Filtering | Hybrid Search | Best For |
|---|---|---|---|---|---|
| Pinecone | Managed SaaS | Cloud only | Excellent | Yes (sparse+dense) | Teams wanting zero ops, fast setup |
| Weaviate | Open-source | Self-host or cloud | Excellent | Yes (BM25+vector) | Teams needing flexibility + hybrid search |
| Qdrant | Open-source | Self-host or cloud | Excellent | Yes (sparse vectors) | High performance, Rust-based, great filtering |
| pgvector | PostgreSQL extension | Self-host | Full SQL | Manual (combine with tsvector) | Teams already using PostgreSQL, <1M vectors |
| Chroma | Open-source | Self-host | Basic | No | Prototyping, local development |
| Milvus | Open-source | Self-host or Zilliz cloud | Good | Yes | Very large scale (>100M vectors) |
Setup Examples
# --- Qdrant Setup ---
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")

# Create collection (size must match the embedding model: 1536 for text-embedding-3-small)
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Insert vectors with metadata
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=i,
            vector=vectors[i],
            payload={"text": chunks[i], "source": "manual.pdf", "page": 5},
        )
        for i in range(len(chunks))
    ],
)
# --- pgvector Setup ---
# CREATE EXTENSION vector;
# CREATE TABLE documents (
#     id SERIAL PRIMARY KEY,
#     content TEXT,
#     embedding vector(1536),
#     metadata JSONB
# );
# CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
import json
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost/ragdb")
cur = conn.cursor()

chunk_text, vector = chunks[0], vectors[0]  # one chunk and its embedding from above
cur.execute(
    "INSERT INTO documents (content, embedding, metadata) VALUES (%s, %s, %s)",
    # str() of a Python list yields "[0.1, 0.2, ...]", which pgvector parses as a vector literal
    (chunk_text, str(vector), json.dumps({"source": "manual.pdf", "page": 5})),
)
conn.commit()
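The setup above only inserts rows. Querying the table could look roughly like the sketch below, which assumes the `documents` schema shown in the comments and uses pgvector's `<=>` cosine-distance operator. `to_pgvector_literal` and `search_params` are hypothetical helper names, and the database call itself is left as a comment so the block runs without a server.

```python
def to_pgvector_literal(vector: list[float]) -> str:
    """Format a sequence of numbers as pgvector text input, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(float(x)) for x in vector) + "]"

# <=> returns cosine distance, so similarity = 1 - distance.
SEARCH_SQL = """
    SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity
    FROM documents
    WHERE metadata->>'source' = %s
    ORDER BY embedding <=> %s::vector
    LIMIT %s
"""

def search_params(query_vector: list[float], source: str, top_k: int = 5) -> tuple:
    """Build the parameter tuple matching the four placeholders in SEARCH_SQL."""
    literal = to_pgvector_literal(query_vector)
    return (literal, source, literal, top_k)

params = search_params([0.1, 0.2, 0.3], "manual.pdf")
print(params)

# With an open psycopg2 cursor (assumed, as in the setup above):
#   cur.execute(SEARCH_SQL, params)
#   rows = cur.fetchall()
```

Ordering by the same distance expression that the HNSW index covers is what lets PostgreSQL use the index, and the `WHERE` clause shows how ordinary SQL filtering combines with vector search.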
Indexing Algorithms: HNSW vs IVF
Vector databases use approximate nearest neighbor (ANN) algorithms to search billions of vectors in milliseconds. The two most common algorithms are:
| Algorithm | How It Works | Query Speed | Index Build Time | Memory | Best For |
|---|---|---|---|---|---|
| HNSW | Builds a multi-layer graph of vectors; navigates from coarse to fine layers | Very fast (sub-ms at 1M vectors) | Slower to build | High (stores graph) | Low-latency production (<10M vectors) |
| IVF | Clusters vectors into buckets; searches only nearby clusters | Fast (configurable via nprobe) | Faster to build | Lower | Very large datasets (>10M vectors) |
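The IVF row can be illustrated with a toy, pure-Python sketch: vectors are assigned to the bucket of their nearest centroid, and a query scans only the `nprobe` closest buckets. The data points are invented and the centroids are assumed to be precomputed (for example by k-means). Real implementations are far more optimized than this.

```python
import math

def l2(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(points: list[list[float]], centroids: list[list[float]]) -> dict[int, list[int]]:
    """Assign each point id to the bucket of its nearest centroid."""
    buckets: dict[int, list[int]] = {c: [] for c in range(len(centroids))}
    for point_id, point in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda c: l2(point, centroids[c]))
        buckets[nearest].append(point_id)
    return buckets

def ivf_search(query, points, centroids, buckets, nprobe=1, top_k=2) -> list[int]:
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [point_id for c in probe for point_id in buckets[c]]
    return sorted(candidates, key=lambda i: l2(query, points[i]))[:top_k]

# Hypothetical 2-D data with three obvious clusters.
points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.9, 0.1]]
centroids = [[0.1, 0.05], [5.1, 5.0], [10.0, 0.0]]
buckets = build_ivf(points, centroids)

query = [4.0, 4.0]
fast = ivf_search(query, points, centroids, buckets, nprobe=1)           # scans 2 points
wide = ivf_search(query, points, centroids, buckets, nprobe=3, top_k=3)  # scans all 5
print(buckets, fast, wide)
```

Raising `nprobe` trades speed for recall: with `nprobe=1` only one bucket is scanned, so a true neighbor sitting in an adjacent bucket would be missed.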
Hybrid Search: Vector + BM25
Pure vector search misses exact keyword matches (e.g., product SKUs, error codes, proper nouns). Hybrid search combines vector similarity with BM25 keyword scoring for the best of both worlds.
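One simple way to picture the blend is a weighted sum of normalized scores, controlled by a weight `alpha`. The sketch below is an illustration under that assumption, with invented scores. It is not a description of how any particular database fuses results internally.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the 0..1 range so the two score types are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def blend(vector_scores: dict[str, float], keyword_scores: dict[str, float],
          alpha: float = 0.7) -> dict[str, float]:
    """alpha weights the vector score; (1 - alpha) weights the keyword score."""
    v, k = min_max(vector_scores), min_max(keyword_scores)
    return {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }

# Hypothetical scores for the query "error code E-4021 in production".
vector_scores = {"doc_generic_errors": 0.82, "doc_e4021_runbook": 0.80, "doc_billing": 0.40}
keyword_scores = {"doc_e4021_runbook": 12.0, "doc_generic_errors": 3.0, "doc_billing": 0.0}

hybrid = blend(vector_scores, keyword_scores, alpha=0.7)
vector_only = blend(vector_scores, keyword_scores, alpha=1.0)
print(max(hybrid, key=hybrid.get), max(vector_only, key=vector_only.get))
```

With these made-up numbers, pure vector scoring ranks the generic error document first, while the blended score promotes the runbook that contains the exact code `E-4021`.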
# Hybrid search with Weaviate (v3 Python client syntax)
import weaviate

weaviate_client = weaviate.Client("http://localhost:8080")
result = (
    weaviate_client.query.get("Document", ["content", "source", "page"])
    .with_hybrid(
        query="error code E-4021 in production",
        alpha=0.7,  # 0.7 = 70% vector, 30% keyword
    )
    .with_limit(5)
    .do()
)
# alpha=1.0 = pure vector search
# alpha=0.0 = pure keyword (BM25) search
# alpha=0.5-0.7 = recommended hybrid blend
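# Hybrid search with Qdrant (sparse + dense vectors)
# Note: recent qdrant-client releases deprecate search() in favor of query_points()
from qdrant_client import QdrantClient
from qdrant_client.models import NamedSparseVector, SparseVector

qdrant = QdrantClient(url="http://localhost:6333")

# Dense (semantic) search; query_embedding is the embedded user query
dense_results = qdrant.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=10,
)

# Sparse (BM25-like) search using SPLADE or BM25 sparse vectors.
# Assumes the collection was created with a sparse vector field named "bm25".
sparse_results = qdrant.search(
    collection_name="documents",
    query_vector=NamedSparseVector(
        name="bm25",
        vector=SparseVector(indices=[1, 5, 100], values=[0.5, 0.8, 0.3]),
    ),
    limit=10,
)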
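# Combine with reciprocal rank fusion (RRF)
def reciprocal_rank_fusion(dense_ids, sparse_ids, k=60):
    """Return document IDs ordered by summed 1 / (k + rank) across both rankings."""
    scores = {}
    for rank, doc_id in enumerate(dense_ids):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(sparse_ids):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused_ids = reciprocal_rank_fusion(
    [hit.id for hit in dense_results],
    [hit.id for hit in sparse_results],
)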
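To see what reciprocal rank fusion does with concrete inputs, here is a small standalone variant that accepts any number of rankings and returns the fused scores as well. The document IDs are made up, and `rrf` is a hypothetical name for this generalized version.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) over every ranking a document appears in (rank starts at 1)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_ids = ["a", "b", "c"]   # hypothetical dense (semantic) ranking
sparse_ids = ["c", "a", "d"]  # hypothetical sparse (keyword) ranking

fused = rrf([dense_ids, sparse_ids])
order = [doc_id for doc_id, _ in fused]
print(order)
```

Documents that appear in both lists ("a" and "c") outrank those found by only one retriever, and because only ranks are used, the dense and sparse scores never need to be put on a common scale.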
Key Takeaways
- Choose text-embedding-3-small as your default embedding model; upgrade only when evaluation proves it necessary.
- Use pgvector if you already have PostgreSQL and <1M vectors. Use Qdrant or Weaviate for dedicated vector workloads.
- HNSW indexing is the right default for most RAG systems — fast queries with excellent recall.
- Implement hybrid search (vector + BM25) to catch both semantic matches and exact keyword matches.
- Batch your embedding API calls (up to 2048 texts per call) to reduce latency and per-request overhead.