Vector Database Best Practices
Production-ready tips for choosing, configuring, operating, and optimizing vector databases in real-world AI applications.
Choosing the Right Vector Database
- Start with Your Constraints: Before comparing features, identify your constraints: budget, team expertise, existing infrastructure, data volume, query latency requirements, and compliance needs. These narrow the field quickly.
- Prototype with ChromaDB, Scale with Others: Build your proof-of-concept with ChromaDB (fastest to set up). Once you validate the approach, migrate to a production database that matches your scale and operational requirements.
- Consider Operational Complexity: A managed service (Pinecone, Weaviate Cloud) costs more but saves engineering time. Self-hosted (Qdrant, Milvus) is cheaper but requires infrastructure expertise.
Index Configuration
- Use HNSW for most workloads. It offers the best balance of recall, speed, and simplicity. Start with default parameters and tune from there.
- Increase ef_search for higher recall. If search quality matters more than latency, raise ef_search (HNSW) or nprobe (IVF).
- Match dimensions to your model. The index dimension must exactly match your embedding model's output dimension.
- Use the same distance metric as your model. Most text embedding models are trained with cosine similarity. Using a different metric will give poor results.
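The metric mismatch is easy to demonstrate. In this plain-Python sketch (no client library assumed), two toy documents are ranked against a query: cosine similarity prefers the vector pointing in the same direction, while raw dot product prefers the longer vector. This is why an index with the wrong metric, or with unnormalized vectors, returns poor rankings.

```python
import math

def cosine(a, b):
    """Cosine similarity: compares direction only, ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dot_product(a, b):
    """Raw dot product: rewards magnitude as well as direction."""
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0]
docs = {
    "aligned": [0.6, 0.0],   # same direction as the query, but short
    "off_axis": [3.0, 3.0],  # 45 degrees off, but much longer
}

best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
best_by_dot = max(docs, key=lambda name: dot_product(query, docs[name]))
# The two metrics disagree on which document is the better match.
```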
Batch Operations
```python
from concurrent.futures import ThreadPoolExecutor

# `index` is assumed to be an existing client handle (e.g., a Pinecone index)
# exposing an `upsert(vectors=...)` method.

def batch_upsert(vectors, batch_size=100):
    """Upsert vectors in batches for efficiency."""
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        index.upsert(vectors=batch)
        print(f"Upserted batch {i // batch_size + 1}")

# For very large datasets, upsert batches in parallel.
def parallel_upsert(vectors, batch_size=100, max_workers=4):
    """Upsert in parallel for maximum throughput."""
    batches = [vectors[i:i + batch_size]
               for i in range(0, len(vectors), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(index.upsert, vectors=batch)
                   for batch in batches]
        for future in futures:
            future.result()  # Wait for completion and surface any errors
```
Monitoring and Maintenance
- Track query latency percentiles (p50, p95, p99), not just averages. For many workloads, a p99 under 100 ms indicates a healthy deployment.
- Monitor recall quality by periodically comparing ANN results against brute-force exact search on a sample.
- Watch index size and memory usage. HNSW indexes grow with data. Plan capacity accordingly.
- Set up alerts for query latency spikes, error rates, and storage thresholds.
- Log query patterns to identify popular queries, cache opportunities, and optimization targets.
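Latency percentiles are simple to compute from logged query times. A minimal nearest-rank sketch (the function name and the sample numbers are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 14, 90, 13, 16, 240, 14, 15, 13]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
# A single 240 ms outlier barely moves the average but dominates p99,
# which is why percentiles are the right thing to alert on.
```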
Cost Optimization
| Strategy | Impact | Trade-off |
|---|---|---|
| Use smaller embeddings | 2–4x storage savings | Slightly lower recall |
| Enable quantization | 4–8x memory savings | Lower recall accuracy |
| Use serverless | Pay only for usage | Possible cold starts |
| Cache frequent queries | Eliminate redundant searches | Stale results possible |
| Reduce metadata | Lower storage costs | Less flexible filtering |
| Partition by time | Archive old data | More complex queries |
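As a sense check on the quantization row above, here is a minimal scalar-quantization sketch: each float32 component (4 bytes) maps to one uint8 code (1 byte), a 4x memory saving, at the cost of a small reconstruction error. The function names are illustrative, not any database's API.

```python
def quantize(vec):
    """Map float components to uint8 codes in 0..255."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # guard against constant vectors
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + c * scale for c in codes]

vec = [0.12, -0.48, 0.93, 0.05]
codes, lo, scale = quantize(vec)
approx = dequantize(codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(vec, approx))
# Reconstruction error is bounded by half a quantization step (scale / 2).
```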
Security
- Never embed API keys in code. Use environment variables or secret managers (AWS Secrets Manager, Vault).
- Enable authentication on self-hosted deployments. Many vector databases ship with auth disabled by default.
- Use TLS/SSL for all connections, especially in production.
- Implement access control. Use namespaces, multi-tenancy, or row-level security to isolate data.
- Audit access logs to track who queries what data.
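A minimal sketch of the first point, reading the key from the environment instead of hard-coding it (the variable name VECTOR_DB_API_KEY is an arbitrary example; a secret manager would populate it at deploy time):

```python
import os

def load_api_key(var_name="VECTOR_DB_API_KEY"):
    """Read the API key from the environment; fail loudly if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; configure it via your secret manager"
        )
    return key
```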
Backup Strategies
- For managed services: Most providers handle backups automatically. Verify the backup frequency and retention policy.
- For self-hosted: Schedule regular snapshots of the data directory. Test restoration periodically.
- Keep a copy of raw embeddings separately (e.g., in object storage). If you lose the index, you can rebuild it from the raw vectors.
- Store the embedding model version. If you need to re-embed data, you must use the same model version for consistency.
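One way to follow the last two points together is to keep raw vectors alongside a small manifest recording the model version, so the index can always be rebuilt. A sketch using a local JSON file (the model name and path handling are illustrative; in practice this would go to object storage):

```python
import json

def save_backup(path, model_name, model_version, vectors):
    """Write raw vectors plus the embedding-model version needed to rebuild."""
    manifest = {
        "model": model_name,
        "version": model_version,
        "dim": len(next(iter(vectors.values()))),
        "vectors": vectors,  # mapping of id -> embedding
    }
    with open(path, "w") as f:
        json.dump(manifest, f)

def load_backup(path):
    """Read the manifest back; enough to rebuild the index from scratch."""
    with open(path) as f:
        return json.load(f)
```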
Scaling Patterns
- Vertical Scaling: Add more RAM and faster storage. HNSW indexes are memory-bound, so more RAM directly improves capacity. Effective up to ~10M vectors on a single node.
- Horizontal Scaling (Sharding): Distribute data across multiple nodes. Each shard holds a subset of vectors. Queries are fanned out to all shards and results are merged. Supported by Milvus, Weaviate, and Qdrant.
- Read Replicas: Add read-only replicas to handle more query traffic. Writes go to the primary, reads are distributed across replicas.
- Tiered Storage: Keep frequently accessed vectors in memory and older vectors on disk. Some databases (Milvus, Qdrant) support this natively.
Common Mistakes
- Mismatched dimensions: The index dimension must exactly match your embedding model output. A mismatch causes errors or garbage results.
- Wrong distance metric: Using Euclidean distance with a model trained for cosine similarity gives poor rankings.
- Not normalizing vectors: If your model does not output normalized vectors, normalize them before inserting if you use dot product.
- Mixing embedding models: All vectors in a collection must come from the same model. Mixing models makes similarity meaningless.
- Storing too much metadata: Large metadata payloads slow down queries. Store only what you need for filtering; keep full documents elsewhere.
- Ignoring index warm-up: HNSW indexes need to be loaded into RAM. First queries after restart may be slow.
- No evaluation pipeline: Not measuring search quality means you cannot tell if changes improve or degrade results.
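The first three mistakes can be caught with a small guard at ingest time. A hedged sketch (the EXPECTED_DIM of 384 is an example, e.g., a MiniLM-class model; set it to your model's actual output size):

```python
import math

EXPECTED_DIM = 384  # example: output size of the single model this collection uses

def prepare_vector(vec, expected_dim=EXPECTED_DIM):
    """Reject dimension mismatches and normalize for dot-product indexes."""
    if len(vec) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vec)}")
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]
```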
Frequently Asked Questions
How many vectors can a single machine handle?
It depends on the dimensions and index type. As a rough guide: with HNSW and 1536-dimension vectors, a machine with 32GB RAM can handle about 2–5 million vectors. With quantization, this can increase to 10–20 million.
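That rule of thumb can be reproduced with back-of-envelope arithmetic. The sketch below assumes 4-byte float components and a rough 1.5x overhead factor for the HNSW graph; the overhead varies by implementation, so treat the output as an estimate only.

```python
def hnsw_memory_gb(n_vectors, dim, bytes_per_component=4, graph_overhead=1.5):
    """Rough RAM estimate: raw vector bytes times an HNSW graph overhead factor."""
    return n_vectors * dim * bytes_per_component * graph_overhead / 1e9

# ~2M x 1536-dim float32 vectors: roughly 18 GB, comfortable on a 32 GB machine.
estimate = hnsw_memory_gb(2_000_000, 1536)
```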
Do I still need an embedding model?
A vector database stores and searches vectors, but it does not create them. You need an embedding model (OpenAI, Sentence Transformers, etc.) to convert your data into vectors first. Some databases like Weaviate have built-in vectorizers that handle this for you.
When is pgvector enough?
If you already use PostgreSQL and have fewer than 5 million vectors, pgvector is often the simplest choice. For larger scale, higher query throughput, or features like hybrid search, a dedicated vector database is better suited.
How do I measure search quality?
Create a test set with known relevant results for sample queries. Compute metrics like recall@k (what fraction of the true top-k results does the ANN search return?) and MRR (mean reciprocal rank). Aim for recall@10 above 95%.
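Both metrics are only a few lines each. The sketch below assumes result lists are sequences of document ids, with exact (brute-force) results serving as ground truth:

```python
def recall_at_k(ann_ids, exact_ids, k):
    """Fraction of the true top-k that the ANN search returned."""
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant hit per query (0 if absent)."""
    total = 0.0
    for ranked_ids, relevant_id in zip(results, relevant):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id == relevant_id:
                total += 1.0 / rank
                break
    return total / len(results)
```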
What happens if I switch embedding models?
You must re-embed all your data with the new model and create a new index. Vectors from different models live in different vector spaces and cannot be compared. Plan for this by keeping your raw text data accessible.
Lilly Tech Systems