Embedding Best Practices
Production-ready guidelines for choosing dimensions, preprocessing text, handling long documents, optimizing costs, and avoiding common mistakes.
Choosing Embedding Dimensions
- Start with the default. Use the model's native dimensions unless you have a specific reason to reduce them.
- 1536 is the sweet spot for most production applications (OpenAI text-embedding-3-small). Good quality, reasonable storage.
- Reduce dimensions for cost savings. OpenAI's v3 models accept a dimensions parameter and can be reduced to 512 or 256 with roughly 2-5% quality loss. Test on your own data to confirm the quality is acceptable.
- 384 for prototyping. Sentence Transformers' all-MiniLM-L6-v2 is fast and free, perfect for development.
- Higher is not always better. Beyond 1536, improvements are marginal for most retrieval tasks. The extra storage and compute may not be justified.
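To make the trade-off concrete, here is a minimal sketch of shortening a vector by hand and estimating raw storage. It assumes the model was trained so that the leading dimensions carry most of the signal, which you should verify for your model. The helper names `shorten_embedding` and `storage_bytes` are hypothetical, and the storage figure assumes 4-byte float32 values with no index overhead.

```python
import math

def shorten_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # all-zero vector: nothing to rescale
    return [x / norm for x in head]

def storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage, assuming float32 values and no index overhead."""
    return num_vectors * dims * bytes_per_value

# Compare raw storage for one million vectors at two sizes.
full = storage_bytes(1_000_000, 1536)
small = storage_bytes(1_000_000, 512)
print(f"1536 dims: {full / 1e9:.2f} GB, 512 dims: {small / 1e9:.2f} GB")
```

Rescaling after truncation keeps cosine and dot-product scores comparable across vectors. Whether the shortened vectors are good enough is an empirical question, so measure retrieval quality before committing.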
Preprocessing Text for Better Embeddings
- Clean the Text. Remove HTML tags, excessive whitespace, and special characters that do not carry meaning. Keep punctuation and capitalization, because modern models use them for context.
- Do NOT Over-Clean. Unlike traditional NLP, you should NOT lowercase, stem, or remove stopwords for embedding models. These models are trained on natural text and perform best with it.
- Add Context When Helpful. If a document chunk lacks context, prepend metadata: "Category: Technology. Title: Vector Databases. Content: ..." This helps the model understand the topic.
- Handle Structured Data Carefully. For tables and JSON, convert to natural language: "The product name is Widget X, priced at $29.99, with 4.5 stars." This embeds better than raw structured data.
```python
import re

def preprocess_for_embedding(text, title=None, category=None):
    """Clean text for embedding while preserving meaning."""
    # Replace HTML tags with a space so adjacent words do not merge
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove URLs (optional - depends on use case)
    text = re.sub(r'https?://\S+', '', text)
    # Normalize whitespace last, so gaps left by removed tags and URLs collapse
    text = re.sub(r'\s+', ' ', text).strip()
    # Prepend context if available
    prefix = ""
    if category:
        prefix += f"Category: {category}. "
    if title:
        prefix += f"Title: {title}. "
    return prefix + text

# Example
clean = preprocess_for_embedding(
    " <p>Vector databases store embeddings</p> ",
    title="Vector DB Guide",
    category="Technology"
)
# "Category: Technology. Title: Vector DB Guide. Vector databases store embeddings"
```
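The advice on structured data can be sketched the same way. The function below is one possible approach, not a standard API: `record_to_text` and its `templates` argument are hypothetical names, and it assumes a flat dictionary with simple values.

```python
def record_to_text(record, templates=None):
    """Render a flat dict as one sentence of short clauses."""
    templates = templates or {}
    clauses = []
    for key, value in record.items():
        if value is None or value == "":
            continue  # skip empty fields rather than embedding "None"
        label = key.replace("_", " ")
        # Fall back to a generic clause when no template is given for the field
        template = templates.get(key, "the {label} is {value}")
        clauses.append(template.format(label=label, value=value))
    if not clauses:
        return ""
    sentence = ", ".join(clauses) + "."
    return sentence[0].upper() + sentence[1:]

product = {"product_name": "Widget X", "price": 29.99, "rating": 4.5, "color": None}
text = record_to_text(
    product,
    templates={"price": "priced at ${value}", "rating": "with {value} stars"},
)
print(text)
# The product name is Widget X, priced at $29.99, with 4.5 stars.
```

Per-field templates keep the output readable for fields where a generic clause would sound awkward. Nested records would need flattening first.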
Handling Long Documents
Every embedding model has a maximum context length (e.g., 8,191 tokens for OpenAI, 512 for many open-source models). For longer documents, you need a chunking strategy:
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size chunks | Split by character/token count with overlap | Simple, consistent chunk sizes |
| Semantic chunking | Split at natural boundaries (paragraphs, sections) | Preserving coherent ideas |
| Recursive splitting | Try paragraphs first, then sentences, then by size | Balanced approach (LangChain default) |
| Sentence window | Embed individual sentences; retrieve surrounding context | Fine-grained retrieval |
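```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks. Sizes are counted in words, not tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # Last chunk reached; avoid emitting a tail that is pure overlap
        start = end - overlap  # Overlap for context continuity
    return chunks

# Example: chunk a long document
document = "Your very long document text here..."
chunks = chunk_text(document, chunk_size=300, overlap=50)
print(f"Created {len(chunks)} chunks")

# Embed each chunk separately
# embed() is a placeholder for your embedding call (API client or local model)
embeddings = [embed(chunk) for chunk in chunks]
```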
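The recursive splitting row in the table can be illustrated with a short sketch. This is a simplified version written for this guide, not LangChain's implementation: it tries blank-line paragraph breaks, then sentence ends, then fixed word windows, and packs small neighbouring pieces back together. The names `recursive_split` and `merge_small` are hypothetical, and sizes are counted in words rather than tokens.

```python
import re

def merge_small(pieces, max_words):
    """Greedily pack adjacent pieces into chunks of at most max_words."""
    chunks, current, count = [], [], 0
    for piece in pieces:
        n = len(piece.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(piece)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def recursive_split(text, max_words=300):
    """Split on paragraphs, then sentences, then fixed word windows."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words:
        return [text]
    # Try coarser boundaries first: blank lines, then sentence ends.
    for pattern in (r"\n\s*\n", r"(?<=[.!?])\s+"):
        parts = [p for p in re.split(pattern, text) if p.strip()]
        if len(parts) > 1:
            pieces = []
            for part in parts:
                pieces.extend(recursive_split(part, max_words))
            return merge_small(pieces, max_words)
    # No natural boundary left: fall back to fixed-size word windows.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```

Compared with fixed-size chunking, this keeps paragraphs and sentences intact whenever they fit, at the cost of uneven chunk sizes. It does not add overlap between chunks.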
Cost Optimization
- Cache embeddings. Never embed the same text twice. Use a hash-based cache (text hash → embedding vector).
- Batch API calls. Send multiple texts per API request. OpenAI supports up to 2,048 texts per batch.
- Use smaller dimensions. Reduce OpenAI embeddings from 1536 to 512 to cut storage by 67%.
- Choose the right model. text-embedding-3-small at $0.02/1M tokens is 6.5x cheaper than text-embedding-3-large at $0.13/1M tokens.
- Use open-source for dev. Run sentence-transformers locally for development and testing; switch to API models for production if needed.
- Precompute and store. Embed your corpus once and store the vectors. Only re-embed when the text changes or you switch models.
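Caching and batching combine naturally. The sketch below assumes you supply your own `embed_batch` callable that takes a list of texts and returns one vector per text. That callable, the `embed_with_cache` name, and the dictionary used as a cache are all illustrative stand-ins for whatever client and store you use.

```python
import hashlib

def embed_with_cache(texts, embed_batch, cache, model="example-model", batch_size=100):
    """Return one vector per text, calling embed_batch only for cache misses."""
    def key_for(text):
        # Include the model name so switching models never reuses stale vectors
        return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()

    keys = [key_for(t) for t in texts]
    # Collect cache misses once each, preserving first-seen order
    missing = {}
    for key, text in zip(keys, texts):
        if key not in cache and key not in missing:
            missing[key] = text

    pending = list(missing.items())
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        vectors = embed_batch([text for _, text in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = vector

    return [cache[k] for k in keys]

# Usage with a fake embedder that records how many texts each call received
calls = []
def fake_embed_batch(batch):
    calls.append(len(batch))
    return [[float(len(t))] for t in batch]

cache = {}
embed_with_cache(["a", "bb", "a"], fake_embed_batch, cache)
embed_with_cache(["a", "ccc"], fake_embed_batch, cache)
print(calls)  # [2, 1]
```

The duplicate "a" and the repeated call are served from the cache, so only three texts are ever embedded. In production the dictionary would typically be replaced by a persistent key-value store.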
Monitoring Embedding Quality
- Set up a test suite. Create 50–100 query-document pairs with known relevance. Run Recall@10 weekly to detect quality regressions.
- Track similarity distributions. Monitor the distribution of similarity scores in production. A sudden shift may indicate a model issue or data drift.
- Log search feedback. Track which search results users click. Low click-through on top results suggests poor embedding quality.
- A/B test model changes. When considering a new model, run both side by side and compare user engagement metrics.
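The test suite idea can be sketched in a few lines. Here `search` is a hypothetical function you provide that returns a ranked list of document IDs for a query, and the test cases are pairs of a query and the IDs known to be relevant.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each test case needs at least one relevant document")
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)

def mean_recall_at_k(test_cases, search, k=10):
    """Average Recall@k over (query, relevant_ids) pairs."""
    scores = [
        recall_at_k(search(query, k), relevant_ids, k)
        for query, relevant_ids in test_cases
    ]
    return sum(scores) / len(scores)

# Usage with a stand-in search function backed by canned results
canned = {
    "what is a vector database": ["d1", "d7", "d3"],
    "how to chunk documents": ["d4", "d2"],
}
def fake_search(query, k):
    return canned[query][:k]

suite = [
    ("what is a vector database", ["d3", "d9"]),  # one of two relevant docs found
    ("how to chunk documents", ["d2"]),           # the relevant doc is found
]
print(mean_recall_at_k(suite, fake_search, k=10))  # 0.75
```

Storing the score from each run makes regressions visible when a model, chunking strategy, or preprocessing step changes.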
Version Management
- Record the model name and version for every embedding in your database (as metadata).
- Never mix embeddings from different models in the same index/collection.
- Plan for re-embedding. When you upgrade to a new model, you must re-embed all data. Budget time and cost for this.
- Keep raw text accessible. Always store the original text alongside or linked to the embedding so you can re-embed.
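One way to enforce these rules in code is to store the model identity on every record and reject mismatches at insert time. The sketch below uses an in-memory dictionary as a stand-in for a real vector database. The class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredEmbedding:
    doc_id: str
    text: str            # raw text kept so the record can be re-embedded later
    vector: tuple
    model: str
    model_version: str

class SingleModelIndex:
    """In-memory index that only accepts vectors from one model version."""

    def __init__(self, model, model_version):
        self.model = model
        self.model_version = model_version
        self.records = {}

    def add(self, record):
        if (record.model, record.model_version) != (self.model, self.model_version):
            raise ValueError(
                f"index holds {self.model}/{self.model_version}, "
                f"got {record.model}/{record.model_version}"
            )
        self.records[record.doc_id] = record

    def texts_to_reembed(self):
        """Raw texts keyed by document ID, ready for a model upgrade."""
        return {doc_id: r.text for doc_id, r in self.records.items()}

index = SingleModelIndex("model-a", "v1")
index.add(StoredEmbedding("d1", "hello world", (0.1, 0.2), "model-a", "v1"))
print(index.texts_to_reembed())  # {'d1': 'hello world'}
```

When upgrading, build a new index for the new model from `texts_to_reembed()` and switch over once it is complete, rather than writing new vectors into the old index.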
Common Mistakes
- Wrong model for the task: Using a general-purpose model for code search, or a monolingual model for multilingual content.
- Wrong distance metric: Using Euclidean distance when the model was trained for cosine similarity. Check your model's documentation.
- No normalization: If your model does not normalize vectors and you rank by dot product, longer vectors score higher regardless of direction, so rankings can be wrong. Either normalize manually or use cosine similarity.
- Truncation without awareness: Silently truncating text that exceeds the model's context window. You lose information without knowing it. Chunk intentionally instead.
- Embedding structured data as-is: Feeding JSON or CSV directly to an embedding model. Convert to natural language first.
- Ignoring asymmetric search: Some models (Voyage, Cohere, E5) treat queries and documents differently, through an input-type setting or text prefixes such as E5's "query: " and "passage: ". Omitting them degrades quality significantly.
- Not caching: Re-embedding unchanged text wastes money and adds latency.
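The normalization mistake is easy to reproduce with toy vectors. The sketch below uses plain Python lists and made-up two-dimensional vectors purely for illustration. Real embeddings have hundreds of dimensions, but the effect is the same.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of the unit-length versions of a and b."""
    return dot(l2_normalize(a), l2_normalize(b))

query = [1.0, 0.0]
aligned = [2.0, 0.0]   # same direction as the query
longer = [3.0, 3.0]    # 45 degrees off, but larger magnitude

# Raw dot product ranks `longer` first because of its magnitude
print(dot(query, aligned), dot(query, longer))
# Cosine similarity ranks `aligned` first because only direction counts
print(cosine_similarity(query, aligned), cosine_similarity(query, longer))
```

If your vectors are already unit length, dot product and cosine similarity give the same ranking, and dot product is cheaper to compute.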
Frequently Asked Questions
How many dimensions do I need?
For most text retrieval tasks, 768–1536 dimensions are sufficient. Higher dimensions give diminishing returns. If cost is a concern, try 512 dimensions with OpenAI's text-embedding-3-small, which retains ~95% of the quality at 1/3 the storage.
Should I embed whole documents or chunks?
Almost always chunks. A single embedding for a long document dilutes the meaning. Chunk into 200–500 token pieces so each embedding captures a specific topic or idea. This gives much better retrieval precision.
Can I compare embeddings from different models?
No. Embeddings from different models exist in completely different vector spaces. A query embedded with Model A cannot meaningfully be compared to documents embedded with Model B. Always use the same model for both query and document embeddings.
How do I handle multilingual content?
Use a multilingual embedding model like Cohere embed-multilingual-v3.0 or the multilingual variants of sentence-transformers. These models encode all languages into a shared vector space, enabling cross-lingual search.
How much does it cost to embed a large corpus?
With OpenAI's text-embedding-3-small at $0.02 per million tokens, embedding 1 million documents of ~500 tokens each costs about $10. Storage costs depend on your vector database. For zero API cost, use open-source sentence-transformers locally.
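The cost arithmetic is simple enough to script for your own corpus. The prices below are the ones quoted in this guide and may be out of date, so treat them as inputs to check. The function name `embedding_cost_usd` is hypothetical.

```python
def embedding_cost_usd(num_documents, avg_tokens_per_doc, price_per_million_tokens):
    """One-time API cost of embedding a corpus, excluding storage."""
    total_tokens = num_documents * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

# 1 million documents of about 500 tokens each
small = embedding_cost_usd(1_000_000, 500, 0.02)  # text-embedding-3-small price
large = embedding_cost_usd(1_000_000, 500, 0.13)  # text-embedding-3-large price
print(f"small: ${small:.2f}, large: ${large:.2f}")
# small: $10.00, large: $65.00
```

Remember that re-embedding after a model change incurs this cost again, so the estimate is worth keeping alongside your upgrade plan.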