Advanced

Embedding Best Practices

Production-ready guidelines for choosing dimensions, preprocessing text, handling long documents, optimizing costs, and avoiding common mistakes.

Choosing Embedding Dimensions

  • Start with the default. Use the model's native dimensions unless you have a specific reason to reduce them.
  • 1536 is the sweet spot for most production applications (OpenAI text-embedding-3-small). Good quality, reasonable storage.
  • Reduce dimensions for cost savings. OpenAI's v3 models can be reduced to 512 or 256 with ~2-5% quality loss. Test on your data to confirm acceptable quality.
  • 384 for prototyping. Sentence Transformers' all-MiniLM-L6-v2 is fast and free, perfect for development.
  • Higher is not always better. Beyond 1536, improvements are marginal for most retrieval tasks. The extra storage and compute may not be justified.
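
To make the dimension-reduction point concrete, the sketch below shortens a vector by keeping its leading components and rescaling the result to unit length. This is only reasonable for models trained to tolerate shortening (OpenAI's v3 models accept a `dimensions` request parameter for this purpose); for any other model, treat truncation as an assumption to validate on your own data. The `shorten_embedding` helper and the toy vector are hypothetical, not part of any library.

```python
import math

def shorten_embedding(vector, target_dims):
    """Keep the first target_dims components and rescale to unit length."""
    if not 0 < target_dims <= len(vector):
        raise ValueError("target_dims must be between 1 and len(vector)")
    truncated = vector[:target_dims]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        return list(truncated)  # an all-zero vector cannot be normalized
    return [x / norm for x in truncated]

# Toy 1536-dimensional vector standing in for a real embedding
full = [math.sin(i) for i in range(1536)]
short = shorten_embedding(full, 512)
print(len(short))  # 512
```

Re-normalizing matters because a truncated vector no longer has unit length, which would skew dot-product scores.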

Preprocessing Text for Better Embeddings

  1. Clean the Text

    Remove HTML tags, excessive whitespace, and special characters that do not carry meaning. Keep punctuation and capitalization — modern models use them for context.

  2. Do NOT Over-Clean

    Unlike traditional NLP, you should NOT lowercase, stem, or remove stopwords for embedding models. These models are trained on natural text and perform best with it.

  3. Add Context When Helpful

    If a document chunk lacks context, prepend metadata: "Category: Technology. Title: Vector Databases. Content: ..." This helps the model understand the topic.

  4. Handle Structured Data Carefully

    For tables and JSON, convert to natural language: "The product name is Widget X, priced at $29.99, with 4.5 stars." This embeds better than raw structured data.

Python - Text Preprocessing
import re

def preprocess_for_embedding(text, title=None, category=None):
    """Clean text for embedding while preserving meaning."""
    # Replace HTML tags with a space so adjacent words do not fuse together
    text = re.sub(r'<[^>]+>', ' ', text)

    # Remove URLs (optional - depends on use case)
    text = re.sub(r'https?://\S+', '', text)

    # Normalize whitespace last, so gaps left by removed tags and URLs collapse
    text = re.sub(r'\s+', ' ', text).strip()

    # Prepend context if available
    prefix = ""
    if category:
        prefix += f"Category: {category}. "
    if title:
        prefix += f"Title: {title}. "

    return prefix + text

# Example
clean = preprocess_for_embedding(
    "  <p>Vector databases  store   embeddings</p>  ",
    title="Vector DB Guide",
    category="Technology"
)
# "Category: Technology. Title: Vector DB Guide. Vector databases store embeddings"
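
Step 4 above recommends turning structured records into natural language before embedding. A minimal sketch of that conversion follows, assuming flat key-value records. The `record_to_sentence` helper and the sample product are illustrative only; real data will usually need a hand-written template per record type.

```python
def record_to_sentence(record):
    """Render a flat dict as a short natural-language description."""
    parts = []
    for key, value in record.items():
        label = key.replace("_", " ")  # "product_name" -> "product name"
        parts.append(f"{label} is {value}")
    if not parts:
        return ""
    return "The " + ", the ".join(parts) + "."

product = {"product_name": "Widget X", "price": "$29.99", "rating": "4.5 stars"}
print(record_to_sentence(product))
# The product name is Widget X, the price is $29.99, the rating is 4.5 stars.
```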

Handling Long Documents

Every embedding model has a maximum context length (e.g., 8,191 tokens for OpenAI, 512 for many open-source models). For longer documents, you need a chunking strategy:

Strategy            | How It Works                                             | Best For
Fixed-size chunks   | Split by character/token count with overlap              | Simple, consistent chunk sizes
Semantic chunking   | Split at natural boundaries (paragraphs, sections)       | Preserving coherent ideas
Recursive splitting | Try paragraphs first, then sentences, then by size       | Balanced approach (LangChain default)
Sentence window     | Embed individual sentences; retrieve surrounding context | Fine-grained retrieval

Python - Chunking with Overlap
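def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks, measured in whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # Otherwise the window never advances and the loop runs forever
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # Last chunk reached; avoid a redundant overlap-only tail
        start = end - overlap  # Overlap for context continuity

    return chunks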

# Example: chunk a long document
document = "Your very long document text here..."
chunks = chunk_text(document, chunk_size=300, overlap=50)
print(f"Created {len(chunks)} chunks")

# Embed each chunk separately.
# embed() is a placeholder for your own embedding call (API client or local model).
embeddings = [embed(chunk) for chunk in chunks]
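Chunk size guidelines: For RAG applications, 200–500 tokens per chunk works well. Too small and you lose context; too large and the embedding becomes diluted. Always add overlap (10–20% of chunk size) to avoid cutting important information at boundaries. Note that the function above counts words, not tokens. English text averages somewhat more than one token per word, so size word-based chunks conservatively.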
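
The recursive splitting strategy from the table can be sketched in plain Python: try paragraph boundaries first, fall back to sentences, then to fixed word windows, and finally merge neighbouring pieces that fit together. This is a simplified stand-in for illustration, not LangChain's implementation. The function names, the word-based size limit, and the sentence regex are all assumptions of this sketch.

```python
import re

def merge_small(pieces, max_words):
    """Greedily join neighbouring pieces while staying within max_words."""
    chunks, current, count = [], [], 0
    for piece in pieces:
        n = len(piece.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(piece)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def recursive_split(text, max_words=300):
    """Split on paragraphs, then sentences, then fixed word counts."""
    pieces = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph.split()) <= max_words:
            pieces.append(paragraph)
            continue
        # Paragraph too long: fall back to sentence boundaries
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            words = sentence.split()
            if not words:
                continue
            if len(words) <= max_words:
                pieces.append(sentence)
                continue
            # Sentence still too long: fall back to fixed-size word windows
            for i in range(0, len(words), max_words):
                pieces.append(" ".join(words[i:i + max_words]))
    return merge_small(pieces, max_words)

doc = "First paragraph. It has two sentences.\n\nSecond paragraph here."
for chunk in recursive_split(doc, max_words=5):
    print(chunk)
```

The merge step keeps chunks from becoming needlessly small when several short paragraphs or sentences sit next to each other.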

Cost Optimization

  • Cache embeddings. Never embed the same text twice. Use a hash-based cache (text hash → embedding vector).
  • Batch API calls. Send multiple texts per API request. OpenAI supports up to 2,048 texts per batch.
  • Use smaller dimensions. Reduce OpenAI embeddings from 1536 to 512 to cut storage by 67%.
  • Choose the right model. text-embedding-3-small at $0.02/1M tokens is 6.5x cheaper than text-embedding-3-large at $0.13/1M tokens.
  • Use open-source for dev. Run sentence-transformers locally for development and testing; switch to API models for production if needed.
  • Precompute and store. Embed your corpus once and store the vectors. Only re-embed when the text changes or you switch models.
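
The caching and batching advice above can be combined in a small wrapper. The sketch below keys an in-memory cache on a SHA-256 hash of the text and sends only uncached texts to the embedder, in batches. `EmbeddingCache` and the `embed_batch` callable (list of strings in, list of vectors out) are hypothetical names; in practice you would back the cache with a persistent store and include the model name in the key.

```python
import hashlib

class EmbeddingCache:
    """In-memory cache keyed by a SHA-256 hash of the text."""

    def __init__(self, embed_batch, batch_size=2048):
        self.embed_batch = embed_batch  # hypothetical: list[str] -> list[vector]
        self.batch_size = batch_size
        self.store = {}

    @staticmethod
    def key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_many(self, texts):
        # Collect uncached texts, de-duplicated, in first-seen order
        missing = {}
        for text in texts:
            k = self.key(text)
            if k not in self.store and k not in missing:
                missing[k] = text
        keys = list(missing)
        # Embed only the missing texts, one batch at a time
        for i in range(0, len(keys), self.batch_size):
            batch_keys = keys[i:i + self.batch_size]
            vectors = self.embed_batch([missing[k] for k in batch_keys])
            for k, vec in zip(batch_keys, vectors):
                self.store[k] = vec
        return [self.store[self.key(t)] for t in texts]

# Fake embedder that records the size of each batch it receives
calls = []
def fake_embed_batch(texts):
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]

cache = EmbeddingCache(fake_embed_batch, batch_size=2)
cache.get_many(["a", "bb", "a", "ccc"])
cache.get_many(["a", "dddd"])
print(calls)  # [2, 1, 1]
```

The second call only embeds "dddd", because "a" is already cached.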

Monitoring Embedding Quality

  • Set up a test suite. Create 50–100 query-document pairs with known relevance. Run Recall@10 weekly to detect quality regressions.
  • Track similarity distributions. Monitor the distribution of similarity scores in production. A sudden shift may indicate a model issue or data drift.
  • Log search feedback. Track which search results users click. Low click-through on top results suggests poor embedding quality.
  • A/B test model changes. When considering a new model, run both side by side and compare user engagement metrics.
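
A test suite of the kind described above needs little more than a Recall@k function. The sketch below assumes each test case is a pair of a query and the set of document IDs known to be relevant, and that `search` is a hypothetical callable returning a ranked list of document IDs for a query.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(test_cases, search, k=10):
    """Average Recall@k over (query, relevant_ids) pairs."""
    scores = [recall_at_k(search(q), rel, k) for q, rel in test_cases]
    return sum(scores) / len(scores) if scores else 0.0

# Toy search function backed by a fixed ranking per query
rankings = {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d5"]}
test_cases = [("q1", {"d1", "d2"}), ("q2", {"d9"})]
print(mean_recall_at_k(test_cases, rankings.get, k=2))  # 0.25
```

Storing the score from each run makes a regression visible as a drop against the previous baseline.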

Version Management

  • Record the model name and version for every embedding in your database (as metadata).
  • Never mix embeddings from different models in the same index/collection.
  • Plan for re-embedding. When you upgrade to a new model, you must re-embed all data. Budget time and cost for this.
  • Keep raw text accessible. Always store the original text alongside or linked to the embedding so you can re-embed.
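
One way to follow these rules is to store the model name and dimension count on every record and check them before querying or upgrading. The record layout below is an assumption for illustration; the field names and the `needs_reembedding` helper do not come from any particular vector database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingRecord:
    doc_id: str
    text: str        # raw text kept so the record can be re-embedded later
    vector: tuple
    model: str       # e.g. "text-embedding-3-small"
    dimensions: int

def needs_reembedding(records, current_model, current_dims):
    """Return records produced by a different model or dimension count."""
    return [r for r in records
            if r.model != current_model or r.dimensions != current_dims]

# Toy two-dimensional vectors keep the example short
records = [
    EmbeddingRecord("a", "first doc", (0.1, 0.2), "text-embedding-3-small", 2),
    EmbeddingRecord("b", "second doc", (0.3, 0.4), "all-MiniLM-L6-v2", 2),
]
stale = needs_reembedding(records, "text-embedding-3-small", 2)
print([r.doc_id for r in stale])  # ['b']
```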

Common Mistakes

Avoid these common pitfalls:
  • Wrong model for the task: Using a general-purpose model for code search, or a monolingual model for multilingual content.
  • Wrong distance metric: Using Euclidean distance when the model was trained for cosine similarity. Check your model's documentation.
  • No normalization: Dot product is sensitive to vector length, so if your model does not return unit-length vectors, longer vectors score higher regardless of direction. Either normalize manually or use cosine similarity.
  • Truncation without awareness: Silently truncating text that exceeds the model's context window. You lose information without knowing it. Chunk intentionally instead.
  • Embedding structured data as-is: Feeding JSON or CSV directly to an embedding model. Convert to natural language first.
  • Ignoring asymmetric search: Some models distinguish queries from documents, either with text prefixes (E5 expects "query: " and "passage: ") or with an input-type parameter (Cohere, Voyage). Omitting it degrades quality significantly.
  • Not caching: Re-embedding unchanged text wastes money and adds latency.
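
The normalization pitfall is easy to reproduce with toy vectors. In the sketch below, a long vector pointing in a different direction beats an exact directional match under raw dot product, and the ranking corrects itself once the vectors are normalized. All vectors and helper names here are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 1.0]
doc_a = [1.0, 1.0]    # same direction as the query, small magnitude
doc_b = [10.0, 0.0]   # different direction, large magnitude

# Raw dot product rewards magnitude: doc_b wins despite pointing elsewhere
print(dot(query, doc_a), dot(query, doc_b))  # 2.0 10.0

# After normalization, dot product equals cosine similarity and doc_a wins
q, a, b = (l2_normalize(v) for v in (query, doc_a, doc_b))
print(dot(q, a) > dot(q, b))  # True
```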

Frequently Asked Questions

How many dimensions do I need?

For most text retrieval tasks, 768–1536 dimensions are sufficient. Higher dimensions give diminishing returns. If cost is a concern, try 512 dimensions with OpenAI's text-embedding-3-small, which retains ~95% of the quality at 1/3 the storage.

Should I embed whole documents or chunks?

Almost always chunks. A single embedding for a long document dilutes the meaning. Chunk into 200–500 token pieces so each embedding captures a specific topic or idea. This gives much better retrieval precision.

Can I compare embeddings from different models?

No. Embeddings from different models exist in completely different vector spaces. A query embedded with Model A cannot meaningfully be compared to documents embedded with Model B. Always use the same model for both query and document embeddings.

How do I handle multilingual content?

Use a multilingual embedding model like Cohere embed-multilingual-v3.0 or the multilingual variants of sentence-transformers. These models encode all languages into a shared vector space, enabling cross-lingual search.

How much does it cost to embed 1 million documents?

With OpenAI's text-embedding-3-small at $0.02 per million tokens, embedding 1 million documents of ~500 tokens each costs about $10. Storage costs depend on your vector database. For zero API cost, use open-source sentence-transformers locally.
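
The cost arithmetic above is simple enough to script for your own corpus. The sketch below reproduces the $10 figure and adds a raw storage estimate, assuming 4-byte float32 values and one vector per document. Index overhead and metadata are not included, and both function names are hypothetical.

```python
def embedding_cost_usd(num_docs, tokens_per_doc, price_per_million_tokens):
    """API cost for embedding a corpus once."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

def vector_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector size only; assumes float32 values by default."""
    return num_vectors * dimensions * bytes_per_value

cost = embedding_cost_usd(1_000_000, 500, 0.02)
print(f"${cost:.2f}")  # $10.00

full = vector_storage_bytes(1_000_000, 1536)
reduced = vector_storage_bytes(1_000_000, 512)
print(f"{full / 1e9:.2f} GB vs {reduced / 1e9:.2f} GB")  # 6.14 GB vs 2.05 GB
```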