Intermediate
Practical Applications of Embeddings
Build real-world AI features with embeddings — from semantic search and recommendation systems to duplicate detection and full RAG pipelines.
1. Semantic Search
Semantic search is the most common embedding application: it ranks documents by meaning rather than by keyword overlap.
Python - Complete Semantic Search
from openai import OpenAI
import numpy as np

client = OpenAI()

# Your document collection
documents = [
    "Python is widely used for data science and machine learning",
    "JavaScript frameworks like React power modern web applications",
    "PostgreSQL is a powerful open-source relational database",
    "Docker containers simplify application deployment",
    "Neural networks learn patterns from training data",
]

# Step 1: Embed all documents
doc_response = client.embeddings.create(
    input=documents,
    model="text-embedding-3-small"
)
doc_embeddings = np.array([d.embedding for d in doc_response.data])

# Step 2: Embed the search query
query = "How do I build AI models?"
query_response = client.embeddings.create(
    input=[query],
    model="text-embedding-3-small"
)
query_embedding = np.array(query_response.data[0].embedding)

# Step 3: Compute cosine similarity
similarities = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Step 4: Rank results
ranked_indices = np.argsort(similarities)[::-1]
print("Search results for:", query)
for i in ranked_indices:
    print(f" [{similarities[i]:.4f}] {documents[i]}")
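The ranking in Steps 3 and 4 is plain vector arithmetic, so it can be checked without an API key. The sketch below is a dependency-free illustration that uses made-up 3-dimensional vectors in place of real embeddings; `cosine_similarity` and `top_k` are hypothetical helper names, not part of the OpenAI SDK.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (assumed values, illustration only)
toy_docs = [
    [0.9, 0.1, 0.0],  # points mostly along the first axis
    [0.1, 0.9, 0.0],  # points mostly along the second axis
    [0.8, 0.2, 0.1],  # close to the first document
]
toy_query = [1.0, 0.0, 0.0]

results = top_k(toy_query, toy_docs, k=2)
for index, score in results:
    print(f"doc {index}: {score:.4f}")
```

For larger collections the same ranking is usually done with vectorized NumPy operations, as in the example above, or delegated to a vector database.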
2. Document Similarity and Clustering
Group similar documents automatically using embeddings and clustering algorithms.
Python - Document Clustering
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to train a neural network",
    "Deep learning frameworks comparison",
    "Best pasta recipes for beginners",
    "Italian cooking techniques",
    "Python machine learning tutorial",
    "Homemade pizza dough recipe",
]

# Embed documents
embeddings = model.encode(documents)

# Cluster into groups (n_init is set explicitly so behavior is the same across scikit-learn versions)
n_clusters = 2
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Display clusters
for cluster_id in range(n_clusters):
    print(f"\nCluster {cluster_id + 1}:")
    for doc, label in zip(documents, labels):
        if label == cluster_id:
            print(f" - {doc}")

# Expected grouping (cluster numbers are arbitrary and may be swapped):
# one cluster of AI/ML documents
# one cluster of cooking documents
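K-means returns only numeric labels, so it often helps to surface one representative document per cluster. A minimal sketch, assuming you already have vectors and labels like `embeddings` and `labels` above; toy 2-D vectors stand in for real embeddings here, and `representatives` is a hypothetical helper name.

```python
import math
from collections import defaultdict


def representatives(vectors, labels):
    """Map each cluster label to the index of the vector nearest its centroid."""
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)

    result = {}
    for label, indices in members.items():
        dims = len(vectors[indices[0]])
        # Centroid = per-dimension mean of the cluster's vectors
        centroid = [
            sum(vectors[i][d] for i in indices) / len(indices) for d in range(dims)
        ]
        result[label] = min(indices, key=lambda i: math.dist(vectors[i], centroid))
    return result


# Made-up vectors and labels, for illustration only
toy_vectors = [[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [5.0, 7.0]]
toy_labels = [0, 0, 0, 1, 1, 1]

reps = representatives(toy_vectors, toy_labels)
print(reps)
```

Printing `documents[reps[label]]` for each label then gives a quick, human-readable summary of what each cluster contains.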
3. Recommendation System
Python - Content-Based Recommendations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Product catalog
products = [
    {"name": "Python Crash Course", "desc": "Learn Python programming from scratch"},
    {"name": "Deep Learning with PyTorch", "desc": "Build neural networks with PyTorch"},
    {"name": "Fluent Python", "desc": "Advanced Python programming techniques"},
    {"name": "SQL Cookbook", "desc": "Practical SQL recipes for databases"},
    {"name": "Hands-On ML with Scikit-Learn", "desc": "Machine learning with Python"},
]

# Embed product descriptions
descriptions = [p["desc"] for p in products]
product_embeddings = model.encode(descriptions)


def recommend(product_index, top_k=3):
    """Find similar products to the given product."""
    query = product_embeddings[product_index]
    scores = util.cos_sim(query, product_embeddings)[0]
    # Exclude the product itself (-1 is the lowest possible cosine similarity)
    scores[product_index] = -1
    # Convert tensor indices to plain ints before indexing the Python list
    top_indices = scores.argsort(descending=True)[:top_k].tolist()
    print(f"If you liked '{products[product_index]['name']}', try:")
    for idx in top_indices:
        print(f" - {products[idx]['name']} (similarity: {scores[idx]:.3f})")


recommend(0)  # Recommend based on "Python Crash Course"
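`recommend` works from a single product. One common extension is to average the embeddings of several liked items into a profile vector and rank the rest of the catalog against it. The sketch below assumes that approach; it uses made-up 2-D vectors and hypothetical helper names (`mean_vector`, `recommend_for_profile`) so it runs without downloading a model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mean_vector(vectors):
    """Per-dimension average of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def recommend_for_profile(liked, catalog, top_k=2):
    """Rank catalog items not in `liked` by similarity to the mean of the liked items."""
    profile = mean_vector([catalog[i] for i in liked])
    candidates = [i for i in range(len(catalog)) if i not in liked]
    candidates.sort(key=lambda i: cosine(profile, catalog[i]), reverse=True)
    return candidates[:top_k]


# Toy vectors standing in for product embeddings (assumed values)
toy_catalog = [
    [1.0, 0.0],  # item 0
    [0.9, 0.3],  # item 1
    [0.2, 1.0],  # item 2, far from items 0 and 1
    [0.8, 0.5],  # item 3, close to items 0 and 1
]

picks = recommend_for_profile(liked=[0, 1], catalog=toy_catalog, top_k=2)
print(picks)
```

Averaging is the simplest way to combine preferences; it can blur a user with very different interests into a profile that matches none of them well.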
4. Duplicate Detection
Python - Near-Duplicate Detection
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to install Python on Windows",
    "Installing Python on a Windows machine",  # Near duplicate!
    "Getting started with machine learning",
    "Introduction to machine learning basics",  # Near duplicate!
    "Best restaurants in New York City",
]
embeddings = model.encode(documents)

# Find near-duplicates (similarity > threshold)
# The right threshold depends on the model and data; tune it on labeled examples
threshold = 0.85
similarity_matrix = util.cos_sim(embeddings, embeddings)

print("Near-duplicate pairs:")
# Compare each pair once (j > i) and skip self-comparisons
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        if similarity_matrix[i][j] > threshold:
            print(f" [{similarity_matrix[i][j]:.3f}] '{documents[i]}'")
            print(f" ≈ '{documents[j]}'")
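The loop above prints pairs, but deduplication usually needs groups: if A matches B and B matches C, all three belong together. A minimal sketch of that grouping step using union-find, assuming a precomputed similarity matrix; the matrix values are made up and `group_duplicates` is a hypothetical helper name.

```python
def group_duplicates(similarity, threshold):
    """Group item indices linked, directly or transitively, by similarity > threshold."""
    n = len(similarity)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path along the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Items with no near-duplicate form singleton groups; drop those
    return [members for members in groups.values() if len(members) > 1]


# Made-up similarity matrix for five documents
toy_similarity = [
    [1.00, 0.92, 0.10, 0.12, 0.05],
    [0.92, 1.00, 0.11, 0.09, 0.04],
    [0.10, 0.11, 1.00, 0.88, 0.07],
    [0.12, 0.09, 0.88, 1.00, 0.06],
    [0.05, 0.04, 0.07, 0.06, 1.00],
]

duplicate_groups = group_duplicates(toy_similarity, threshold=0.85)
print(duplicate_groups)
```

From each group you can then keep one document (for example the first index) and discard the rest.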
5. Classification with Embeddings
Python - Zero-Shot Classification
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Define categories with example descriptions
categories = {
    "technology": "Software, programming, computers, AI, and tech",
    "sports": "Athletics, games, competitions, teams, and fitness",
    "cooking": "Recipes, food preparation, ingredients, and cuisine",
}

# Embed category descriptions
cat_names = list(categories.keys())
cat_embeddings = model.encode(list(categories.values()))

# Classify new text
texts = [
    "How to build a REST API with Flask",
    "The championship game went into overtime",
    "Sauté the onions until golden brown",
]
text_embeddings = model.encode(texts)

# similarities[i][j] = similarity between text i and category j
similarities = util.cos_sim(text_embeddings, cat_embeddings)

for i, text in enumerate(texts):
    best_cat = cat_names[int(similarities[i].argmax())]
    score = similarities[i].max().item()
    # Add an ellipsis only when the text is actually truncated
    preview = text if len(text) <= 50 else text[:50] + "..."
    print(f"'{preview}' → {best_cat} ({score:.3f})")
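Taking the `argmax` always assigns a category, even when nothing fits well. One hedge is to fall back to an "unknown" label when the best score is below a cutoff. A minimal sketch with made-up scores; the `min_score` default of 0.25 and the helper name `classify_with_fallback` are assumptions to tune on your own data.

```python
def classify_with_fallback(scores, min_score=0.25, fallback="unknown"):
    """Pick the highest-scoring category, or `fallback` if no score reaches min_score."""
    best_category = max(scores, key=scores.get)
    if scores[best_category] < min_score:
        return fallback
    return best_category


# Made-up similarity scores for two inputs
confident = {"technology": 0.41, "sports": 0.08, "cooking": 0.05}
unsure = {"technology": 0.12, "sports": 0.10, "cooking": 0.09}

print(classify_with_fallback(confident))
print(classify_with_fallback(unsure))
```

In the example above, the `scores` dictionary could be built from one row of `similarities` with `dict(zip(cat_names, similarities[i].tolist()))`.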
6. Anomaly Detection
Python - Embedding-Based Anomaly Detection
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Support tickets: mostly login issues, plus one outlier
tickets = [
    "I cannot log into my account",
    "Password reset is not working",
    "My login credentials are invalid",
    "Account access denied error",
    "Forgot my password, need help",
    "URGENT: Server room is on fire!",  # Anomaly!
    "Authentication failed after update",
]
embeddings = model.encode(tickets)

# Compute mean embedding (centroid)
centroid = embeddings.mean(axis=0)

# Compute distance from centroid for each ticket
distances = [np.linalg.norm(emb - centroid) for emb in embeddings]

# Flag anomalies (distance > mean + 2 * std)
# With so few tickets the outlier itself raises the mean and std,
# so treat this threshold as a starting point rather than a fixed rule
mean_dist = np.mean(distances)
std_dist = np.std(distances)
threshold = mean_dist + 2 * std_dist

for ticket, dist in zip(tickets, distances):
    status = "ANOMALY" if dist > threshold else "normal"
    print(f"[{status:7s}] (dist: {dist:.3f}) {ticket}")
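With a small batch, a single extreme ticket inflates both the mean and the standard deviation, which can hide the very outlier you want to flag. A more robust variant, sketched here as an alternative rather than the method used above, scores each distance by its deviation from the median, scaled by the median absolute deviation (MAD). The 3.5 cutoff is a commonly used convention, the distances are made-up values, and `mad_outliers` is a hypothetical helper name.

```python
from statistics import median


def mad_outliers(values, cutoff=3.5):
    """Return indices whose modified z-score (median/MAD based) exceeds cutoff."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # MAD is zero when more than half the values are identical;
        # this sketch flags nothing in that case
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > cutoff
    ]


# Made-up centroid distances; index 5 plays the role of the unusual ticket
toy_distances = [0.50, 0.52, 0.48, 0.51, 0.49, 0.95, 0.53]

flagged = mad_outliers(toy_distances)
print(flagged)
```

Because the median and MAD barely move when one value is extreme, this rule tends to be more stable on small batches than mean plus two standard deviations.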
7. RAG: Connecting Embeddings to Vector Databases
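Retrieval-Augmented Generation (RAG) is arguably the most impactful application of embeddings: it retrieves relevant documents with embedding-based search, then passes them to an LLM so the generated answer is grounded in your own data.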
Python - Complete RAG Pipeline
from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.PersistentClient(path="./rag_db")

# Step 1: Index your documents
# No embedding function is passed, so Chroma embeds the text with its default model
collection = chroma.get_or_create_collection("knowledge_base")
documents = [
    "Our return policy allows returns within 30 days of purchase.",
    "Free shipping is available on orders over $50.",
    "Premium members get 20% off all purchases.",
    "Store hours are Monday-Friday 9am-9pm, Saturday 10am-6pm.",
]
# upsert (rather than add) keeps the script re-runnable against the persistent store
collection.upsert(
    ids=[f"doc_{i}" for i in range(len(documents))],
    documents=documents
)

# Step 2: User asks a question
question = "Can I return something I bought last week?"

# Step 3: Find relevant documents
results = collection.query(query_texts=[question], n_results=2)
context = "\n".join(results["documents"][0])

# Step 4: Generate answer using LLM + retrieved context
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": question}
    ]
)
print(f"Q: {question}")
print(f"A: {response.choices[0].message.content}")
RAG is the killer app for embeddings. It combines the knowledge stored in your documents with the reasoning ability of LLMs. Learn more in our RAG course and store your embeddings efficiently using vector databases.
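The system prompt in Step 4 is where retrieval and generation meet, so it is worth a little structure. The sketch below is one possible way to build it, not a prescribed format: it numbers the retrieved passages and tells the model to admit when the context lacks the answer. `build_messages` is a hypothetical helper and the prompt wording is an assumption.

```python
def build_messages(question, retrieved_docs):
    """Build chat messages that ground the answer in numbered context passages."""
    if retrieved_docs:
        context = "\n".join(
            f"[{n}] {doc}" for n, doc in enumerate(retrieved_docs, start=1)
        )
    else:
        context = "(no relevant documents found)"

    system_prompt = (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


messages = build_messages(
    "Can I return something I bought last week?",
    ["Our return policy allows returns within 30 days of purchase."],
)
print(messages[0]["content"])
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`, with `results["documents"][0]` from Step 3 as `retrieved_docs`.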
💡 Build Your Own
Pick one application above and modify it with your own data. For example, build a semantic search engine for your personal notes, or a recommendation system for your favorite books.
The best way to learn is to build something real. Start small, iterate, and expand.