Best Practices & Checklist
This final lesson consolidates everything into actionable checklists, common pitfalls to avoid, a debugging framework for recommendation quality issues, and answers to the questions recommendation system engineers ask most often.
Production Deployment Checklist
Use this checklist before launching or updating any recommendation system in production.
Data Pipeline
- Training data validation: Check for data freshness (is the latest training data less than 24 hours old?), label quality (are click labels joined correctly?), and feature coverage (what percentage of users/items have all features populated?).
- Feature drift monitoring: Set up alerts for when feature distributions shift significantly from training time. A mean shift of more than 2 standard deviations should trigger investigation.
- Data deduplication: Ensure the same user-item interaction is not counted multiple times (e.g., from retry logic, duplicate events).
- Negative sampling: Verify your negative sampling strategy. Random negatives are biased — consider using in-batch negatives or hard negatives for better training signal.
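The 2-standard-deviation drift rule above can be sketched as a minimal mean-shift check. The function name and signature are illustrative, not from any particular monitoring library; in production you would compare against distribution statistics snapshotted at training time rather than raw training values.

```python
import statistics

def check_feature_drift(training_values, serving_values, threshold_sigmas=2.0):
    """Flag a feature whose serving-time mean has shifted more than
    `threshold_sigmas` training-time standard deviations from the training mean."""
    train_mean = statistics.mean(training_values)
    train_std = statistics.stdev(training_values)
    serving_mean = statistics.mean(serving_values)
    if train_std == 0:
        # Constant feature at training time: any change at all is drift.
        return serving_mean != train_mean
    shift = abs(serving_mean - train_mean) / train_std
    return shift > threshold_sigmas
```

A check like this runs per feature on a schedule, and a `True` result feeds the alerting pipeline for investigation.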
Model Quality
- Offline metrics baseline: Record NDCG@10, Recall@50, and MAP@10 on a held-out test set. Any new model must beat this baseline.
- Segment-level evaluation: Evaluate separately for power users (50+ interactions), casual users (5–49), and cold-start users (<5). A model that improves average NDCG might degrade cold-start quality.
- Latency budget: Candidate generation <10ms, feature fetching <5ms, ranking <15ms, post-processing <5ms. Total end-to-end p99 <100ms.
- Fallback strategy: What happens when the model service is down? Pre-computed recommendations or popularity-based fallbacks must be ready.
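Segment-level evaluation from the checklist can be sketched by bucketing users on interaction count and averaging a per-user metric within each bucket. Helper names are illustrative; the segment boundaries follow the bullet above.

```python
def segment_of(interaction_count):
    """Map an interaction count to the segments used in evaluation."""
    if interaction_count >= 50:
        return "power"
    if interaction_count >= 5:
        return "casual"
    return "cold_start"

def metric_by_segment(per_user_metric, interaction_counts):
    """Average a per-user metric (e.g. NDCG@10) separately per segment."""
    totals, counts = {}, {}
    for user, value in per_user_metric.items():
        seg = segment_of(interaction_counts.get(user, 0))
        totals[seg] = totals.get(seg, 0.0) + value
        counts[seg] = counts.get(seg, 0) + 1
    return {seg: totals[seg] / counts[seg] for seg in totals}
```

Comparing these per-segment averages between the candidate model and the baseline is what surfaces a cold-start regression that the global average hides.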
Business Rules
- Out-of-stock filtering: Items with zero stock must never appear in recommendations.
- Already-purchased filtering: For non-consumable items, filter out items the user already owns.
- Content policy: Age-restricted, recalled, or policy-violating items must be filtered before serving.
- Category diversity: No single category should dominate more than 40% of recommendations.
- Sponsored content disclosure: Boosted or sponsored items must be clearly labeled in the UI.
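The filtering rules above (out-of-stock, already-purchased, and the 40% category cap) can be sketched as a single pass over the ranked list. Field names and the cap implementation are illustrative assumptions, not a prescribed rule-engine API.

```python
def apply_business_rules(ranked_items, owned_ids, slate_size=10, max_category_share=0.4):
    """Filter a ranked list: drop out-of-stock and already-owned items,
    and cap any single category at `max_category_share` of the slate."""
    max_per_category = int(slate_size * max_category_share)
    slate, category_counts = [], {}
    for item in ranked_items:
        if item["stock"] <= 0 or item["item_id"] in owned_ids:
            continue  # hard filters: never show these
        cat = item["category"]
        if category_counts.get(cat, 0) >= max_per_category:
            continue  # category already at its share cap
        slate.append(item)
        category_counts[cat] = category_counts.get(cat, 0) + 1
        if len(slate) == slate_size:
            break
    return slate
```

In practice the rule engine should over-fetch candidates (e.g. rank 50 to fill 10 slots) so that filtering never leaves the slate short.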
Monitoring
- Serving latency: p50, p95, p99 latency dashboards with alerts at 2x baseline.
- Recommendation quality: Daily CTR, conversion rate, and revenue per session tracked per experiment group.
- Coverage and diversity: Weekly report on catalog coverage and intra-list diversity.
- Error rates: Feature fetch failures, model inference errors, cache miss rates.
- Cold start metrics: Separate dashboards for new user and new item recommendation quality.
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Training on biased data | Model predicts what was shown, not what users actually want | Use position bias correction (IPW), counterfactual evaluation, or randomized exploration data for training |
| Popularity bias | Model only recommends popular items, long-tail items never surface | Add popularity-normalized scores, exploration via bandits, and track coverage metrics |
| Feature leakage | Offline NDCG is suspiciously high but online CTR does not improve | Ensure no future-looking features in training data (e.g., item popularity computed after the interaction timestamp) |
| Cold start neglect | New users see irrelevant recommendations and churn | Implement explicit cold start strategy with popularity fallbacks and onboarding |
| Optimizing wrong metric | CTR improves but revenue and retention decline | Use multi-objective optimization, track guardrail metrics, measure long-term impact |
| Stale models | Recommendations feel outdated, trending items missing | Retrain at least daily, add real-time session features, refresh candidate pools hourly |
| No fallback | Entire recommendation section disappears during outages | Pre-compute fallback recommendations, cache aggressively, design for graceful degradation |
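The IPW fix in the first row of the table can be sketched as follows: each logged click is up-weighted in the training loss by the inverse of the examination probability at its display position. The propensity table here is purely illustrative; real propensities must be estimated from randomized or intervention data.

```python
# Illustrative examination propensities per display position
# (estimate these from randomized exposure logs, not assumed).
PROPENSITIES = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

def ipw_weight(position, propensities, clip=10.0):
    """Inverse-propensity training weight for an example shown at `position`.
    Positions beyond the table fall back to the smallest known propensity;
    weights are clipped to limit gradient variance."""
    p = propensities.get(position, min(propensities.values()))
    return min(1.0 / p, clip)
```

Items shown low on the page, where users rarely look, thus count more when they are clicked, correcting for the fact that the logging policy controlled what was seen.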
Debugging Recommendation Quality
When recommendation quality degrades, follow this systematic debugging framework to identify the root cause.
```python
from collections import Counter
import time


class RecommendationDebugger:
    """Systematic debugger for recommendation quality issues.

    Follow this top-down debugging flow:
    1. Is the issue with candidate generation or ranking?
    2. Is it a data issue, model issue, or serving issue?
    3. Is it affecting all users or a specific segment?
    """

    def diagnose(self, user_id: str, request_context: dict) -> dict:
        report = {}

        # Step 1: Check candidate generation.
        candidates = self.candidate_generator.generate(user_id, request_context)
        report["candidate_count"] = len(candidates)
        report["candidate_sources"] = self._source_breakdown(candidates)

        # Are relevant items in the candidate set?
        known_good_items = self.get_known_good_items(user_id)  # set of item ids
        candidate_ids = {c["item_id"] for c in candidates}
        report["known_good_in_candidates"] = len(known_good_items & candidate_ids)
        report["known_good_total"] = len(known_good_items)
        if report["known_good_in_candidates"] == 0:
            report["diagnosis"] = "CANDIDATE_GENERATION_FAILURE"
            report["action"] = (
                "Relevant items not in candidate set. Check ANN index freshness, "
                "embedding quality, and candidate source diversity."
            )
            return report

        # Step 2: Check feature quality.
        features = self.feature_service.get_features(user_id, candidates[:10])
        report["feature_coverage"] = self._check_feature_coverage(features)
        report["feature_freshness"] = self._check_feature_freshness(features)
        if report["feature_coverage"] < 0.8:
            report["diagnosis"] = "FEATURE_QUALITY_ISSUE"
            report["action"] = (
                f"Only {report['feature_coverage'] * 100:.0f}% feature coverage. "
                "Check feature pipeline for missing data."
            )
            return report

        # Step 3: Check ranking scores.
        ranked = self.ranker.rank_with_debug(user_id, candidates, request_context)
        report["top_10_scores"] = [r["score"] for r in ranked[:10]]
        report["score_spread"] = max(report["top_10_scores"]) - min(report["top_10_scores"])
        if report["score_spread"] < 0.01:
            report["diagnosis"] = "RANKING_DIFFERENTIATION_FAILURE"
            report["action"] = (
                "Ranking scores are too similar. Model may be undertrained "
                "or features may not be discriminative."
            )
            return report

        # Step 4: Check business rules impact.
        pre_rules = [r["item_id"] for r in ranked[:20]]
        post_rules = self.business_rules.apply(ranked[:50], request_context)
        post_rules_ids = [r["item_id"] for r in post_rules[:20]]
        report["items_removed_by_rules"] = len(set(pre_rules) - set(post_rules_ids))
        if report["items_removed_by_rules"] > 10:
            report["diagnosis"] = "BUSINESS_RULES_TOO_AGGRESSIVE"
            report["action"] = (
                f"{report['items_removed_by_rules']} of top 20 removed by "
                "business rules. Review rule configuration."
            )
            return report

        # Step 5: Nothing in the pipeline looks broken.
        report["diagnosis"] = "NO_OBVIOUS_ISSUE"
        report["action"] = (
            "Pipeline looks healthy. Check if user preferences changed, or if "
            "the issue is subjective. Review A/B test metrics."
        )
        return report

    def _source_breakdown(self, candidates):
        return dict(Counter(c.get("source", "unknown") for c in candidates))

    def _check_feature_coverage(self, features):
        total_features = sum(len(f) for f in features.values())
        non_null = sum(1 for f in features.values() for v in f.values() if v is not None)
        return non_null / total_features if total_features > 0 else 0

    def _check_feature_freshness(self, features):
        stale_count = 0
        for f in features.values():
            last_updated = f.get("_last_updated", 0)
            if time.time() - last_updated > 86400:  # more than 1 day old
                stale_count += 1
        return {"stale_features": stale_count, "total": len(features)}
```
Frequently Asked Questions
**How much interaction data do I need to get started?**

For content-based recommendations, you can start with zero interaction data — you only need item features (title, category, description). For collaborative filtering, you need at least 10–20 interactions per active user and 50+ interactions per item for reasonable quality. This typically means 10K+ total interactions across 1K+ users and 500+ items. Below this threshold, use popularity-based and content-based approaches exclusively.
**Should I use explicit ratings or implicit feedback?**

Use implicit feedback. Explicit ratings are sparse (only 1–5% of users rate items) and biased (users tend to rate items they feel strongly about). Implicit signals are abundant and reflect actual behavior. Use a hierarchy of signal strength: purchase > add-to-cart > click > view. Weight stronger signals higher in your training data. The exception is when you have a product where ratings are the core interaction (like a movie rating app).
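The purchase > add-to-cart > click > view hierarchy can be sketched as a weight map applied when building training labels. The specific weight values are illustrative; tune them against your offline metrics.

```python
# Illustrative training weights for the implicit-signal hierarchy.
SIGNAL_WEIGHTS = {"purchase": 4.0, "add_to_cart": 3.0, "click": 2.0, "view": 1.0}

def label_weight(events):
    """Weight a user-item pair by its strongest observed implicit signal.
    Unknown event types contribute nothing; no events means weight 0."""
    return max((SIGNAL_WEIGHTS.get(e, 0.0) for e in events), default=0.0)
```

Taking the maximum (rather than the sum) keeps a heavily re-viewed item from outweighing a purchased one; summing with per-event caps is a reasonable alternative.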
**How often should I retrain my models?**

Retrain at least daily for most use cases. For fast-moving inventories (news, social media), train every few hours or use online learning. For stable catalogs (movies, books), daily retraining is sufficient. Monitor NDCG on a rolling test set — if it drops more than 2% between retrains, increase the retraining frequency. Also retrain immediately after significant events (sales, new product launches, seasonal changes).
**What embedding dimension should I use?**

Start with 64 or 128 dimensions. Below 64, you lose too much expressiveness. Above 256, the gains are minimal but storage and compute costs increase significantly. The optimal dimension depends on your catalog complexity: 64 works well for under 100K items, 128 for 100K–10M items, and 256 for 10M+ items. Always experiment: train models with different dimensions and compare recall@k on your validation set.
**How do I handle a marketplace with many sellers?**

Add seller-level features and fairness constraints. Use seller diversity as a business rule (no more than 3 items from the same seller in top 10). Include seller quality signals (rating, fulfillment speed, return rate) as ranking features. For seller fairness, implement exposure-based fairness where each seller gets impressions proportional to their catalog quality. This prevents a winner-take-all dynamic and keeps the marketplace healthy for sellers.
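The per-seller cap (at most 3 items from one seller in the top 10) can be sketched as a re-ranking pass. Field names are illustrative assumptions about the item dicts.

```python
def cap_per_seller(ranked_items, top_n=10, max_per_seller=3):
    """Re-rank so no seller holds more than `max_per_seller` of the top slots.
    Items displaced by the cap are kept, in order, to backfill short slates."""
    slate, per_seller, overflow = [], {}, []
    for item in ranked_items:
        seller = item["seller_id"]
        if per_seller.get(seller, 0) < max_per_seller:
            slate.append(item)
            per_seller[seller] = per_seller.get(seller, 0) + 1
        else:
            overflow.append(item)  # over the cap: deferred, not dropped
        if len(slate) == top_n:
            break
    return (slate + overflow)[:top_n]
```

Backfilling from `overflow` matters: with a small candidate pool, silently dropping capped items could leave the slate with fewer than `top_n` results.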
**Where do LLMs fit into recommendation systems?**

LLMs can enhance recommendations in specific ways but are not a replacement for the full pipeline. Use LLMs for: generating explanations ("We recommend this because..."), understanding natural language queries ("find me a gift for my tech-savvy dad"), and cold-start item understanding (extracting features from product descriptions). Do not use LLMs for: scoring millions of candidates (too slow and expensive), or replacing collaborative filtering signals (LLMs do not have access to your user interaction data). The sweet spot is using LLMs in the post-processing layer for explanation generation and in the content-based pipeline for feature extraction.
**What is the biggest mistake recommendation engineers make?**

Spending too much time on model architecture and too little on data quality, feature engineering, and evaluation infrastructure. The difference between a good and great recommendation system is rarely the model — it is the quality of training data, the richness of features (especially real-time session features), and the ability to measure impact through robust A/B testing. Start with a simple model (logistic regression or a small neural network), invest heavily in features and evaluation, then iterate on the model architecture once you have a solid foundation.
**How do I convince stakeholders to invest in diversity and fairness?**

Frame it in business terms: (1) Diversity increases long-term retention — Spotify and Netflix both found that diverse recommendations reduce churn. (2) Fairness in marketplaces keeps sellers on the platform — without fair exposure, sellers leave and your catalog shrinks. (3) Regulatory risk is real — the EU Digital Services Act requires transparency in recommendation systems. Run an A/B test comparing diverse vs. non-diverse recommendations and track 30-day retention, not just CTR. The data will make the case for you.
Course Summary
You now have the knowledge to design, build, and operate a production-grade recommendation system. Here is the roadmap from zero to production:
| Week | Focus | Deliverable |
|---|---|---|
| 1 | Data + Baseline | Interaction data pipeline, popularity-based baseline, offline evaluation framework |
| 2 | Candidate Generation | Item-to-item CF + FAISS index deployed, recall@50 measured |
| 3 | Ranking Model | Pointwise ranker with 30+ features, NDCG@10 beating baseline |
| 4 | Real-Time Features | Session features in Redis, cold-start strategy, personalization API |
| 5 | Business Rules + Diversity | MMR diversity, business rule engine, out-of-stock filtering |
| 6 | Evaluation + Launch | A/B test framework, monitoring dashboards, production deployment |
Lilly Tech Systems