Best Practices & Checklist
This final lesson consolidates everything into actionable checklists, common pitfalls to avoid, a debugging framework for recommendation quality issues, and answers to the questions recommendation system engineers ask most often.
Production Deployment Checklist
Use this checklist before launching or updating any recommendation system in production.
Data Pipeline
- Training data validation: Check for data freshness (is the latest training data less than 24 hours old?), label quality (are click labels joined correctly?), and feature coverage (what percentage of users/items have all features populated?).
- Feature drift monitoring: Set up alerts for when feature distributions shift significantly from training time. A mean shift of more than 2 standard deviations should trigger investigation.
- Data deduplication: Ensure the same user-item interaction is not counted multiple times (e.g., from retry logic, duplicate events).
- Negative sampling: Verify your negative sampling strategy. Random negatives are biased — consider using in-batch negatives or hard negatives for better training signal.
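The 2-standard-deviation drift rule above can be sketched as a minimal mean-shift check. The function name and signature are illustrative, not from any particular monitoring library; in production you would compare against distribution statistics snapshotted at training time rather than raw training values.

```python
import statistics

def check_feature_drift(training_values, serving_values, threshold_sigmas=2.0):
    """Flag a feature whose serving-time mean has shifted more than
    `threshold_sigmas` training-time standard deviations from the training mean."""
    train_mean = statistics.mean(training_values)
    train_std = statistics.stdev(training_values)
    serving_mean = statistics.mean(serving_values)
    if train_std == 0:
        # Constant feature at training time: any change at all is drift.
        return serving_mean != train_mean
    shift = abs(serving_mean - train_mean) / train_std
    return shift > threshold_sigmas
```

A check like this runs per feature on a schedule, and a `True` result feeds the alerting pipeline for investigation.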
Model Quality
- Offline metrics baseline: Record NDCG@10, Recall@50, and MAP@10 on a held-out test set. Any new model must beat this baseline.
- Segment-level evaluation: Evaluate separately for power users (50+ interactions), casual users (5–49), and cold-start users (<5). A model that improves average NDCG might degrade cold-start quality.
- Latency budget: Candidate generation <10ms, feature fetching <5ms, ranking <15ms, post-processing <5ms. Total end-to-end p99 <100ms.
- Fallback strategy: What happens when the model service is down? Pre-computed recommendations or popularity-based fallbacks must be ready.
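Segment-level evaluation from the checklist can be sketched by bucketing users on interaction count and averaging a per-user metric within each bucket. Helper names are illustrative; the segment boundaries follow the bullet above.

```python
def segment_of(interaction_count):
    """Map an interaction count to the segments used in evaluation."""
    if interaction_count >= 50:
        return "power"
    if interaction_count >= 5:
        return "casual"
    return "cold_start"

def metric_by_segment(per_user_metric, interaction_counts):
    """Average a per-user metric (e.g. NDCG@10) separately per segment."""
    totals, counts = {}, {}
    for user, value in per_user_metric.items():
        seg = segment_of(interaction_counts.get(user, 0))
        totals[seg] = totals.get(seg, 0.0) + value
        counts[seg] = counts.get(seg, 0) + 1
    return {seg: totals[seg] / counts[seg] for seg in totals}
```

Comparing these per-segment averages between the candidate model and the baseline is what surfaces a cold-start regression that the global average hides.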
Business Rules
- Out-of-stock filtering: Items with zero stock must never appear in recommendations.
- Already-purchased filtering: For non-consumable items, filter out items the user already owns.
- Content policy: Age-restricted, recalled, or policy-violating items must be filtered before serving.
- Category diversity: No single category should dominate more than 40% of recommendations.
- Sponsored content disclosure: Boosted or sponsored items must be clearly labeled in the UI.
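The filtering rules above (out-of-stock, already-purchased, and the 40% category cap) can be sketched as a single pass over the ranked list. Field names and the cap implementation are illustrative assumptions, not a prescribed rule-engine API.

```python
def apply_business_rules(ranked_items, owned_ids, slate_size=10, max_category_share=0.4):
    """Filter a ranked list: drop out-of-stock and already-owned items,
    and cap any single category at `max_category_share` of the slate."""
    max_per_category = int(slate_size * max_category_share)
    slate, category_counts = [], {}
    for item in ranked_items:
        if item["stock"] <= 0 or item["item_id"] in owned_ids:
            continue  # hard filters: never show these
        cat = item["category"]
        if category_counts.get(cat, 0) >= max_per_category:
            continue  # category already at its share cap
        slate.append(item)
        category_counts[cat] = category_counts.get(cat, 0) + 1
        if len(slate) == slate_size:
            break
    return slate
```

In practice the rule engine should over-fetch candidates (e.g. rank 50 to fill 10 slots) so that filtering never leaves the slate short.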
Monitoring
- Serving latency: p50, p95, p99 latency dashboards with alerts at 2x baseline.
- Recommendation quality: Daily CTR, conversion rate, and revenue per session tracked per experiment group.
- Coverage and diversity: Weekly report on catalog coverage and intra-list diversity.
- Error rates: Feature fetch failures, model inference errors, cache miss rates.
- Cold start metrics: Separate dashboards for new user and new item recommendation quality.
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Training on biased data | Model predicts what was shown, not what users actually want | Use position bias correction (IPW), counterfactual evaluation, or randomized exploration data for training |
| Popularity bias | Model only recommends popular items, long-tail items never surface | Add popularity-normalized scores, exploration via bandits, and track coverage metrics |
| Feature leakage | Offline NDCG is suspiciously high but online CTR does not improve | Ensure no future-looking features in training data (e.g., item popularity computed after the interaction timestamp) |
| Cold start neglect | New users see irrelevant recommendations and churn | Implement explicit cold start strategy with popularity fallbacks and onboarding |
| Optimizing wrong metric | CTR improves but revenue and retention decline | Use multi-objective optimization, track guardrail metrics, measure long-term impact |
| Stale models | Recommendations feel outdated, trending items missing | Retrain at least daily, add real-time session features, refresh candidate pools hourly |
| No fallback | Entire recommendation section disappears during outages | Pre-compute fallback recommendations, cache aggressively, design for graceful degradation |
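The IPW fix in the first row of the table can be sketched as follows: each logged click is up-weighted in the training loss by the inverse of the examination probability at its display position. The propensity table here is purely illustrative; real propensities must be estimated from randomized or intervention data.

```python
# Illustrative examination propensities per display position
# (estimate these from randomized exposure logs, not assumed).
PROPENSITIES = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

def ipw_weight(position, propensities, clip=10.0):
    """Inverse-propensity training weight for an example shown at `position`.
    Positions beyond the table fall back to the smallest known propensity;
    weights are clipped to limit gradient variance."""
    p = propensities.get(position, min(propensities.values()))
    return min(1.0 / p, clip)
```

Items shown low on the page, where users rarely look, thus count more when they are clicked, correcting for the fact that the logging policy controlled what was seen.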
Debugging Recommendation Quality
When recommendation quality degrades, follow this systematic debugging framework to identify the root cause.
```python
from collections import Counter
import time


class RecommendationDebugger:
    """Systematic debugger for recommendation quality issues.

    Follow this top-down debugging flow:
    1. Is the issue with candidate generation or ranking?
    2. Is it a data issue, model issue, or serving issue?
    3. Is it affecting all users or a specific segment?
    """

    def diagnose(self, user_id: str, request_context: dict) -> dict:
        report = {}

        # Step 1: Check candidate generation.
        candidates = self.candidate_generator.generate(user_id, request_context)
        report["candidate_count"] = len(candidates)
        report["candidate_sources"] = self._source_breakdown(candidates)

        # Are relevant items in the candidate set?
        known_good_items = self.get_known_good_items(user_id)  # set of item ids
        candidate_ids = {c["item_id"] for c in candidates}
        report["known_good_in_candidates"] = len(known_good_items & candidate_ids)
        report["known_good_total"] = len(known_good_items)
        if report["known_good_in_candidates"] == 0:
            report["diagnosis"] = "CANDIDATE_GENERATION_FAILURE"
            report["action"] = (
                "Relevant items not in candidate set. Check ANN index freshness, "
                "embedding quality, and candidate source diversity."
            )
            return report

        # Step 2: Check feature quality.
        features = self.feature_service.get_features(user_id, candidates[:10])
        report["feature_coverage"] = self._check_feature_coverage(features)
        report["feature_freshness"] = self._check_feature_freshness(features)
        if report["feature_coverage"] < 0.8:
            report["diagnosis"] = "FEATURE_QUALITY_ISSUE"
            report["action"] = (
                f"Only {report['feature_coverage'] * 100:.0f}% feature coverage. "
                "Check feature pipeline for missing data."
            )
            return report

        # Step 3: Check ranking scores.
        ranked = self.ranker.rank_with_debug(user_id, candidates, request_context)
        report["top_10_scores"] = [r["score"] for r in ranked[:10]]
        report["score_spread"] = max(report["top_10_scores"]) - min(report["top_10_scores"])
        if report["score_spread"] < 0.01:
            report["diagnosis"] = "RANKING_DIFFERENTIATION_FAILURE"
            report["action"] = (
                "Ranking scores are too similar. Model may be undertrained "
                "or features may not be discriminative."
            )
            return report

        # Step 4: Check business rules impact.
        pre_rules = [r["item_id"] for r in ranked[:20]]
        post_rules = self.business_rules.apply(ranked[:50], request_context)
        post_rules_ids = [r["item_id"] for r in post_rules[:20]]
        report["items_removed_by_rules"] = len(set(pre_rules) - set(post_rules_ids))
        if report["items_removed_by_rules"] > 10:
            report["diagnosis"] = "BUSINESS_RULES_TOO_AGGRESSIVE"
            report["action"] = (
                f"{report['items_removed_by_rules']} of top 20 removed by "
                "business rules. Review rule configuration."
            )
            return report

        # Step 5: Nothing in the pipeline looks broken.
        report["diagnosis"] = "NO_OBVIOUS_ISSUE"
        report["action"] = (
            "Pipeline looks healthy. Check if user preferences changed, or if "
            "the issue is subjective. Review A/B test metrics."
        )
        return report

    def _source_breakdown(self, candidates):
        return dict(Counter(c.get("source", "unknown") for c in candidates))

    def _check_feature_coverage(self, features):
        total_features = sum(len(f) for f in features.values())
        non_null = sum(1 for f in features.values() for v in f.values() if v is not None)
        return non_null / total_features if total_features > 0 else 0

    def _check_feature_freshness(self, features):
        stale_count = 0
        for f in features.values():
            last_updated = f.get("_last_updated", 0)
            if time.time() - last_updated > 86400:  # more than 1 day old
                stale_count += 1
        return {"stale_features": stale_count, "total": len(features)}
```
Frequently Asked Questions
**How much interaction data do I need to get started?**

For content-based recommendations, you can start with zero interaction data — you only need item features (title, category, description). For collaborative filtering, you need at least 10–20 interactions per active user and 50+ interactions per item for reasonable quality. This typically means 10K+ total interactions across 1K+ users and 500+ items. Below this threshold, use popularity-based and content-based approaches exclusively.
**Should I use explicit ratings or implicit feedback?**

Use implicit feedback. Explicit ratings are sparse (only 1–5% of users rate items) and biased (users tend to rate items they feel strongly about). Implicit signals are abundant and reflect actual behavior. Use a hierarchy of signal strength: purchase > add-to-cart > click > view. Weight stronger signals higher in your training data. The exception is when you have a product where ratings are the core interaction (like a movie rating app).
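The purchase > add-to-cart > click > view hierarchy can be sketched as a weight map applied when building training labels. The specific weight values are illustrative; tune them against your offline metrics.

```python
# Illustrative training weights for the implicit-signal hierarchy.
SIGNAL_WEIGHTS = {"purchase": 4.0, "add_to_cart": 3.0, "click": 2.0, "view": 1.0}

def label_weight(events):
    """Weight a user-item pair by its strongest observed implicit signal.
    Unknown event types contribute nothing; no events means weight 0."""
    return max((SIGNAL_WEIGHTS.get(e, 0.0) for e in events), default=0.0)
```

Taking the maximum (rather than the sum) keeps a heavily re-viewed item from outweighing a purchased one; summing with per-event caps is a reasonable alternative.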
**How often should I retrain my models?**

Retrain at least daily for most use cases. For fast-moving inventories (news, social media), train every few hours or use online learning. For stable catalogs (movies, books), daily retraining is sufficient. Monitor NDCG on a rolling test set — if it drops more than 2% between retrains, increase the retraining frequency. Also retrain immediately after significant events (sales, new product launches, seasonal changes).
**What embedding dimension should I use?**

Start with 64 or 128 dimensions. Below 64, you lose too much expressiveness. Above 256, the gains are minimal but storage and compute costs increase significantly. The optimal dimension depends on your catalog complexity: 64 works well for under 100K items, 128 for 100K–10M items, and 256 for 10M+ items. Always experiment: train models with different dimensions and compare recall@k on your validation set.
**How do I handle a marketplace with many sellers?**

Add seller-level features and fairness constraints. Use seller diversity as a business rule (no more than 3 items from the same seller in top 10). Include seller quality signals (rating, fulfillment speed, return rate) as ranking features. For seller fairness, implement exposure-based fairness where each seller gets impressions proportional to their catalog quality. This prevents a winner-take-all dynamic and keeps the marketplace healthy for sellers.
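The per-seller cap (at most 3 items from one seller in the top 10) can be sketched as a re-ranking pass. Field names are illustrative assumptions about the item dicts.

```python
def cap_per_seller(ranked_items, top_n=10, max_per_seller=3):
    """Re-rank so no seller holds more than `max_per_seller` of the top slots.
    Items displaced by the cap are kept, in order, to backfill short slates."""
    slate, per_seller, overflow = [], {}, []
    for item in ranked_items:
        seller = item["seller_id"]
        if per_seller.get(seller, 0) < max_per_seller:
            slate.append(item)
            per_seller[seller] = per_seller.get(seller, 0) + 1
        else:
            overflow.append(item)  # over the cap: deferred, not dropped
        if len(slate) == top_n:
            break
    return (slate + overflow)[:top_n]
```

Backfilling from `overflow` matters: with a small candidate pool, silently dropping capped items could leave the slate with fewer than `top_n` results.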
**Where do LLMs fit into recommendation systems?**

LLMs can enhance recommendations in specific ways but are not a replacement for the full pipeline. Use LLMs for: generating explanations ("We recommend this because..."), understanding natural language queries ("find me a gift for my tech-savvy dad"), and cold-start item understanding (extracting features from product descriptions). Do not use LLMs for: scoring millions of candidates (too slow and expensive), or replacing collaborative filtering signals (LLMs do not have access to your user interaction data). The sweet spot is using LLMs in the post-processing layer for explanation generation and in the content-based pipeline for feature extraction.
**What is the biggest mistake recommendation engineers make?**

Spending too much time on model architecture and too little on data quality, feature engineering, and evaluation infrastructure. The difference between a good and great recommendation system is rarely the model — it is the quality of training data, the richness of features (especially real-time session features), and the ability to measure impact through robust A/B testing. Start with a simple model (logistic regression or a small neural network), invest heavily in features and evaluation, then iterate on the model architecture once you have a solid foundation.
**How do I convince stakeholders to invest in diversity and fairness?**

Frame it in business terms: (1) Diversity increases long-term retention — Spotify and Netflix both found that diverse recommendations reduce churn. (2) Fairness in marketplaces keeps sellers on the platform — without fair exposure, sellers leave and your catalog shrinks. (3) Regulatory risk is real — the EU Digital Services Act requires transparency in recommendation systems. Run an A/B test comparing diverse vs. non-diverse recommendations and track 30-day retention, not just CTR. The data will make the case for you.
Course Summary
You now have the knowledge to design, build, and operate a production-grade recommendation system. Here is the roadmap from zero to production:
| Week | Focus | Deliverable |
|---|---|---|
| 1 | Data + Baseline | Interaction data pipeline, popularity-based baseline, offline evaluation framework |
| 2 | Candidate Generation | Item-to-item CF + FAISS index deployed, recall@50 measured |
| 3 | Ranking Model | Pointwise ranker with 30+ features, NDCG@10 beating baseline |
| 4 | Real-Time Features | Session features in Redis, cold-start strategy, personalization API |
| 5 | Business Rules + Diversity | MMR diversity, business rule engine, out-of-stock filtering |
| 6 | Evaluation + Launch | A/B test framework, monitoring dashboards, production deployment |
Lilly Tech Systems