Advanced
Practical ML Questions
15 interview questions and model answers on real-world ML engineering challenges: feature engineering, missing data, data leakage, A/B testing, and production ML.
Q1: What is feature engineering and why is it considered so important?
Model Answer: Feature engineering is the process of creating, transforming, and selecting input features to improve model performance. It is considered the most impactful step in applied ML because even the best algorithm cannot compensate for poor features. Common techniques include: (1) Domain-specific features — ratios, aggregations, time-since-event. (2) Encoding categorical variables — one-hot, target encoding, frequency encoding. (3) Numerical transformations — log, square root, Box-Cox for skewed distributions. (4) Interaction features — products or ratios of existing features. (5) Time features — day of week, month, lag features, rolling averages. (6) Text features — TF-IDF, word embeddings, n-grams. (7) Binning continuous variables when the relationship is step-wise. The best feature engineers combine domain knowledge with systematic exploration. In competitions, feature engineering often matters more than model choice.
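As a quick illustration of techniques (1), (3), and (5) above, here is a minimal sketch using synthetic data and hypothetical column names:

```python
import pandas as pd
import numpy as np

# Synthetic transactions table (hypothetical columns, for illustration only).
df = pd.DataFrame({
    "amount": [120.0, 35.5, 980.0, 15.0],
    "n_items": [3, 1, 12, 1],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-03-20", "2022-11-11", "2023-06-01"]),
    "txn_date": pd.to_datetime(["2023-07-01", "2023-07-02", "2023-07-01", "2023-07-03"]),
})

# Ratio feature: average price per item.
df["amount_per_item"] = df["amount"] / df["n_items"]
# Log transform for a right-skewed monetary amount.
df["log_amount"] = np.log1p(df["amount"])
# Time features: time-since-event and day of week.
df["days_since_signup"] = (df["txn_date"] - df["signup_date"]).dt.days
df["txn_dow"] = df["txn_date"].dt.dayofweek
```

Each new column is derived only from information available at prediction time, which matters for the leakage discussion in Q3.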
Q2: How do you handle missing data?
Model Answer: First, understand why data is missing: (1) MCAR (Missing Completely at Random) — missingness is unrelated to any variable. Safe to drop or impute. (2) MAR (Missing at Random) — missingness depends on observed variables. Can be modeled. (3) MNAR (Missing Not at Random) — missingness depends on the missing value itself (e.g., high-income people not reporting income). Hardest to handle. Strategies: (1) Deletion — drop rows (if few) or columns (if mostly missing). Loses information. (2) Simple imputation — mean, median (numerical), mode (categorical). Fast but ignores relationships. (3) Model-based imputation — KNN imputer, iterative imputer (MICE), using other features to predict the missing value. More accurate but risk of data leakage. (4) Indicator variable — add a binary "is_missing" feature alongside imputation. The missingness itself can be informative. (5) Algorithm-native handling — tree-based models (XGBoost, LightGBM) can handle missing values natively. Always fit imputers on training data only and apply to test data to prevent leakage.
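Strategies (2) and (4) combine neatly in scikit-learn: `SimpleImputer` with `add_indicator=True` imputes and appends "is_missing" flags, and fitting on the training split only prevents leakage. A minimal sketch with toy arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test = np.array([[np.nan, 40.0]])

# Median imputation plus a binary indicator column for every feature
# that had missing values during fitting.
imp = SimpleImputer(strategy="median", add_indicator=True)
X_train_t = imp.fit_transform(X_train)  # statistics learned from train only
X_test_t = imp.transform(X_test)        # reuse training medians: no leakage
```

The test row's NaN is filled with the training median (2.0), and the appended indicator columns preserve the missingness signal.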
Q3: What is data leakage and give a concrete example.
Model Answer: Data leakage is when information from outside the training set contaminates the model, creating overly optimistic performance that does not hold in production. Concrete example: Predicting hospital readmission. The feature "discharge_summary_mentions_follow_up" leaks because discharge summaries are written when doctors already know the patient is likely to return — this feature would not be available at the time you would want to make the prediction. Another example: Normalizing the entire dataset before splitting. If you compute the mean and standard deviation from all data (including test), the test set statistics influence the training data transformation. Time series example: Using tomorrow's weather to predict today's sales. How to prevent: (1) always ask "would this feature be available at prediction time?", (2) split data before any preprocessing, (3) use scikit-learn Pipelines to encapsulate preprocessing within cross-validation folds, (4) be especially careful with time-based features.
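Prevention point (3) is worth showing concretely: wrapping the scaler in a scikit-learn `Pipeline` means each CV fold refits preprocessing on its own training split, so validation data never influences the transformation. A sketch with synthetic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# The scaler is refit inside every fold's training split; the held-out
# fold never leaks into the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```

Calling `scaler.fit(X)` on the full dataset before `cross_val_score` would reproduce exactly the normalization leak described above.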
Q4: How do you approach A/B testing for ML models?
Model Answer: A/B testing for ML models compares the real-world impact of a new model against the current one (or no model). Key considerations: (1) Define the success metric — this should be a business metric (revenue, engagement, conversion), not just model accuracy. Offline metrics and online metrics often diverge. (2) Sample size and duration — use power analysis to determine how many users and how long you need to detect a meaningful difference. Account for novelty effects and day-of-week patterns. (3) Randomization unit — typically user-level, not request-level, to avoid inconsistent experiences. (4) Guard rails — monitor for regressions in key metrics (latency, error rates, engagement) and have automatic rollback mechanisms. (5) Statistical rigor — use proper hypothesis testing, correct for multiple comparisons, and avoid peeking at results too early (which inflates false positive rates). (6) Shadow mode — run the new model in parallel without serving its predictions to detect issues before full A/B testing.
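The power analysis in point (2) can be sketched with the standard two-proportion approximation (a simplified formula; production experimentation platforms handle variance reduction and sequential testing on top of this):

```python
from scipy.stats import norm

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Detecting a lift from 10% to 11% conversion at alpha=0.05, power=0.8
# requires roughly 15,000 users per arm.
n_per_arm = sample_size_two_proportions(0.10, 0.11)
```

The key intuition: sample size grows with the inverse square of the effect size, which is why small-lift experiments need long durations.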
Q5: What challenges arise when deploying ML models to production?
Model Answer: Production ML challenges beyond model development: (1) Model serving latency — models must respond within SLA (often <100ms). May require model compression, quantization, or distillation. (2) Data pipeline reliability — feature computation must match training exactly; training-serving skew is a top cause of production failures. (3) Model monitoring — track prediction distributions, feature distributions, and performance metrics over time. Detect data drift (input distribution changes) and concept drift (relationship between features and target changes). (4) Model retraining — when to retrain, how to validate before promotion, and how to handle A/B transitions. (5) Reproducibility — version control for data, code, models, and configurations. (6) Scalability — handling traffic spikes, batch vs real-time inference. (7) Fallback strategies — what happens when the model fails or times out? Always have a graceful degradation plan.
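The fallback strategy in point (7) is often just a thin serving wrapper. A hypothetical sketch (the class and its fallback policy are illustrative, not a standard API):

```python
class ServingWrapper:
    """Hypothetical sketch: serve predictions with graceful degradation."""

    def __init__(self, model_fn, fallback_value):
        self.model_fn = model_fn        # callable: features -> prediction
        self.fallback = fallback_value  # e.g. global average, most popular item
        self.fallback_count = 0         # monitored; a spike triggers an alert

    def predict(self, features):
        try:
            return self.model_fn(features)
        except Exception:
            # Never fail the request: return a safe default and count it.
            self.fallback_count += 1
            return self.fallback

# A model that errors out still produces a usable response.
wrapper = ServingWrapper(model_fn=lambda f: 1 / 0, fallback_value=0.5)
prediction = wrapper.predict({"user_id": 42})
```

In a real system the wrapper would also enforce a timeout and emit metrics, but the principle is the same: the caller always gets an answer.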
Q6: What is data drift and how do you detect it?
Model Answer: Data drift occurs when the statistical distribution of production data differs from the distribution the model was trained on. Types: (1) Feature drift (covariate shift) — input distributions change (e.g., user demographics shift). (2) Concept drift — the relationship between inputs and outputs changes (e.g., buying patterns change during a pandemic). (3) Label drift — the target distribution changes. Detection methods: (1) Statistical tests — KS test, chi-squared test, PSI (Population Stability Index) comparing training vs production distributions. (2) Monitoring dashboards — track feature statistics (mean, variance, percentiles) and prediction distributions over time. (3) Performance monitoring — track model accuracy on labeled production data when available. (4) Adversarial validation — train a classifier to distinguish training from production data; high accuracy indicates drift. Response: retrain on recent data, update features, or trigger alerts for human review.
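PSI, mentioned in detection method (1), is simple enough to implement directly. A sketch using the common rule of thumb (<0.1 stable, 0.1-0.25 moderate shift, >0.25 major shift):

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference and a production sample."""
    # Bin edges from the reference distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    # Clip to avoid log(0) for empty bins.
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)   # reference (training) sample
same = rng.normal(0.0, 1.0, 10_000)    # production sample, no drift
shifted = rng.normal(0.5, 1.0, 10_000) # production sample, mean shifted
```

On the synthetic data, `psi(train, same)` stays near zero while `psi(train, shifted)` crosses the alerting threshold.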
Q7: How do you handle categorical features with high cardinality?
Model Answer: High-cardinality categoricals (e.g., zip codes, user IDs, product SKUs) make one-hot encoding impractical — it produces thousands of sparse dimensions. Approaches: (1) Target encoding (mean encoding) — replace each category with the mean of the target for that category. Risk of overfitting; use with regularization or within cross-validation folds. (2) Frequency/count encoding — replace with the count of occurrences. No risk of target leakage. (3) Hashing trick — hash categories into a fixed number of bins. Loses interpretability but is memory-efficient and handles unseen categories. (4) Embeddings — learn dense vector representations (common in deep learning). Captures semantic relationships between categories. (5) Grouping rare categories — combine categories with few observations into an "Other" bucket. Reduces dimensionality and improves robustness. (6) Leave-one-out encoding — similar to target encoding but excludes the current row to reduce overfitting. CatBoost has built-in ordered target encoding that prevents leakage. Choose based on model type: tree models handle target encoding well; neural networks benefit from embeddings.
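The "within cross-validation folds" caveat on target encoding deserves a concrete sketch. Out-of-fold encoding computes each row's value from the other folds only, so no row ever sees its own label (the helper below is illustrative, not a library function):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(categories, target, n_splits=5, seed=0):
    """Out-of-fold target encoding: each row is encoded using category
    means computed from the OTHER folds, preventing target leakage."""
    cat = pd.Series(categories).reset_index(drop=True)
    y = pd.Series(target).reset_index(drop=True)
    out = pd.Series(np.nan, index=cat.index)
    global_mean = y.mean()  # fallback for categories unseen in a fold
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in kf.split(cat):
        means = y.iloc[tr_idx].groupby(cat.iloc[tr_idx]).mean()
        out.iloc[val_idx] = (
            cat.iloc[val_idx].map(means).fillna(global_mean).to_numpy()
        )
    return out.to_numpy()

enc = oof_target_encode(["a"] * 50 + ["b"] * 50, [1] * 50 + [0] * 50)
```

For test data, you would instead map categories to means computed on the full training set.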
Q8: When would you use ensemble methods vs a single model?
Model Answer: Use ensembles when: (1) maximizing predictive performance is the priority (competitions, high-value predictions), (2) individual models have complementary strengths (e.g., one captures linear patterns, another captures interactions), (3) you need more robust predictions. Ensemble methods: Bagging (Random Forest) — reduces variance; Boosting (XGBoost) — reduces bias; Stacking — trains a meta-model on predictions from diverse base models; Blending/averaging — simple weighted average of predictions. Use a single model when: (1) interpretability is required (ensembles are harder to explain), (2) inference latency is critical (ensembles multiply computation), (3) the problem is simple enough that a single model suffices, (4) deployment and maintenance simplicity is valued. In production, the maintenance cost of ensembles is often underestimated. Many winning competition solutions use large ensembles, but production systems typically use a single well-tuned model for practical reasons.
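Stacking, as described above, is available off the shelf in scikit-learn. A sketch on synthetic data, combining two deliberately different base learners:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A logistic-regression meta-model is trained on out-of-fold predictions
# from the diverse base models (cross-fitting is handled internally).
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
accuracy = stack.score(X_te, y_te)
```

Note the inference cost: every prediction now runs a forest, an SVM, and the meta-model — the latency point from the answer above made tangible.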
Q9: What is the difference between online learning and batch learning?
Model Answer: Batch learning trains on the entire dataset at once, producing a fixed model that is retrained periodically (daily, weekly, monthly). Simple, reproducible, and well-understood. Most production ML systems use batch learning. Online learning (also called incremental learning) updates the model continuously as new data arrives, one sample or mini-batch at a time. Use online learning when: (1) data arrives as a stream and you cannot store it all, (2) the underlying pattern changes rapidly (fast concept drift), (3) the dataset is too large to fit in memory. Algorithms that support online learning: SGD-based models, Vowpal Wabbit, online random forests. Challenges of online learning: catastrophic forgetting (new data overwrites old patterns), harder to debug and reproduce, concept drift must be detected and handled, no clear evaluation protocol (no static test set). A middle ground is micro-batch learning: retrain frequently on recent data windows.
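Online learning with an SGD-based model is a one-method affair in scikit-learn: `partial_fit` updates the model per mini-batch. A sketch simulating a stream of synthetic data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared up front for partial_fit

# Simulate a stream: the model is updated one mini-batch at a time and
# never needs the full dataset in memory.
for step in range(50):
    X_batch = rng.normal(size=(32, 3))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_eval = rng.normal(size=(1000, 3))
accuracy = clf.score(X_eval, (X_eval[:, 0] > 0).astype(int))
```

Because each batch is seen once and discarded, evaluation has to happen on a held-out stream segment rather than a static test set, as noted above.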
Q10: How do you approach feature scaling and when is it necessary?
Model Answer: Feature scaling normalizes features to comparable ranges. Methods: (1) Standardization (z-score): x' = (x - mean) / std. Centers at 0 with unit variance. Best general default. (2) Min-max scaling: x' = (x - min) / (max - min). Scales to [0,1]. Sensitive to outliers. (3) Robust scaling: uses median and IQR instead of mean/std. Handles outliers better. (4) Log transformation: for heavily skewed features. When necessary: algorithms that use distance metrics (KNN, SVM, K-Means) or gradient-based optimization (neural networks, logistic regression) require scaling because features with larger ranges dominate. When not necessary: tree-based models (decision trees, Random Forest, XGBoost) are invariant to feature scaling because they use threshold-based splits. Critical rule: fit the scaler on training data only, transform both train and test using training statistics. Fitting on all data causes data leakage.
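The critical rule at the end — fit on train only — looks like this in scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# Statistics (mean, std) come from the training data only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# The test point is transformed with the TRAINING mean and std;
# calling fit_transform on X_test here would be leakage.
X_test_scaled = scaler.transform(X_test)
```

The test value 4.0 maps to (4 - 2) / std(train) ≈ 2.45 standard deviations above the training mean — it is allowed to fall outside the training range.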
Q11: What is training-serving skew and how do you prevent it?
Model Answer: Training-serving skew occurs when the features computed during model training differ from those computed during serving/inference, leading to degraded production performance. Common causes: (1) Different code paths — training features computed in Python/Spark, serving features computed in Java/C++. Even subtle differences (rounding, null handling) cause skew. (2) Different data sources — training uses batch data from a warehouse, serving uses real-time data from a different system. (3) Temporal differences — training uses point-in-time features, but serving uses current values. (4) Feature store inconsistencies — transformations not synchronized. Prevention: (1) use a feature store that serves identical features for training and inference, (2) share feature computation code between training and serving pipelines, (3) log serving features and compare distributions to training features, (4) run integration tests that compare feature values across both paths.
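Prevention point (2) — shared feature code — is the cheapest fix and worth a sketch. The function below is hypothetical; the point is that both pipelines import the same module rather than reimplementing the logic:

```python
def compute_features(raw: dict) -> dict:
    """Single source of truth for feature computation, imported by BOTH
    the training pipeline and the serving path (hypothetical example)."""
    return {
        # Guard against division by zero identically in both paths.
        "amount_per_item": raw["amount"] / max(raw["n_items"], 1),
        "is_weekend": 1 if raw["day_of_week"] >= 5 else 0,
    }

# Training (batch) and serving (real-time) call the exact same function,
# so rounding and null-handling choices cannot silently diverge.
row = {"amount": 120.0, "n_items": 3, "day_of_week": 6}
features = compute_features(row)
```

When the two paths must live in different languages, the fallback is prevention point (3): log served features and diff their distributions against training.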
Q12: How do you decide whether a problem needs ML at all?
Model Answer: ML is not always the right solution. Use ML when: (1) the pattern is too complex for hand-crafted rules, (2) the pattern changes over time (requiring adaptation), (3) you have sufficient labeled data (or can obtain it), (4) the cost of errors is manageable. Do not use ML when: (1) simple rules or heuristics work well enough, (2) the problem is well-defined mathematically (use optimization or analytical solutions), (3) you have insufficient or poor-quality data, (4) the system needs to be fully explainable and deterministic, (5) the cost of building and maintaining an ML system exceeds its value. My framework: (1) start with a non-ML baseline (rules, statistics, simple thresholds), (2) quantify the gap between the baseline and acceptable performance, (3) estimate the cost of building and maintaining an ML solution, (4) only proceed with ML if the expected improvement justifies the complexity. Many successful "AI products" actually use ML for a small percentage of decisions, with rules handling the rest.
Q13: What is model interpretability and what techniques exist?
Model Answer: Model interpretability is the degree to which a human can understand and explain model predictions. Intrinsically interpretable models: linear regression (coefficients), decision trees (rules), logistic regression (log-odds). Post-hoc explanation techniques: (1) Feature importance — permutation importance (model-agnostic), tree-based importance (MDI, MDA). (2) SHAP (SHapley Additive exPlanations) — assigns each feature a contribution to each prediction based on game-theoretic Shapley values. Global and local explanations. (3) LIME (Local Interpretable Model-agnostic Explanations) — approximates the model locally with an interpretable model. Good for individual predictions. (4) Partial Dependence Plots — show the marginal effect of one feature on the prediction. (5) Attention weights (for neural networks) — though their faithfulness as explanations is debated. When interpretability is non-negotiable (healthcare, lending, criminal justice), either use intrinsically interpretable models or require rigorous post-hoc explanations.
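Technique (1), permutation importance, is model-agnostic and built into scikit-learn: shuffle one feature at a time on held-out data and measure the score drop. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Two informative features plus pure-noise features.
X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# Shuffling an informative feature hurts held-out accuracy; shuffling
# a noise feature barely moves it.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
importances = result.importances_mean
```

Using held-out data here matters: importance computed on training data can reward features the model merely memorized.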
Q14: How do you handle time series data differently from tabular data?
Model Answer: Time series has unique properties that require special handling: (1) Temporal ordering — never randomly shuffle time series data. Always split chronologically (train on past, test on future). Use time-based CV (expanding window or sliding window). (2) Autocorrelation — consecutive observations are correlated. Standard random CV overestimates performance because nearby (correlated) samples end up in both train and validation. (3) Stationarity — many models assume stable statistical properties. Apply differencing, detrending, or deseasonalizing to achieve stationarity. (4) Feature engineering — lag features, rolling statistics (mean, std over windows), time-based features (hour, day of week, holiday flags), and exponentially weighted averages. (5) Leakage risk — any feature using future information is leakage. Lag features must only look backward. (6) Specialized algorithms — ARIMA, Prophet, LSTM, Temporal Fusion Transformer, or tree models with lag features. The correct approach depends on seasonality, trend, and the forecast horizon.
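Points (1), (4), and (5) can be shown together: `shift(1)` before rolling guarantees features look strictly backward, and `TimeSeriesSplit` guarantees every training split precedes its validation split. A sketch on a toy series:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

s = pd.Series(np.arange(100, dtype=float))  # toy time series

# shift(1) ensures the rolling mean at time t uses only values up to t-1,
# so no feature peeks into the future.
df = pd.DataFrame({
    "y": s,
    "lag_1": s.shift(1),
    "roll_mean_7": s.shift(1).rolling(7).mean(),
})

# Expanding-window CV: train on the past, validate on the future.
tscv = TimeSeriesSplit(n_splits=5)
splits = [(train_idx, val_idx) for train_idx, val_idx in tscv.split(df)]
```

Dropping the `shift(1)` would include the current value in its own rolling mean — exactly the leakage risk in point (5).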
Q15: What is model compression and when would you use it?
Model Answer: Model compression reduces model size and inference cost while maintaining acceptable performance. Use it when deploying to resource-constrained environments (mobile devices, edge computing) or when serving latency is critical. Techniques: (1) Knowledge distillation — train a smaller "student" model to mimic the predictions (soft labels) of a larger "teacher" model. The student often outperforms training directly on hard labels. (2) Quantization — reduce numerical precision from FP32 to FP16, INT8, or even binary weights. 2-4x speedup with minimal accuracy loss. (3) Pruning — remove redundant weights, neurons, or attention heads. Structured pruning (removing entire channels/layers) is hardware-friendly; unstructured pruning requires sparse computation support. (4) Architecture search — design inherently efficient architectures (MobileNet, EfficientNet, TinyBERT). (5) Low-rank factorization — decompose weight matrices into products of smaller matrices. In practice, quantization and distillation are the most commonly used and provide the best effort-to-benefit ratio.
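The core idea behind technique (2) fits in a few lines. A simplified sketch of symmetric per-tensor INT8 quantization in NumPy (real frameworks add calibration, per-channel scales, and fused kernels):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: store int8 values plus a
    single float scale; dequantize as q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)  # stand-in for a weight tensor

q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale  # dequantized approximation

# Storage shrinks 4x (int8 vs float32); per-weight error is at most scale/2.
max_error = float(np.abs(w - w_hat).max())
```

The accuracy loss from this rounding error is usually small precisely because trained networks tolerate weight perturbations far larger than scale/2.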
Interview Tip: Practical ML questions reveal your real-world experience. Candidates who can discuss production challenges, data quality issues, and deployment tradeoffs stand out from those who only know textbook theory. Draw on your actual project experience whenever possible.
Lilly Tech Systems