Practice Questions & Tips
Rapid fire questions to test your breadth, presentation strategies for data science interviews, and a comprehensive FAQ to address common concerns.
Rapid Fire: 20 Questions & Answers
Practice answering these in 1-2 sentences each. In an interview, rapid-fire rounds test breadth and quick recall across all data science topics.
| # | Question | Answer |
|---|---|---|
| 1 | What is the difference between a Type I and Type II error? | Type I is rejecting a true null hypothesis (false positive); Type II is failing to reject a false null hypothesis (false negative). |
| 2 | When would you use a non-parametric test? | When data is not normally distributed, ordinal, or has significant outliers that would distort parametric test assumptions. |
| 3 | What is the difference between R-squared and adjusted R-squared? | Adjusted R-squared penalizes for the number of predictors, preventing overfitting from adding irrelevant features; R-squared always increases with more features. |
| 4 | What is a confounding variable? | A variable that is correlated with both the independent and dependent variables, creating a spurious association between them. |
| 5 | What does INNER JOIN vs LEFT JOIN return? | INNER JOIN returns only matching rows from both tables; LEFT JOIN returns all rows from the left table with NULLs for non-matches from the right. |
| 6 | What is a window function in SQL? | A function that performs a calculation across a set of rows related to the current row, without collapsing them into a single output (unlike GROUP BY). |
| 7 | What is the difference between RANK() and DENSE_RANK()? | RANK() skips numbers after ties (1,1,3); DENSE_RANK() does not skip (1,1,2). |
| 8 | What is multicollinearity and why is it a problem? | When predictor variables are highly correlated with each other, making coefficient estimates unstable and difficult to interpret in linear regression. |
| 9 | What is the difference between L1 and L2 regularization? | L1 (Lasso) adds the absolute value of coefficients as a penalty and produces sparse solutions (some coefficients exactly zero); L2 (Ridge) adds squared coefficients and shrinks all coefficients toward zero without zeroing them out. |
| 10 | What is selection bias? | When the sample is not representative of the population due to systematic differences in who is included, leading to invalid conclusions. |
| 11 | What is survivorship bias? | Focusing only on entities that passed a selection process and overlooking those that did not, leading to overly optimistic conclusions. |
| 12 | How do you handle class imbalance? | Techniques include oversampling the minority class (SMOTE), undersampling the majority, adjusting class weights, using appropriate metrics (precision-recall AUC), or ensemble methods. |
| 13 | What is the difference between precision and recall? | Precision measures what fraction of positive predictions are correct (TP/(TP+FP)); recall measures what fraction of actual positives are found (TP/(TP+FN)). |
| 14 | What is a CTE in SQL? | A Common Table Expression is a temporary named result set defined with WITH that makes complex queries more readable and reusable within a single query. |
| 15 | What is the difference between mean, median, and mode? | Mean is the arithmetic average; median is the middle value (robust to outliers); mode is the most frequent value. Use median for skewed distributions like income. |
| 16 | What is standard error vs standard deviation? | Standard deviation measures the spread of individual data points; standard error measures the precision of the sample mean estimate (SD / sqrt(n)). |
| 17 | What is a p-value of 0.03 telling you? | There is a 3% probability of observing data this extreme (or more) if the null hypothesis is true; it does NOT mean there is a 3% chance the null is true. |
| 18 | What is the difference between a left-skewed and right-skewed distribution? | Right-skewed (positive) has a long tail to the right (mean > median); left-skewed (negative) has a long tail to the left (mean < median). |
| 19 | What is cross-validation? | A technique that splits data into K folds, trains on K-1 folds and validates on the held-out fold, rotating through all folds to get a robust estimate of model performance. |
| 20 | What is the difference between a data scientist and a data analyst? | Data analysts focus on describing what happened (reporting, dashboards); data scientists focus on predicting and prescribing (modeling, experimentation, causal inference). |
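The definitions in questions 13 and 16 can be sanity-checked with a short Python sketch. All counts and data values below are illustrative, not from any real dataset:

```python
import math

# Illustrative confusion-matrix counts (hypothetical values)
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)  # fraction of positive predictions that are correct
recall = tp / (tp + fn)     # fraction of actual positives that are found
print(f"precision = {precision:.2f}")  # 0.80
print(f"recall    = {recall:.2f}")     # 0.67

# Standard deviation vs standard error (question 16)
data = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0, 4.0, 6.0]
n = len(data)
mean = sum(data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample SD: spread of points
se = sd / math.sqrt(n)  # SE: precision of the sample-mean estimate
print(f"sd = {sd:.3f}, se = {se:.3f}")
```

Note how the standard error shrinks as `n` grows even though the standard deviation does not: collecting more data sharpens the mean estimate, not the underlying spread.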
Presentation Tips for Data Science Interviews
How you communicate is as important as what you know. Follow these strategies to present your answers clearly and confidently:
- **Start with the Conclusion.** Use the pyramid principle: state your answer or recommendation first, then provide supporting details. Interviewers have limited attention spans. Example: "I would use a gradient-boosted model because..." not "Let me walk you through all the algorithms I considered..."
- **Structure Your Answers.** Before diving in, say "I am going to approach this in three parts..." This signals organized thinking and gives the interviewer a mental framework. It also helps you avoid rambling. For case studies, explicitly state your framework before applying it.
- **Think Out Loud.** Interviewers want to see your reasoning process, not just the final answer. Narrate your thought process: "My first instinct is X, but I need to consider Y because..." This is especially important when you are stuck — showing how you work through uncertainty is more impressive than having an instant answer.
- **Use Concrete Numbers and Examples.** Instead of "a large sample size," say "roughly 10,000 users per group." Instead of "the model performed well," say "the AUC was 0.85, which is a 12% improvement over the baseline." Specificity builds credibility.
- **Acknowledge Tradeoffs.** Every decision in data science has tradeoffs. Mentioning them unprompted shows maturity. Example: "I would use logistic regression here because interpretability matters for this use case, even though a random forest might have higher accuracy."
- **Ask Clarifying Questions.** Do not assume. For case studies: "What is the primary business objective?" For technical questions: "Are you asking about the mathematical definition or the practical application?" Good questions demonstrate analytical thinking and prevent you from solving the wrong problem.
- **Manage Your Time.** In a 45-minute interview with five questions, you have roughly 7-8 minutes per question after allowing for introductions and wrap-up. If you are spending too long on one question, say "I can go deeper here, but in the interest of time, let me give you the key points and we can circle back." This shows awareness and respect for the interviewer's agenda.
Frequently Asked Questions
How long should I prepare for a data science interview?
For someone with a solid foundation, 4-6 weeks of focused preparation is typical. If you are transitioning from another field, plan for 8-12 weeks. Dedicate 2-3 hours daily: 1 hour for theory review, 1 hour for coding practice (SQL and pandas), and 30-60 minutes for case study practice. The most important thing is consistency — daily practice beats weekend cramming.
What SQL concepts are most tested in interviews?
In order of frequency: (1) JOINs (especially LEFT JOIN for anti-join patterns), (2) Window functions (ROW_NUMBER, RANK, LAG, LEAD, running totals), (3) GROUP BY with HAVING, (4) CTEs for readability, (5) Subqueries (correlated and non-correlated), (6) Date manipulation and aggregation. Self-joins and the "islands and gaps" problem are advanced patterns that distinguish strong candidates. Practice until you can solve medium-difficulty problems in 15 minutes.
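The window-function behavior described above can be tried out directly with Python's built-in `sqlite3` module, which supports window functions on SQLite 3.25 and later (bundled with most modern Python builds). The `scores` table and its rows here are hypothetical:

```python
import sqlite3

# In-memory database with a hypothetical scores table (illustrative data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("ana", 90), ("bo", 90), ("cy", 85), ("di", 80)],
)

# RANK() skips numbers after ties; DENSE_RANK() does not
rows = conn.execute("""
    SELECT name,
           score,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

for name, score, rnk, dense_rnk in rows:
    print(name, score, rnk, dense_rnk)
# ana 90 1 1
# bo 90 1 1
# cy 85 3 2
# di 80 4 3
```

Setting up a scratch database like this is a cheap way to verify your window-function syntax before a live coding round.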
Should I learn Bayesian statistics for interviews?
Yes, at a conceptual level. You should understand Bayes' theorem, priors, posteriors, and likelihoods. Know why Bayesian A/B testing is gaining popularity (it handles peeking and gives direct probability statements). You do not need to derive MCMC algorithms, but you should be able to explain when a Bayesian approach is preferred over frequentist and discuss the role of priors. Companies like Netflix and Spotify are particularly interested in Bayesian methods.
How do I handle a question I do not know the answer to?
Never fake it — experienced interviewers detect bluffing instantly. Instead: (1) Acknowledge what you do not know: "I have not worked with that specific technique." (2) Relate to what you do know: "But I am familiar with [similar concept] which addresses the same problem." (3) Reason through it: "Based on my understanding of [related topic], I would approach it by..." This shows intellectual honesty, learning ability, and problem-solving skills — all highly valued traits.
What is the most common reason candidates fail data science interviews?
Poor communication, not lack of knowledge. Most candidates who make it to onsite interviews have sufficient technical skills. They fail because: (1) they cannot explain concepts simply (overuse of jargon without substance), (2) they do not structure their answers (rambling without a clear point), (3) they do not connect theory to practice (textbook answers without real-world context), or (4) they do not ask clarifying questions for ambiguous problems (solving the wrong problem perfectly). Practice explaining concepts to a non-technical friend — if they understand, you are ready.
How important is domain knowledge for data science interviews?
It depends on the company and role. For product-focused DS roles (Meta, Airbnb, Spotify), understanding the product and its metrics is crucial. Research the company's business model, key metrics, and recent product changes before your interview. For more general DS roles, interviewers care more about your ability to learn a domain quickly. In either case, showing curiosity about the domain during the interview (by asking thoughtful questions about the business context) is always a positive signal.
Take-home assignment vs live coding: how to prepare differently?
For take-homes: focus on code quality, documentation, visualization, and storytelling. Present findings in a Jupyter notebook with clear markdown sections. Include your thought process, not just results. Handle edge cases and missing data explicitly. Time management is key — if given 4 hours, do not spend 3 on data cleaning. For live coding: practice under time pressure. Get comfortable with SQL and pandas without looking up syntax. Practice thinking out loud while coding. Start with a brute-force approach and optimize only if time permits. In both formats, always state your assumptions explicitly.
What resources do you recommend for preparation?
For statistics: "Naked Statistics" by Charles Wheelan for intuition, then Khan Academy for formal review. For SQL: LeetCode (Database section), StrataScratch, and DataLemur — aim for 100+ problems. For case studies: "Ace the Data Science Interview" by Nick Singh and Kevin Huo covers the major question types. For Python/pandas: Kaggle notebooks and real-world datasets. For mock interviews: Interviewing.io or practice with friends in the industry. Most importantly, build a portfolio project that demonstrates your end-to-end skills (data collection, cleaning, analysis, modeling, presentation).
Lilly Tech Systems