Intermediate

Hypothesis Testing

12 real interview questions on hypothesis testing — the statistical framework behind every A/B test and experiment at every major tech company.

Q1: What is a null hypothesis and an alternative hypothesis? Give an example.

💡
Model Answer: The null hypothesis (H₀) is the default assumption that there is no effect, no difference, or no relationship. The alternative hypothesis (H₁) is what you are trying to find evidence for — that there is an effect, difference, or relationship.

Example: Meta runs an A/B test on a new News Feed ranking algorithm.
• H₀: The new algorithm has no effect on average time spent (μ₁ = μ₂).
• H₁: The new algorithm changes average time spent (μ₁ ≠ μ₂).

We never "accept" H₀; we either reject H₀ (finding sufficient evidence for H₁) or fail to reject H₀ (insufficient evidence). This asymmetry is intentional: it sets a high bar for claiming a new discovery, reducing false positives. The legal analogy: innocent until proven guilty. H₀ is "innocent" and we need strong evidence to "convict."

Q2: What is a p-value? What does a p-value of 0.03 mean?

💡
Model Answer: A p-value is the probability of observing data as extreme or more extreme than what was actually observed, assuming the null hypothesis is true.

A p-value of 0.03 means: If H₀ were true (no real effect), there would be only a 3% chance of seeing data this extreme by random chance alone.

What it does NOT mean:
• NOT the probability that H₀ is true (that requires Bayesian analysis).
• NOT the probability that the result is due to chance.
• NOT the probability of making an error.

Decision rule: If p-value < α (typically 0.05), reject H₀. But p-values say nothing about effect size. A p-value of 0.001 with a tiny effect (e.g., 0.01% improvement in CTR) may be statistically significant but practically meaningless. Always report effect size alongside p-values. This is a question that trips up even experienced data scientists and is a favorite at Meta and Google.
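The definition above can be sketched as code. This is a minimal illustration with made-up numbers (observed mean, hypothesized mean, σ, and n are all hypothetical), using a two-sided z-test:

```python
# Sketch: computing a two-sided p-value for a z-test (illustrative numbers).
from statistics import NormalDist

def z_test_p_value(x_bar, mu0, sigma, n):
    """Two-sided p-value: P(|Z| >= |z|) computed under H0, assuming known sigma."""
    z = (x_bar - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical example: observed mean 10.3 vs. H0 mean 10.0, sigma = 2.0, n = 200.
p = z_test_p_value(10.3, 10.0, 2.0, 200)
print(round(p, 4))  # ~0.034: under H0, data this extreme appears ~3.4% of the time
```

Note the interpretation in the comment: the 3.4% is a probability about the data under H₀, not a probability that H₀ is true.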

Q3: What are Type I and Type II errors? Which is worse?

💡
Model Answer: Type I error (false positive): Rejecting H₀ when it is actually true. You conclude there is an effect when there is not. Probability = α (significance level, typically 0.05).

Type II error (false negative): Failing to reject H₀ when it is actually false. You miss a real effect. Probability = β. Statistical power = 1 - β (typically 0.80).

Which is worse depends on context:
Medical trials: Type II is often worse — missing an effective treatment means patients suffer unnecessarily.
Product launches: Type I is often worse — launching a feature that does not actually help wastes engineering resources and may harm user experience.
Fraud detection: Type I means flagging legitimate transactions (annoying customers). Type II means missing real fraud (financial loss). The tradeoff depends on the relative costs.

Key insight: There is a tradeoff. At a fixed sample size, lowering α (fewer Type I errors) increases β (more Type II errors). Reducing both at once requires a larger sample size or lower-variance measurements.
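Both error rates can be estimated directly by simulation. The sketch below (hypothetical effect size 0.3, n = 100 per group, σ = 1) runs a two-sample z-test many times with H₀ true and with H₁ true:

```python
# Sketch: Monte Carlo estimate of Type I and Type II error rates for a
# two-sample z-test (hypothetical effect size, sample size, and sigma).
import random
from statistics import NormalDist

random.seed(0)

def significant(effect, n, alpha=0.05, sigma=1.0):
    """Run one simulated two-sample test; True if p < alpha."""
    a = [random.gauss(0.0, sigma) for _ in range(n)]
    b = [random.gauss(effect, sigma) for _ in range(n)]
    se = sigma * (2 / n) ** 0.5
    z = (sum(b) / n - sum(a) / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

trials = 2000
type1 = sum(significant(0.0, 100) for _ in range(trials)) / trials  # H0 true
power = sum(significant(0.3, 100) for _ in range(trials)) / trials  # H1 true
print(type1, 1 - power)  # Type I rate ~alpha; Type II rate = 1 - power
```

Re-running with a larger n shows both error rates can be pushed down together, which is the "increase sample size" point above.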

Q4: When do you use a t-test vs. a z-test?

💡
Model Answer: Both test whether a mean (or difference in means) is significantly different from a hypothesized value.

Z-test: Use when the population standard deviation σ is known and either the population is normal or the sample size is large (n > 30). The test statistic z = (X̄ - μ₀) / (σ/√n) follows N(0,1).

T-test: Use when (1) σ is unknown and estimated from the sample (s), OR (2) the sample size is small. The test statistic t = (X̄ - μ₀) / (s/√n) follows a t-distribution with n-1 degrees of freedom.

In practice: You almost always use a t-test because population σ is almost never known. For large n, the t-distribution converges to the normal distribution, so the results are nearly identical.

Types of t-tests:
One-sample: Is the sample mean different from a specific value?
Two-sample (independent): Are two group means different? (Used in A/B tests.)
Paired: Is there a difference within the same subjects before/after a treatment?
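The one-sample t-statistic is simple to compute by hand. A minimal sketch with made-up data (note the sample standard deviation uses the n-1 denominator, which is where the degrees of freedom come from):

```python
# Sketch: one-sample t-statistic with sigma estimated from the sample
# (hypothetical data). For large n it is compared against the same
# quantiles as a z-statistic, since t_df -> N(0,1).
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    n = len(sample)
    s = stdev(sample)  # sample standard deviation (n-1 denominator)
    t = (mean(sample) - mu0) / (s / n ** 0.5)
    return t, n - 1    # (t-statistic, degrees of freedom)

data = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.3, 10.0]
t, df = one_sample_t(data, 10.0)
print(round(t, 3), df)  # t ~1.732 with 7 degrees of freedom
```

In practice you would hand t and df to a t-distribution CDF (e.g., scipy.stats.t.sf) to get the p-value; the point here is only the statistic itself.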

Q5: An A/B test shows a p-value of 0.04. The product manager says "There is only a 4% chance this result is wrong." Is this correct?

💡
Model Answer: No, this is incorrect. A p-value of 0.04 means: "If there were truly no difference between A and B, there would be a 4% probability of seeing a difference this large or larger by chance." It does NOT mean there is a 4% chance the result is wrong.

The probability that the result is wrong depends on:
• The base rate of true effects (what fraction of features actually work).
• The statistical power of the test.
• The significance level chosen.

Example using Bayes: Suppose only 10% of feature ideas actually improve the metric (prior). With α = 0.05 and power = 0.80:
P(true effect | significant) = P(significant | true effect) · P(true effect) / P(significant)
= (0.80 × 0.10) / (0.80 × 0.10 + 0.05 × 0.90)
= 0.08 / (0.08 + 0.045) = 0.08 / 0.125 = 0.64

So there is actually a 36% chance the significant result is a false positive! This is why replication, effect size, and prior probability all matter.
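The Bayes computation above is easy to turn into code. The prior, α, and power values are the same hypothetical numbers used in the example:

```python
# Sketch: the Bayes computation above as code (hypothetical prior/alpha/power).
prior = 0.10   # fraction of feature ideas with a true effect
alpha = 0.05   # false positive rate when H0 is true
power = 0.80   # detection rate when H1 is true

# P(significant) via the law of total probability, then Bayes' rule.
p_significant = power * prior + alpha * (1 - prior)
p_true_given_sig = power * prior / p_significant
print(round(p_true_given_sig, 2))  # 0.64: a "significant" result is real only 64% of the time
```

Sweeping `prior` makes the lesson vivid: when only 1% of ideas work, most significant results are false positives even with a perfectly run test.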

Q6: When do you use a chi-squared test?

💡
Model Answer: The chi-squared test is used for categorical data. There are two main variants:

1. Chi-squared goodness-of-fit test: Tests whether observed frequencies match expected frequencies from a hypothesized distribution. Example: Are website visits equally distributed across days of the week?
χ² = ∑ (Oᵢ - Eᵢ)² / Eᵢ with k-1 degrees of freedom.

2. Chi-squared test of independence: Tests whether two categorical variables are related. Example: Is there a relationship between user device type (mobile/desktop) and purchase completion (yes/no)?
χ² = ∑ (Oᵢⱼ - Eᵢⱼ)² / Eᵢⱼ with (r-1)(c-1) degrees of freedom.

Requirements: Expected frequency in each cell should be ≥ 5 (some sources say ≥ 10). If not, use Fisher's exact test instead.

In ML: Feature selection (chi-squared test between each feature and the target variable is a common filter method), evaluating whether model predictions are independent of a protected attribute (fairness testing).
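The test of independence can be computed from scratch on a 2×2 table. The counts below are hypothetical (device type × purchase completion); for df = 1 the p-value follows from the normal tail, since P(χ²₁ > x) = erfc(√(x/2)):

```python
# Sketch: chi-squared test of independence on a hypothetical 2x2 table.
from math import erfc, sqrt

observed = [[120, 80],   # mobile:  purchased / not purchased
            [150, 50]]   # desktop: purchased / not purchased

row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
total = sum(row)

# Expected count under independence: E_ij = row_i * col_j / total.
chi2 = sum((observed[i][j] - row[i] * col[j] / total) ** 2 / (row[i] * col[j] / total)
           for i in range(2) for j in range(2))
p = erfc(sqrt(chi2 / 2))  # valid for df = (2-1)(2-1) = 1 only
print(round(chi2, 3), round(p, 4))
```

For larger tables you would use a chi-squared CDF with (r-1)(c-1) degrees of freedom (e.g., scipy.stats.chi2_contingency) instead of the df = 1 shortcut.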

Q7: What is ANOVA and when do you use it instead of multiple t-tests?

💡
Model Answer: ANOVA (Analysis of Variance) tests whether the means of three or more groups are all equal. H₀: μ₁ = μ₂ = ... = μₖ. H₁: At least one mean is different.

Why not just run multiple t-tests? With k groups, you would need C(k,2) pairwise t-tests. For k=5, that is 10 tests. If each uses α = 0.05, the probability of at least one false positive is 1 - (0.95)¹⁰ ≈ 0.40 — a 40% false positive rate! This is the multiple comparisons problem. ANOVA uses a single F-test to control the overall Type I error rate at α.

How it works: ANOVA compares between-group variance to within-group variance: F = MS_between / MS_within. If groups have truly different means, between-group variance is large relative to within-group variance, yielding a large F-statistic.

After ANOVA: If the F-test is significant, use post-hoc tests (Tukey HSD, Bonferroni) to identify which groups differ. ANOVA only tells you that at least one differs, not which one.
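The F-statistic decomposition is short enough to write out. A minimal sketch on three hypothetical groups:

```python
# Sketch: one-way ANOVA F-statistic from scratch (hypothetical groups).
from statistics import mean

groups = [[23, 25, 21, 22], [30, 28, 31, 29], [24, 26, 25, 23]]

k = len(groups)                       # number of groups
n = sum(len(g) for g in groups)       # total observations
grand = mean(x for g in groups for x in g)

ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

ms_between = ss_between / (k - 1)     # between-group variance, df = k-1
ms_within = ss_within / (n - k)       # within-group variance, df = n-k
f_stat = ms_between / ms_within
print(round(f_stat, 2))               # large F: group means likely differ
```

The p-value comes from the F-distribution with (k-1, n-k) degrees of freedom (e.g., scipy.stats.f_oneway does all of this in one call).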

Q8: What is the multiple comparisons problem and how do you correct for it?

💡
Model Answer: When you perform multiple statistical tests simultaneously, the probability of at least one false positive increases. With m independent tests at α = 0.05: P(at least one false positive) = 1 - (1 - α)ᵐ ≈ m · α for small α. With 20 tests: ≈ 64% chance of a false positive.

Correction methods:
Bonferroni correction: Use α/m as the threshold for each test. Simple but very conservative. With 20 tests: α′ = 0.05/20 = 0.0025.
Holm-Bonferroni: Step-down procedure. Less conservative than Bonferroni while still controlling family-wise error rate.
Benjamini-Hochberg (FDR): Controls the false discovery rate (expected proportion of false positives among rejected hypotheses) rather than family-wise error rate. More powerful, widely used in genomics and feature selection.

In practice at tech companies: When you test 50 metrics in an A/B test, you must correct for multiple comparisons. Meta, for example, designates a single "primary metric" that determines the launch decision, while secondary metrics use Bonferroni or FDR correction. Without correction, you will find "significant" effects that are pure noise.
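The Benjamini-Hochberg step-up procedure is only a few lines. The p-values below are hypothetical, chosen so BH and Bonferroni disagree:

```python
# Sketch: Benjamini-Hochberg step-up procedure on hypothetical p-values.
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * q, then reject
    # the k hypotheses with the smallest p-values.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k = rank
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.012, 0.018, 0.051, 0.09, 0.2, 0.3, 0.4, 0.5]
print(benjamini_hochberg(pvals))                                   # rejects 4
print([i for i, p in enumerate(pvals) if p <= 0.05 / len(pvals)])  # Bonferroni: 1
```

Bonferroni's fixed threshold of 0.05/10 = 0.005 rejects only the first hypothesis, while BH's sliding threshold rejects four, illustrating why FDR control is described as more powerful.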

Q9: What is statistical power and why does it matter?

💡
Model Answer: Statistical power = 1 - β = P(reject H₀ | H₁ is true). It is the probability of detecting a real effect when it exists. The standard target is 80%, meaning a 20% chance of missing a true effect.

Power depends on four factors:
Effect size: Larger effects are easier to detect. A 10% improvement in CTR is easier to detect than a 0.1% improvement.
Sample size (n): More data → more power. This is the main lever you control.
Significance level (α): More lenient α (e.g., 0.10 vs 0.05) → more power but more false positives.
Variance (σ²): Less noisy data → more power. Variance reduction techniques (CUPED, stratification) help here.

Why it matters: An underpowered experiment is a waste of time and traffic. If power is only 30%, you have a 70% chance of concluding "no effect" even when there is one. This leads companies to keep shipping inferior experiences because the test "showed no significant difference." At Google and Meta, power analysis is mandatory before launching any experiment.
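A basic power analysis can be sketched with the standard normal-approximation formula for a two-sample test, n per group = 2(z_{α/2} + z_β)²σ²/δ². The effect sizes and σ below are hypothetical:

```python
# Sketch: required n per group for a two-sample test at given alpha and power,
# using the normal-approximation sample-size formula (hypothetical inputs).
from statistics import NormalDist
from math import ceil

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # 0.84 for power = 0.80
    return ceil(2 * ((z_a + z_b) * sigma / delta) ** 2)

# Halving the detectable effect quadruples the required sample size.
print(sample_size_per_group(0.5, 2.0))  # ~252 per group
print(sample_size_per_group(1.0, 2.0))  # ~63 per group
```

Note the inverse-square relationship between effect size and n: this is why tiny expected lifts demand enormous traffic.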

Q10: What is the difference between one-tailed and two-tailed tests? When would you use each?

💡
Model Answer: Two-tailed test: H₁: μ ≠ μ₀. Tests for any difference (positive or negative). Rejection region is split across both tails. Use when you do not know the direction of the effect in advance.

One-tailed test: H₁: μ > μ₀ (or μ < μ₀). Tests for a difference in a specific direction. All α is in one tail, making it easier to reach significance for that direction. Use when you have a strong prior reason to expect only one direction.

In practice: Most A/B tests at tech companies use two-tailed tests because: (1) a treatment could hurt performance (you want to detect that too), (2) it is more conservative and scientifically rigorous, (3) choosing one-tailed after seeing data is p-hacking. Use one-tailed only when: the alternative direction is physically impossible or practically irrelevant, AND this decision is made before seeing data.

Numerical impact: For a given α, a one-tailed test has more power in the hypothesized direction. A z-score of 1.65 exceeds the one-tailed critical value at α = 0.05 (1.645) but not the two-tailed one (1.96).
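The numerical comparison is a one-liner per tail:

```python
# Sketch: the same z-score evaluated one-tailed vs. two-tailed.
from statistics import NormalDist

z = 1.65
one_tailed = 1 - NormalDist().cdf(z)        # P(Z > z)
two_tailed = 2 * (1 - NormalDist().cdf(z))  # P(|Z| > z)
print(round(one_tailed, 4), round(two_tailed, 4))
# one-tailed ~0.0495 (significant at 0.05); two-tailed ~0.0989 (not significant)
```

The same observed data crosses the 0.05 threshold in one framing and not the other, which is exactly why the choice of tail must be made before looking at the data.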

Q11: You run a hypothesis test and get a p-value of 0.06. Your manager asks you to "just collect more data until it becomes significant." What is wrong with this approach?

💡
Model Answer: This is called optional stopping (or "peeking"), and it is a form of p-hacking that inflates the false positive rate far beyond the stated α.

Why it is wrong: If you repeatedly check significance and stop as soon as p < 0.05, you are not running one test at α = 0.05. You are running multiple tests. Simulation studies show that with continuous monitoring, the actual false positive rate can reach 20-30% or higher even with α = 0.05.

The correct approaches:
Fix sample size in advance: Use a power analysis to determine the required n before starting. Run the experiment to completion, then analyze.
Sequential testing: Use methods designed for repeated analysis, such as the O'Brien-Fleming spending function or always-valid p-values. These adjust the significance threshold at each peek to maintain overall α.
Group sequential designs: Pre-specify a small number of interim analyses (e.g., at 50% and 100% of target n) with adjusted α at each look.

At Google and Meta: Experimentation platforms typically either lock out results until the pre-determined sample size is reached, or use sequential testing frameworks that properly control the error rate under continuous monitoring.
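The inflation from peeking is easy to demonstrate by simulation. In the sketch below the data are pure noise (H₀ true, known σ = 1), yet stopping at the first significant interim look rejects far more often than 5% (batch counts and sizes are hypothetical):

```python
# Sketch: Monte Carlo showing how "peek until significant" inflates the
# false positive rate on null data (hypothetical batch schedule).
import random
from statistics import NormalDist

random.seed(1)

def peeking_rejects(batches=20, batch_size=50):
    """True if ANY interim look reaches p < 0.05, even though H0 is true."""
    data = []
    for _ in range(batches):
        data += [random.gauss(0, 1) for _ in range(batch_size)]
        n = len(data)
        z = (sum(data) / n) / (1 / n ** 0.5)  # z-test of H0: mean = 0, sigma = 1
        if 2 * (1 - NormalDist().cdf(abs(z))) < 0.05:
            return True
    return False

trials = 1000
rate = sum(peeking_rejects() for _ in range(trials)) / trials
print(rate)  # well above the nominal 0.05 with 20 looks
```

Testing only once at the final sample size would bring the rejection rate back to roughly 5%; sequential methods achieve the same guarantee while still allowing interim looks.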

Q12: Explain the difference between statistical significance and practical significance with an example.

💡
Model Answer: Statistical significance means the observed effect is unlikely to have occurred by chance (p < α). Practical significance means the effect is large enough to matter for the business or application.

Example: Amazon runs an A/B test on checkout button color. With 10 million users per variant, they detect a 0.01% increase in conversion rate (p < 0.001, highly significant). But 0.01% on a 3% base rate is negligible. The engineering cost to maintain two button variants probably exceeds the revenue gain. Statistically significant but NOT practically significant.

The reverse can also happen: A small startup tests a new feature with only 200 users. They see a 15% improvement in engagement but p = 0.12 (not significant at α = 0.05). The effect is practically large but the test was underpowered to detect it. Practically significant but NOT statistically significant.

Best practice: Before running an experiment, define the minimum detectable effect (MDE) — the smallest effect that would justify action. Size the experiment to have 80% power at the MDE. If the observed effect is significant but smaller than the MDE, it may not be worth acting on.
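MDE-based sizing can be sketched with the two-proportion normal approximation. The baseline rate and lifts below are hypothetical, chosen to echo the Amazon example above:

```python
# Sketch: sizing an experiment at a minimum detectable effect (MDE) for a
# conversion rate, via the two-proportion normal approximation
# (hypothetical baseline and lifts).
from statistics import NormalDist
from math import ceil

def n_per_arm(p_base, mde_abs, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_new = p_base + mde_abs
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_a + z_b) ** 2 * var / mde_abs ** 2)

# 3% baseline: a 0.3-point lift needs ~50k users per arm, while a
# 0.01-point lift needs tens of millions -- if that lift is below the
# MDE, the experiment may not be worth running at all.
print(n_per_arm(0.03, 0.003))
print(n_per_arm(0.03, 0.0001))
```

Setting the MDE first forces the practical-significance conversation ("what lift would actually justify launching?") to happen before the experiment, not after.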