Advanced

A/B Testing Mathematics

10 questions on the math behind A/B testing — the most practically important topic in statistics interviews at Google, Meta, Amazon, and every data-driven tech company.

Q1: How do you calculate the required sample size for an A/B test?

💡
Model Answer: For a two-sample z-test comparing proportions (e.g., conversion rates), the sample size per group is:

n = (zₐ/₂ + zᵦ)² · (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²

Where zₐ/₂ is the critical value for significance level α (1.96 for α=0.05 two-tailed), zᵦ is the critical value for power 1-β (0.84 for 80% power), and p₁, p₂ are the expected proportions.

Simplified approximation: For detecting a relative change δ from baseline rate p:
n ≈ 16 · p(1-p) / (p · δ)² = 16(1-p) / (p · δ²)

Example: Baseline CTR = 5%, MDE = 10% relative increase (to 5.5%), α = 0.05, power = 80%:
n ≈ 16 × 0.05 × 0.95 / (0.005)² = 16 × 0.0475 / 0.000025 = 30,400 per group

Key insight: Sample size is proportional to 1/δ². Halving the detectable effect requires 4x the sample size. This is why detecting small improvements in already-optimized products requires massive experiments.
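The full formula is easy to code directly. A minimal sketch using the stdlib's `statistics.NormalDist` for the critical values — note the exact formula gives ≈31,000 for the CTR example above, close to the rule-of-thumb's 30,400:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sample z-test comparing proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n = sample_size_per_group(0.05, 0.055)  # baseline 5% CTR, 10% relative MDE
```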

Q2: Explain power analysis. Your team wants to detect a 2% improvement in click-through rate from a baseline of 10%. How many users do you need?

💡
Model Answer: Power analysis determines the sample size needed to detect a given effect size with specified confidence. The four interconnected parameters are: (1) effect size, (2) sample size, (3) significance level α, (4) power 1-β. Fixing any three determines the fourth.

For this problem:
• Baseline p₁ = 0.10 (10% CTR)
• Expected p₂ = 0.102 (10.2% CTR, a 2% relative increase)
• α = 0.05 (two-tailed), so zₐ/₂ = 1.96
• Power = 80%, so zᵦ = 0.84

n = (1.96 + 0.84)² × (0.10 × 0.90 + 0.102 × 0.898) / (0.002)²
= 7.84 × (0.0900 + 0.0916) / 0.000004
= 7.84 × 0.1816 / 0.000004
= 7.84 × 45,400
≈ 356,000 per group

Total: ~712,000 users. At 10,000 daily active users, this experiment would take ~71 days. If the team cannot wait that long, options include: increasing MDE (accept only larger improvements), using variance reduction (CUPED), or using a more sensitive metric.
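Running the relationship in the other direction, given a sample size you can compute the achieved power analytically. A sketch (Φ is the standard normal CDF, via `statistics.NormalDist`):

```python
from math import sqrt
from statistics import NormalDist

def achieved_power(p1, p2, n, alpha=0.05):
    """Power of a two-sample z-test for proportions with n users per group."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)  # SE of the difference
    return nd.cdf(abs(p1 - p2) / se - z_alpha)

pw = achieved_power(0.10, 0.102, 356_000)  # ~0.80, matching the hand calculation
```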

Q3: What is sequential testing and why is it important?

💡
Model Answer: Sequential testing allows you to monitor results continuously and stop the experiment early if the evidence is overwhelming, while still controlling the Type I error rate.

The problem with naive peeking: If you check significance daily and stop when p < 0.05, you will reject a true null much more than 5% of the time. With daily checks over 30 days, the actual false positive rate can reach 25%+.
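The inflation is easy to demonstrate by simulation. The sketch below models an A/A test (true null): each day contributes an independent standardized increment, and the running z-statistic is re-checked against 1.96 after every day:

```python
import random
from math import sqrt

def peeking_false_positive_rate(n_days=30, n_sims=5000, z_crit=1.96, seed=0):
    """Fraction of null (A/A) experiments that ever cross |z| > z_crit
    when the z-statistic is checked after each day of data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        cum = 0.0
        for day in range(1, n_days + 1):
            cum += rng.gauss(0, 1)              # that day's standardized increment
            if abs(cum / sqrt(day)) > z_crit:   # running z-statistic
                hits += 1
                break
    return hits / n_sims

fpr = peeking_false_positive_rate()  # far above the nominal 0.05
```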

Sequential testing methods:
Group sequential (O'Brien-Fleming): Pre-specify K interim analyses. Use stricter thresholds early (e.g., p < 0.005 at first look) and more lenient at the end (p < 0.048 at final look). Total α is still controlled at 0.05.
Alpha spending functions: Generalization that allocates α across continuous monitoring, not just fixed looks. Pocock and O'Brien-Fleming are common spending functions.
Always-valid p-values (Confidence sequences): Modern approach where the p-value is valid at any stopping time. Uses mixture martingale theory. Allows truly continuous monitoring.
Bayesian sequential testing: Compute the posterior probability that treatment is better and stop when it exceeds a threshold (e.g., P(B > A) > 0.95).

In practice: Google, Meta, and Netflix all use some form of sequential testing in their experimentation platforms. It reduces experiment duration by 20-30% on average while maintaining statistical rigor.

Q4: What is the multi-armed bandit problem and how does it differ from A/B testing?

💡
Model Answer: The multi-armed bandit problem is about allocating resources among competing options to maximize total reward, balancing exploration (learning which option is best) with exploitation (using the current best option).

A/B testing vs. Multi-armed bandits:
A/B testing: Fixed allocation (50/50), full exploration period, then full exploitation. Statistically rigorous but incurs "regret" during the exploration phase (half the users see the worse variant).
Multi-armed bandits: Adaptive allocation — gradually shift traffic toward the winning variant. Less regret during the experiment but weaker statistical guarantees.

Common algorithms:
Epsilon-greedy: Exploit the best arm with probability 1-ε, explore randomly with probability ε. Simple but not optimal.
UCB (Upper Confidence Bound): Play the arm with the highest upper confidence bound: X̄ᵢ + c√(ln(n)/nᵢ). Balances mean reward with uncertainty.
Thompson Sampling: Maintain a Beta distribution for each arm. Sample from each, play the arm with the highest sample. Bayesian, empirically strong, simple to implement.

When to use bandits over A/B testing: When the cost of exploration is high (e.g., showing bad recommendations), when you have many variants to test (e.g., 100 ad creatives), or when the environment is non-stationary.
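Thompson Sampling is short enough to sketch in full (Beta-Bernoulli version; the arm rates below are made up for illustration):

```python
import random

def thompson_sampling(true_rates, n_rounds=20_000, seed=1):
    """Beta-Bernoulli Thompson Sampling; returns pull counts per arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    alpha = [1] * k   # Beta(1, 1) uniform priors: alpha = successes + 1
    beta = [1] * k    # beta = failures + 1
    pulls = [0] * k
    for _ in range(n_rounds):
        # Draw one plausible rate per arm from its posterior, play the argmax
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_rates[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.03, 0.05, 0.08])  # traffic concentrates on the best arm
```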

Q5: What is CUPED and how does it reduce variance in A/B tests?

💡
Model Answer: CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique developed at Microsoft and widely used at Google, Meta, and Netflix. It uses pre-experiment data to reduce noise in the treatment effect estimate.

How it works: Instead of comparing raw means Ȳ_T - Ȳ_C, use the adjusted difference:
Δ_cuped = (Ȳ_T - Ȳ_C) - θ · (X̄_T - X̄_C)

where X is a pre-experiment covariate (e.g., the same metric measured before the experiment) and θ = Cov(X, Y) / Var(X) is chosen to minimize variance.

Variance reduction: Var(Y_cuped) = Var(Y) · (1 - ρ²), where ρ is the correlation between the pre-experiment covariate X and the experiment metric Y. If ρ = 0.5, variance is reduced by 25%. If ρ = 0.8, variance is reduced by 64%.

Practical impact: A 50% variance reduction means you need only half the sample size (or half the experiment duration) to achieve the same power. For metrics like revenue or engagement that are highly correlated with pre-experiment values, CUPED routinely achieves 30-60% variance reduction.

Why it is valid: Since the treatment assignment is random and independent of pre-experiment data, the adjustment does not introduce bias. It only removes noise that is predictable from pre-experiment behavior.
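A minimal sketch on synthetic data (the coefficients are made up; x plays the role of the pre-experiment metric):

```python
import random

def cuped_adjust(y, x):
    """CUPED: y'_i = y_i - theta * (x_i - mean(x)), theta = Cov(x,y)/Var(x)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    theta = cov / (sum((a - mx) ** 2 for a in x) / n)
    return [b - theta * (a - mx) for a, b in zip(x, y)]

def variance(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

rng = random.Random(2)
x = [rng.gauss(10, 2) for _ in range(10_000)]   # pre-experiment metric
y = [0.8 * a + rng.gauss(0, 1) for a in x]      # in-experiment metric
reduction = 1 - variance(cuped_adjust(y, x)) / variance(y)  # close to rho^2 (~0.72 here)
```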

Q6: What is the novelty effect and how do you account for it in A/B tests?

💡
Model Answer: The novelty effect occurs when users engage more with a new feature simply because it is new, not because it is better. The effect typically fades after days or weeks, and the true long-term impact may be much smaller (or even negative).

How to detect it:
• Plot the treatment effect over time. If it starts large and decays, novelty is likely present.
• Compare the effect for new users (who have no prior experience) vs. existing users. If existing users show a larger initial effect that decays, that is novelty.
• Segment by user tenure in the experiment (cohort analysis). Users who have been in the experiment longer should show the "true" long-term effect.

How to account for it:
Run longer experiments: Wait for the novelty to wear off (typically 2-4 weeks).
Use the "burn-in" approach: Ignore the first week of data and analyze only subsequent data.
New-user-only analysis: Analyze only users who joined after the experiment started — they have no "before" experience to contrast with.
Holdback experiments: After launching, keep a small control group to monitor long-term effects.

The reverse also exists: Primacy effect — users resist change initially, then adapt. The treatment effect starts negative and improves over time. Both must be considered when interpreting A/B test results.

Q7: What is the network effect problem in A/B testing and how do you handle it?

💡
Model Answer: The network effect (or interference / SUTVA violation) occurs when one user's treatment assignment affects another user's outcome. Standard A/B testing assumes SUTVA (Stable Unit Treatment Value Assumption) — that each user's outcome depends only on their own treatment, not on others' treatments.

Examples of SUTVA violations:
Social networks: If treatment users share more content, control users see more content too, inflating the control group's engagement.
Marketplace: If treatment sellers get more visibility, control sellers lose sales (zero-sum competition).
Messaging: A change to sender behavior affects receivers, who may be in the control group.

Solutions:
Cluster randomization: Randomize at the level of clusters (geographic regions, social clusters) instead of individuals. Users within a cluster are all in the same group, reducing interference between groups.
Ego-cluster randomization: Each user and their immediate social network form a cluster. Randomize at this level.
Switchback experiments: Alternate between treatment and control over time (common in ride-sharing marketplaces).
Ghost ads / counterfactual logging: For ad experiments, log what would have been shown to estimate causal effects without interference.

At Meta: The standard approach for social features is graph-cluster randomization, where the social graph is partitioned into clusters using community detection algorithms.

Q8: You run an A/B test and get a significant result with p = 0.02. Your colleague points out that you tested 10 metrics. Is the result still valid?

💡
Model Answer: Probably not. Testing 10 metrics at α = 0.05 gives a family-wise error rate of 1 - (0.95)¹⁰ ≈ 0.40 — a 40% chance of at least one false positive. The p = 0.02 result may be one of those false positives.

Corrections to apply:
Bonferroni: Adjusted threshold = 0.05/10 = 0.005. The p = 0.02 is NOT significant after Bonferroni correction.
Holm-Bonferroni: Sort p-values; compare the smallest to 0.05/10, the next to 0.05/9, and so on, stopping at the first failure. Less conservative than Bonferroni, but p = 0.02 still generally fails to reach significance here (e.g., if it is the smallest p-value, it is compared to 0.005).
Benjamini-Hochberg (FDR): Controls false discovery rate rather than family-wise error. Sort p-values, compare the k-th smallest to k × 0.05/10. More powerful, may retain the result depending on other p-values.

Better approach (pre-registration):
• Designate ONE primary metric before the experiment. This is the metric that determines the launch decision — no correction needed.
• All other metrics are secondary/exploratory. Apply FDR correction to these.
• Document the analysis plan in advance. This prevents "finding" significance by searching across metrics (p-hacking).

At Google and Meta: Experimentation platforms typically require pre-registration of the primary metric and automatically apply corrections to secondary metrics.
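Bonferroni and Benjamini-Hochberg are each a few lines of code. A sketch with made-up p-values, the question's 0.02 among them:

```python
def bonferroni(pvals, alpha=0.05):
    """Indices significant after Bonferroni correction."""
    m = len(pvals)
    return [i for i, p in enumerate(pvals) if p <= alpha / m]

def benjamini_hochberg(pvals, q=0.05):
    """Indices rejected by the BH step-up procedure at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k_max = rank       # largest rank whose p-value clears its threshold
    return sorted(order[:k_max])

pvals = [0.02, 0.04, 0.30, 0.001, 0.60, 0.15, 0.008, 0.45, 0.07, 0.50]
# Bonferroni keeps only p = 0.001; BH additionally keeps p = 0.008,
# but the p = 0.02 metric survives neither correction here.
```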

Q9: How do you handle ratio metrics (like revenue per user) in A/B tests?

💡
Model Answer: Ratio metrics (revenue/user, clicks/impression, time/session) require special care because the denominator varies across users, making the metric non-IID.

The problem: Simply averaging per-user ratios can be misleading. A user with 1 session and $100 revenue (ratio = $100) pulls the average up far more than a user with 100 sessions and $200 revenue (ratio = $2), even though the second user contributes far more activity.

Two approaches:
Ratio of averages: Δ = (total revenue in treatment / total users in treatment) - (total revenue in control / total users in control). This weights each user proportionally to their denominator. Use the delta method for the variance of each group's ratio r̂: Var(r̂) ≈ (1/n) · [Var(Y) - 2r · Cov(Y,X) + r² · Var(X)] / E[X]², where X is the denominator metric and Y is the numerator.

Linearization: Transform the ratio metric into a per-user linear metric: Yᵢ′ = Yᵢ - r̂ · Xᵢ, where r̂ is the overall ratio. Then apply a standard t-test to Y′. This is mathematically equivalent to the delta method but simpler to implement.

Which to use: For most tech company experiments, linearization is preferred because it integrates cleanly with CUPED and stratification. The delta method is used when you need explicit variance formulas for power calculations.
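Linearization itself is a one-liner. The two-user revenue example from above (made-up numbers) shows how the mean of per-user ratios misleads while the ratio of averages does not:

```python
def linearize(numer, denom):
    """Replace a ratio metric with per-user linear values y_i - r_hat * x_i."""
    r_hat = sum(numer) / sum(denom)   # overall ratio (ratio of averages)
    return [y - r_hat * x for y, x in zip(numer, denom)], r_hat

revenue = [100, 200]   # user A: $100 in 1 session; user B: $200 in 100 sessions
sessions = [1, 100]
linear, r_hat = linearize(revenue, sessions)

mean_of_ratios = sum(y / x for y, x in zip(revenue, sessions)) / 2  # $51 -- misleading
# r_hat = 300 / 101, about $2.97 per session -- the ratio of averages
# The linearized values sum to zero by construction: sum(y) - r_hat * sum(x) = 0
```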

Q10: What are the key assumptions of a standard A/B test, and what happens when each is violated?

💡
Model Answer:
1. Random assignment (SUTVA): Each user is independently assigned to treatment or control, and one user's assignment does not affect another's outcome.
Violation: Network effects, marketplace competition. Fix: Cluster randomization.

2. No interference between groups: Treatment users do not affect control users and vice versa.
Violation: Shared resources (cache, server), social interactions. Fix: Physical isolation, graph-based randomization.

3. Fixed sample size (no peeking): The sample size is determined before the experiment and the test is run once at the end.
Violation: Continuous monitoring, stopping early when significant. Fix: Sequential testing, always-valid p-values.

4. Single hypothesis (no multiple testing): One primary metric is tested at α = 0.05.
Violation: Testing many metrics, many variants. Fix: Bonferroni, BH correction, pre-registration of primary metric.

5. Independence of observations: Each data point is independent.
Violation: Same user appears multiple times (repeat visits). Fix: Randomize at user level (not session level), use user-level metrics, cluster-robust standard errors.

6. Sufficient sample size (CLT): The test statistic is approximately normal.
Violation: Small samples, heavily skewed metrics (revenue). Fix: Bootstrap, permutation tests, metric transformations (log, winsorization).