Patterns & Tips
A comprehensive reference for statistical coding patterns, numerical precision pitfalls, and interview strategy. Use this as a quick reference before your interview.
Statistical Coding Patterns Cheat Sheet
These patterns appear in nearly every statistics coding problem. Memorize them.
# ============================================================
# PATTERN 1: Two-Pass vs One-Pass Statistics
# ============================================================
# TWO-PASS (simple and stable):
def variance_two_pass(data):
    n = len(data)
    mean = sum(data) / n                               # Pass 1
    return sum((x - mean)**2 for x in data) / (n - 1)  # Pass 2
# ONE-PASS (Welford's - numerically stable):
def variance_one_pass(data):
    n = 0; mean = 0.0; M2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)
    return M2 / (n - 1)
# AVOID (catastrophic cancellation for large values):
def variance_NAIVE_DONT_USE(data):
    n = len(data)
    return sum(x**2 for x in data)/n - (sum(data)/n)**2  # BAD!
# ============================================================
# PATTERN 2: Log-Space Computation
# ============================================================
import math
# When multiplying many small probabilities:
# WRONG: product = p1 * p2 * p3 * ... * pn (underflows to 0)
# RIGHT: log_product = log(p1) + log(p2) + ... + log(pn)
# Log-sum-exp trick for log(exp(a) + exp(b)):
def log_sum_exp(a, b):
    max_val = max(a, b)
    return max_val + math.log(math.exp(a - max_val) + math.exp(b - max_val))
# For a list of values:
def log_sum_exp_list(values):
    max_val = max(values)
    return max_val + math.log(sum(math.exp(v - max_val) for v in values))
# Log-factorial for combinatorics:
def log_factorial(n):
    return sum(math.log(i) for i in range(1, n + 1))
# Log-binomial coefficient:
def log_comb(n, k):
    return log_factorial(n) - log_factorial(k) - log_factorial(n - k)
# ============================================================
# PATTERN 3: Sampling With and Without Replacement
# ============================================================
import random
def sample_with_replacement(data, k):
    """Bootstrap-style sampling."""
    return [data[random.randint(0, len(data)-1)] for _ in range(k)]

def sample_without_replacement(data, k):
    """Fisher-Yates shuffle, take first k elements."""
    arr = list(data)
    n = len(arr)
    for i in range(min(k, n)):
        j = random.randint(i, n - 1)
        arr[i], arr[j] = arr[j], arr[i]
    return arr[:k]
# ============================================================
# PATTERN 4: Resampling-Based Inference
# ============================================================
def bootstrap_statistic(data, stat_func, num_bootstrap=10000):
    """General pattern for bootstrap anything."""
    n = len(data)
    stats = []
    for _ in range(num_bootstrap):
        sample = [data[random.randint(0, n-1)] for _ in range(n)]
        stats.append(stat_func(sample))
    stats.sort()
    return {
        "mean": sum(stats) / len(stats),
        "ci_95": (stats[int(0.025*num_bootstrap)], stats[int(0.975*num_bootstrap)])
    }
def permutation_p_value(a, b, stat_func, num_perms=10000):
    """General pattern for a (one-sided) permutation test."""
    observed = stat_func(a, b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(num_perms):
        random.shuffle(pooled)
        perm_stat = stat_func(pooled[:len(a)], pooled[len(a):])
        if perm_stat >= observed:  # use abs() of both for a two-sided test
            count += 1
    return count / num_perms
# ============================================================
# PATTERN 5: Bayesian Update
# ============================================================
def bayesian_update_discrete(prior, likelihood):
    """Discrete Bayes: posterior = normalize(prior * likelihood)."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: Beta(a,b) + Binom data -> Beta(a+s, b+f)."""
    return alpha + successes, beta + failures
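A quick usage check for the discrete update (the helper from above is repeated so the snippet runs standalone; the two-coin setup is an illustrative example, not course data):

```python
def bayesian_update_discrete(prior, likelihood):
    """Discrete Bayes: posterior = normalize(prior * likelihood)."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

# Two hypotheses: a fair coin vs. a coin with P(heads) = 0.75.
prior = {"fair": 0.5, "biased": 0.5}
heads_likelihood = {"fair": 0.5, "biased": 0.75}

# Observe one head: posterior mass shifts toward the biased coin.
posterior = bayesian_update_discrete(prior, heads_likelihood)
print(posterior)  # {'fair': 0.4, 'biased': 0.6}
```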
Numerical Precision Guide
These are the precision pitfalls that catch candidates in interviews. Know them and you will avoid the traps.
import math
# ============================================================
# PITFALL 1: Catastrophic Cancellation
# ============================================================
# When subtracting two nearly-equal large numbers, relative error explodes.
# Example: variance of [1e9 + 1, 1e9 + 2, 1e9 + 3]
# Naive: E[X^2] - E[X]^2 = 1e18 + ... - 1e18 - ... (cancellation!)
# Correct: use two-pass or Welford's algorithm
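To see the cancellation concretely, here is a minimal sketch comparing the naive formula against Welford's recurrence on exactly this example (definitions repeated so the snippet runs standalone; Welford here returns the population variance):

```python
def variance_naive(data):
    # E[X^2] - E[X]^2: unstable when values are large
    n = len(data)
    return sum(x * x for x in data) / n - (sum(data) / n) ** 2

def variance_welford(data):
    # One-pass Welford recurrence (population variance: M2 / n)
    n, mean, m2 = 0, 0.0, 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n

data = [1e9 + 1, 1e9 + 2, 1e9 + 3]
print(variance_naive(data))    # 0.0 -- the true variance is lost to cancellation
print(variance_welford(data))  # ~0.6667 (exact answer: 2/3)
```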
# ============================================================
# PITFALL 2: Floating-Point Comparison
# ============================================================
# WRONG: if a == b
# RIGHT: if abs(a - b) < epsilon
def approx_equal(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare floats safely."""
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
# ============================================================
# PITFALL 3: Probability Underflow
# ============================================================
# Multiplying 1000 probabilities of 0.1 each:
# 0.1^1000 = 1e-1000 (underflows to 0.0 in float64!)
# Solution: work in log-space
# log(0.1^1000) = 1000 * log(0.1) = -2302.6 (perfectly representable)
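The underflow is easy to demonstrate directly:

```python
import math

# Multiplying 1000 probabilities of 0.1 underflows float64 to zero;
# summing their logs stays perfectly representable.
probs = [0.1] * 1000

direct = 1.0
for p in probs:
    direct *= p  # silently underflows to 0.0 along the way

log_prob = sum(math.log(p) for p in probs)  # = 1000 * log(0.1)

print(direct)    # 0.0
print(log_prob)  # about -2302.59
```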
# ============================================================
# PITFALL 4: Log of Zero
# ============================================================
# log(0) = -inf, which breaks computations.
# Solution: add a small epsilon
LOG_EPS = 1e-300 # smallest positive float64 ~ 5e-324
safe_log = lambda x: math.log(max(x, LOG_EPS))
# ============================================================
# PITFALL 5: Overflow in Exponentials
# ============================================================
# exp(1000) overflows to inf in float64.
# Solution: the log-sum-exp trick
# log(exp(a) + exp(b)):
# WRONG: math.log(math.exp(a) + math.exp(b)) # exp(1000) = inf
# RIGHT: max(a,b) + log(exp(a-max) + exp(b-max)) # stable!
# ============================================================
# PITFALL 6: Integer Overflow in Combinatorics
# ============================================================
# 100! has 158 digits. Python handles big integers natively,
# but log-space is still better for probabilities.
# C(1000, 500) is astronomically large, but:
# log(C(1000,500)) = sum(log(i) for i in 501..1000) - sum(log(i) for i in 1..500)
# This is perfectly representable as a float.
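One shortcut worth knowing here: the standard library's `math.lgamma` computes the log-gamma function directly, and `lgamma(n + 1) = log(n!)`, so the log-binomial needs no explicit loop:

```python
import math

def log_comb_lgamma(n, k):
    # log C(n, k) via log-gamma: lgamma(n + 1) == log(n!)
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

# C(1000, 500) has ~300 digits, but its log is a small, exact-enough float:
print(log_comb_lgamma(1000, 500))  # about 689.47
```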
# ============================================================
# PITFALL 7: Kahan Summation for Precision
# ============================================================
def kahan_sum(values):
    """Compensated summation - reduces error from O(n*eps) to O(eps)."""
    total = 0.0
    compensation = 0.0
    for x in values:
        y = x - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total
# When does Kahan matter?
# Summing 1e6 values of ~1.0: normal sum error ~ 1e-10
# Kahan sum error ~ 1e-16 (machine epsilon)
# For most interview problems, regular sum is fine.
# Mention Kahan if the interviewer asks about precision.
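A quick demonstration of the difference (a small sketch; `kahan_sum` is repeated from above so the snippet runs standalone):

```python
def kahan_sum(values):
    # Compensated summation (repeated here for a self-contained demo).
    total = 0.0
    compensation = 0.0
    for x in values:
        y = x - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total

# float 0.1 is not exactly representable, so errors accumulate over
# a million additions; Kahan keeps the result near machine precision.
values = [0.1] * 1_000_000

print(abs(sum(values) - 100000.0))        # small but visible drift
print(abs(kahan_sum(values) - 100000.0))  # a few ulps at most
```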
Interview Strategy
Step 1: Write the Formula
Before writing any code, write the mathematical formula as a comment. This shows the interviewer you understand the method, and it serves as your implementation roadmap. Example: # t = (x_bar - mu) / (s / sqrt(n))
Step 2: Handle Edge Cases First
Check for empty input, n=1 (where variance is undefined with Bessel's correction), division by zero (zero variance), and negative parameters. Handling these upfront shows maturity.
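A sketch of what that looks like in practice (a hypothetical variance helper, shown only to illustrate the guards):

```python
def sample_variance(data):
    # Guard the edge cases before computing anything.
    if not data:
        raise ValueError("empty input: variance is undefined")
    n = len(data)
    if n == 1:
        raise ValueError("n=1: sample variance needs at least 2 points")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)
```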
Step 3: Implement from Scratch
Translate each component of the formula into code. Use helper functions for reusable computations (mean, variance). Name variables to match the mathematical notation.
Step 4: Verify with Known Values
Always test with values where you know the answer: perfect correlation (r=1), standard normal CDF at 0 (0.5), fair coin flips, etc. Mention the scipy equivalent for verification.
Quick Reference: Common Formulas
| Statistic | Formula | Python (from scratch) |
|---|---|---|
| Mean | sum(x) / n | sum(data) / len(data) |
| Sample Variance | sum((x-mean)^2) / (n-1) | sum((x-mu)**2 for x in data) / (n-1) |
| Pearson r | cov(X,Y) / (std(X)*std(Y)) | (n*sum_xy - sum_x*sum_y) / sqrt(...) |
| Normal PDF | (1/sqrt(2pi*s^2)) * exp(-(x-mu)^2 / (2s^2)) | math.exp(-(x-mu)**2 / (2*s**2)) / (s*sqrt(2*pi)) |
| t-statistic (one-sample) | (x_bar - mu) / (s / sqrt(n)) | (x_bar - mu) / (s / math.sqrt(n)) |
| t-statistic (Welch two-sample) | (x1_bar - x2_bar) / sqrt(s1^2/n1 + s2^2/n2) | (mean_a - mean_b) / math.sqrt(var_a/n_a + var_b/n_b) |
| Chi-squared | sum((O-E)^2 / E) | sum((o-e)**2/e for o,e in zip(obs, exp)) |
| Bayes update | P(H\|E) = P(E\|H)*P(H) / P(E) | post[h] = prior[h]*lik[h] / sum(...) |
| Beta update | Beta(a+k, b+n-k) | alpha += successes; beta += failures |
Frequently Asked Questions
When should I use sample variance (n-1) versus population variance (n)?
Use sample variance (n-1) when your data is a sample from a larger population and you want to estimate the population variance. This is the case in virtually all interview problems. Use population variance (n) only when you have the entire population (rare). The reason for n-1 (Bessel's correction) is that the sample mean is computed from the same data, making the divide-by-n estimator biased low: E[sum((x_i - x_bar)^2)/n] = (n-1)/n * sigma^2. Dividing by n-1 corrects this bias. In an interview, always clarify which one is expected, and explain the difference.
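A quick simulation (an illustrative sketch, not part of the course code) makes the bias concrete:

```python
import random

random.seed(0)

# Empirical check of Bessel's correction: dividing by n underestimates
# sigma^2 by a factor of (n-1)/n; dividing by n-1 is unbiased.
# True variance of Uniform(0, 1) is 1/12 ~ 0.0833.
n, trials = 5, 100_000
biased_avg = unbiased_avg = 0.0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    biased_avg += ss / n          # divide by n
    unbiased_avg += ss / (n - 1)  # divide by n-1 (Bessel)
biased_avg /= trials
unbiased_avg /= trials

print(biased_avg, unbiased_avg)  # ~0.0667 (= 4/5 of 1/12) vs ~0.0833 (= 1/12)
```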
When should I use a parametric test versus a non-parametric test?
Use a parametric test (t-test, z-test) when: (1) the data is approximately normally distributed (or n > 30 by CLT), (2) the test statistic has a known distribution, (3) you want a quick closed-form answer. Use a non-parametric test (permutation, bootstrap) when: (1) the data is not normal (skewed, heavy-tailed), (2) sample size is small, (3) you have a custom test statistic that doesn't have a known distribution, (4) you want exact p-values without distributional assumptions. In tech company interviews, mention both approaches. In quant firm interviews, implement both and compare results.
How does frequentist A/B testing compare with Bayesian A/B testing?
Frequentist: Compute a p-value = P(data this extreme | null hypothesis). If p < 0.05, reject null. Problems: (1) p-values are confusing ("probability of the data, not the hypothesis"), (2) requires fixed sample sizes determined in advance, (3) peeking at results inflates the false positive rate, (4) cannot say "B is better with 95% probability." Bayesian: Compute P(B better than A | data) directly using posterior distributions. Advantages: (1) direct probability statements, (2) no need for fixed sample sizes, (3) early stopping is fine, (4) quantifies risk (expected loss). Disadvantages: (1) requires choosing a prior, (2) computationally more expensive. In practice, many companies are moving toward Bayesian A/B testing (Netflix, Booking.com, VWO).
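A minimal sketch of the Bayesian computation, using Beta posteriors under uniform Beta(1,1) priors (the conversion counts are made up for illustration):

```python
import random

random.seed(1)

# Hypothetical data: A converts 200/1000, B converts 230/1000.
# Beta(1,1) prior + binomial data -> Beta(1 + successes, 1 + failures).
a_post = (1 + 200, 1 + 800)
b_post = (1 + 230, 1 + 770)

# Estimate P(B better than A | data) by sampling both posteriors.
draws = 100_000
wins = sum(
    random.betavariate(*b_post) > random.betavariate(*a_post)
    for _ in range(draws)
)
print(wins / draws)  # roughly 0.95
```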
How do I implement the normal CDF from scratch?
The standard approach is: (1) Use the relationship CDF(x) = 0.5 * (1 + erf(x / sqrt(2))), (2) Implement the error function using the Abramowitz & Stegun approximation (formula 7.1.26), which gives ~7 digits of accuracy. The key constants are a1=0.254829592, a2=-0.284496736, a3=1.421413741, a4=-1.453152027, a5=1.061405429, p=0.3275911. The formula is: erf(x) = 1 - (a1*t + a2*t^2 + ... + a5*t^5) * exp(-x^2) where t = 1/(1 + p*|x|). Similar polynomial approximations are used inside standard numerical libraries, and implementing one from scratch is the expected answer at quant firms.
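That recipe, written out as a runnable sketch (constants are the Abramowitz & Stegun 7.1.26 values quoted above; the approximation's absolute error is about 1.5e-7):

```python
import math

def erf_approx(x):
    # Abramowitz & Stegun 7.1.26 rational approximation of erf.
    a1, a2, a3, a4, a5 = (0.254829592, -0.284496736, 1.421413741,
                          -1.453152027, 1.061405429)
    p = 0.3275911
    sign = 1 if x >= 0 else -1  # erf is odd: erf(-x) = -erf(x)
    x = abs(x)
    t = 1.0 / (1.0 + p * x)
    poly = a1*t + a2*t**2 + a3*t**3 + a4*t**4 + a5*t**5
    return sign * (1.0 - poly * math.exp(-x * x))

def normal_cdf(x, mu=0.0, sigma=1.0):
    # CDF(x) = 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))
    return 0.5 * (1.0 + erf_approx((x - mu) / (sigma * math.sqrt(2.0))))

print(normal_cdf(0.0))   # 0.5
print(normal_cdf(1.96))  # ~0.975
```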
What is Welford's algorithm and why does it matter?
Welford's algorithm computes running mean and variance in a single pass with O(1) space and excellent numerical stability. The recurrence is: when value x arrives, delta = x - mean, mean += delta/n, delta2 = x - mean (new mean), M2 += delta * delta2. Variance = M2/n (population) or M2/(n-1) (sample). It matters because: (1) it is an online algorithm (processes data one element at a time), (2) it uses O(1) space, (3) it avoids the catastrophic cancellation of E[X^2] - E[X]^2, (4) it is the most-asked statistics coding question at quant firms. The naive formula E[X^2] - E[X]^2 fails when values are large (e.g., 1e9 + small noise) because squaring creates numbers too large to subtract accurately.
How fast does Monte Carlo converge, and how do I estimate the error?
Monte Carlo estimators converge at rate O(1/sqrt(n)) by the Central Limit Theorem. This means: (1) doubling accuracy requires 4x more samples, (2) for 2 decimal places (~0.01 error), you need ~10,000 samples, (3) for 4 decimal places, you need ~100,000,000 samples. To estimate the error, compute the standard error: SE = std(samples) / sqrt(n). The 95% confidence interval for the estimate is mean +/- 1.96*SE. Variance reduction techniques (importance sampling, control variates, antithetic variables, stratified sampling) can improve convergence but are beyond most interview scope. Always set a random seed for reproducibility.
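That error-estimation recipe as a runnable sketch (estimating the mean of Uniform(0,1), an arbitrary illustrative target with known answer 0.5):

```python
import math
import random

random.seed(42)  # reproducibility

n = 10_000
samples = [random.random() for _ in range(n)]

# Point estimate, sample variance, and standard error of the estimate.
mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / (n - 1)
se = math.sqrt(var / n)

# 95% CI: mean +/- 1.96 * SE
print(f"estimate = {mean:.4f} +/- {1.96 * se:.4f}")
```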
Which statistics coding questions come up most often at quant firms?
Based on publicly shared interview experiences from Jane Street, Citadel, Two Sigma, DE Shaw, and Jump Trading: (1) Implement Welford's online variance (tests numerical stability), (2) Monte Carlo option pricing (tests simulation + finance), (3) Implement a t-test from scratch (tests statistical knowledge), (4) Birthday problem / Coupon collector (tests probability reasoning), (5) Bayesian coin bias estimation (tests Bayesian thinking), (6) Bootstrap confidence interval (tests resampling), (7) Implement normal CDF from scratch (tests numerical methods), (8) Random walk properties (tests probability theory), (9) Permutation test (tests non-parametric methods), (10) Thompson sampling / multi-armed bandit (tests decision-making under uncertainty). The common thread: every question requires implementing the method from scratch and explaining the mathematical foundations.
Course Summary
- Descriptive Statistics: Mean, median, mode, variance, percentiles, correlation, Welford's online algorithm
- Distributions: Normal (PDF, CDF via erf), Binomial (log-space), Poisson, Exponential, Box-Muller sampling
- Hypothesis Testing: Welch's t-test, chi-squared, A/B test z-test, permutation test, bootstrap CI (percentile + BCa)
- Monte Carlo: Pi estimation, option pricing, probability puzzles, random walks, Metropolis-Hastings MCMC
- Bayesian Methods: Discrete Bayes, Beta-Binomial conjugate, Naive Bayes classifier, Bayesian A/B test, Thompson sampling
- Core Skill: Every method implemented from scratch, from the mathematical formula, with numerical stability
Lilly Tech Systems