Advanced

Bayesian Statistics

10 advanced interview questions on Bayesian methods — from foundational concepts to practical applications in ML systems at top tech companies.

Q1: Explain the Bayesian framework: prior, likelihood, posterior, and evidence.

💡
Model Answer: The Bayesian framework updates beliefs about a parameter θ given observed data D:

P(θ|D) = P(D|θ) · P(θ) / P(D)

Prior P(θ): Your belief about θ before seeing data. Example: before any A/B test, you might believe a feature has a 10% chance of improving CTR by more than 1%.
Likelihood P(D|θ): The probability of observing the data given a specific θ. This encodes how well θ explains what you observed.
Posterior P(θ|D): Your updated belief about θ after seeing data. It combines prior knowledge with evidence.
Evidence P(D): The total probability of the data across all possible θ values. Acts as a normalizing constant: P(D) = ∫ P(D|θ) P(θ) dθ.

Intuition: Posterior ∝ Likelihood × Prior. The posterior is a compromise between what you believed before (prior) and what the data tells you (likelihood). With little data, the prior dominates. With lots of data, the likelihood dominates and the posterior concentrates around the true value for any prior that assigns it nonzero probability.
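The prior-likelihood-posterior machinery can be made concrete with a grid approximation. This is a minimal numpy sketch, assuming a hypothetical coin-flip dataset (7 heads in 10 flips) and a flat prior:

```python
import numpy as np

# Grid approximation of Bayes' rule for a coin-flip example
# (hypothetical data: 7 heads in 10 flips, flat prior).
theta = np.linspace(0, 1, 1001)                 # candidate values of θ
prior = np.full_like(theta, 1 / len(theta))     # flat prior P(θ)

heads, flips = 7, 10
likelihood = theta ** heads * (1 - theta) ** (flips - heads)   # P(D|θ)

unnormalized = likelihood * prior               # Posterior ∝ Likelihood × Prior
posterior = unnormalized / unnormalized.sum()   # dividing by evidence P(D)

posterior_mean = (theta * posterior).sum()
print(round(posterior_mean, 3))                 # ≈ 0.667, the Beta(8, 4) mean
```

The sum that normalizes `unnormalized` plays the role of the evidence P(D); it does not affect which θ values are favored, only the scale.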

Q2: What are conjugate priors and why are they useful?

💡
Model Answer: A prior is conjugate to a likelihood if the posterior has the same distributional family as the prior. This makes Bayesian updating analytically tractable — you just update the parameters instead of doing complex integration.

Common conjugate pairs:
Beta prior + Binomial likelihood → Beta posterior: For estimating probabilities (e.g., click-through rates). Beta(α, β) + observing k successes in n trials → Beta(α + k, β + n - k).
Normal prior + Normal likelihood → Normal posterior: For estimating means.
Gamma prior + Poisson likelihood → Gamma posterior: For estimating rates.
Dirichlet prior + Multinomial likelihood → Dirichlet posterior: For estimating category probabilities (e.g., topic models).

Example: You want to estimate the CTR of a new ad. Your prior is Beta(2, 98) — expecting roughly 2% CTR. After 1,000 impressions with 30 clicks: posterior = Beta(2+30, 98+970) = Beta(32, 1068). Posterior mean = 32/1100 ≈ 2.9%. The prior pulled the estimate slightly toward 2%, providing regularization.

In ML: Thompson sampling for multi-armed bandits uses Beta-Binomial conjugacy. LDA topic models use Dirichlet-Multinomial conjugacy.
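Conjugate updating reduces to simple parameter arithmetic. A small sketch of the Beta-Binomial update, applied to the ad-CTR numbers from the example above (the function name is illustrative):

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Conjugate update: Beta(α, β) prior + Binomial data → Beta posterior."""
    return alpha + successes, beta + (trials - successes)

# The ad-CTR example above: Beta(2, 98) prior, then 30 clicks in 1,000 impressions.
a, b = beta_binomial_update(2, 98, successes=30, trials=1000)
print(a, b)                      # 32 1068
print(round(a / (a + b), 3))     # 0.029, the posterior mean
```

No integration was needed: the posterior stays in the Beta family, so the update is just pseudo-count bookkeeping.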

Q3: What is MAP estimation and how does it relate to MLE and regularization?

💡
Model Answer: Maximum A Posteriori (MAP) estimation finds the parameter value that maximizes the posterior distribution:
θₘₐₚ = argmax P(θ|D) = argmax P(D|θ) · P(θ)

Maximum Likelihood Estimation (MLE) finds the parameter that maximizes only the likelihood:
θₘₗₑ = argmax P(D|θ)

Relationship: MAP = MLE with a prior acting as a regularizer. Taking the log:
θₘₐₚ = argmax [log P(D|θ) + log P(θ)]

• If P(θ) is Gaussian N(0, σ²), then log P(θ) ∝ -||θ||², and MAP becomes L2 regularization (Ridge regression).
• If P(θ) is Laplace, then log P(θ) ∝ -||θ||₁, and MAP becomes L1 regularization (Lasso).
• If P(θ) is uniform (flat prior), then MAP = MLE.

Key insight: Every regularized model can be interpreted as Bayesian inference with a specific prior. This connects the frequentist practice of regularization to the Bayesian framework.
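The MAP-equals-ridge connection can be checked numerically. This is a toy sketch on synthetic regression data (the weights, noise scale, and prior variance are all made-up values): with a Gaussian N(0, σₚ²) prior on the weights, the MAP estimate has the ridge closed form with λ = σ²/σₚ².

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical regression data: y = Xw + noise
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=50)

sigma2_noise = 0.1 ** 2   # likelihood: y ~ N(Xw, σ²I)
sigma2_prior = 1.0        # prior: w ~ N(0, σₚ²I)

# MAP: argmax [log P(y|w) + log P(w)]
#    = argmin ||y - Xw||² / (2σ²) + ||w||² / (2σₚ²)
# which is ridge regression with λ = σ²/σₚ²:
lam = sigma2_noise / sigma2_prior
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.round(w_map, 2))   # close to w_true; λ shrinks weights toward 0
```

Increasing `sigma2_noise` (or decreasing `sigma2_prior`) raises λ and pulls the solution harder toward the prior mean of zero, exactly the regularization tradeoff.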

Q4: Compare Bayesian vs. Frequentist approaches. When would you prefer each?

💡
Model Answer:
Frequentist approach: Parameters are fixed but unknown. Probability refers to long-run frequency of events. Uses p-values and confidence intervals. Does not incorporate prior knowledge.

Bayesian approach: Parameters have probability distributions reflecting uncertainty. Probability represents degree of belief. Uses posterior distributions and credible intervals. Explicitly incorporates prior knowledge.

Prefer Frequentist when:
• You have large amounts of data (priors do not matter much)
• Regulatory requirements demand standard procedures (clinical trials)
• You need simple, well-understood methods (t-tests, ANOVA)
• Computational resources are limited

Prefer Bayesian when:
• You have small sample sizes and informative prior knowledge
• You need to quantify uncertainty in parameters (not just point estimates)
• You want to update beliefs incrementally as data arrives
• You need to make decisions under uncertainty (multi-armed bandits, Bayesian optimization)
• The problem has a natural hierarchical structure (hierarchical Bayesian models)

In practice: Most tech companies use frequentist methods for standard A/B testing but Bayesian methods for personalization, recommendation systems, and scenarios with limited data per entity.

Q5: You have a coin and you want to estimate the probability of heads. You flip it 10 times and get 7 heads. What is the Bayesian estimate using a uniform prior?

💡
Model Answer: Setup: Let p = P(heads). The uniform prior on [0,1] is Beta(1,1). The likelihood of 7 heads in 10 flips is Binomial(10, p).

Posterior: By conjugacy, Beta(1,1) + 7 successes in 10 trials = Beta(1+7, 1+3) = Beta(8, 4).

Posterior mean: α/(α+β) = 8/12 = 0.667. This is the Bayesian point estimate under squared error loss.

MAP estimate: (α-1)/(α+β-2) = 7/10 = 0.70. This equals the MLE because the uniform prior is flat.

95% credible interval: The central 95% of Beta(8,4) is approximately [0.39, 0.89]. This interval directly says "there is a 95% probability that p lies in this range" — unlike a frequentist confidence interval.

Comparison with MLE: MLE = 7/10 = 0.70. The Bayesian posterior mean (0.667) is slightly pulled toward 0.5 by the uniform prior. With more data (e.g., 70 heads in 100 flips), both would converge: posterior mean = 71/102 ≈ 0.696, MLE = 0.70.
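The quantities above are easy to verify. A short sketch that computes the posterior mean and MAP in closed form and approximates the 95% credible interval by Monte Carlo (sampling avoids a scipy dependency; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, beta = 1 + 7, 1 + 3                       # Beta(8, 4) posterior

posterior_mean = alpha / (alpha + beta)          # 8/12 ≈ 0.667
map_estimate = (alpha - 1) / (alpha + beta - 2)  # 7/10 = 0.70

# Equal-tailed 95% credible interval via Monte Carlo:
samples = rng.beta(alpha, beta, size=200_000)
lo, hi = np.percentile(samples, [2.5, 97.5])
print(round(posterior_mean, 3), map_estimate)    # 0.667 0.7
print(round(lo, 2), round(hi, 2))                # ≈ 0.39 0.89
```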

Q6: What is the difference between a credible interval and a confidence interval?

💡
Model Answer: 95% Credible interval (Bayesian): "There is a 95% probability that the parameter lies in this interval." This is a direct probability statement about the parameter given the data and prior.

95% Confidence interval (Frequentist): "If we repeated this experiment many times and computed a CI each time, 95% of those CIs would contain the true parameter." For any specific interval, the parameter is either in it or not — you cannot say there is a 95% probability.

Why this matters: The credible interval gives you what most people actually want — a probability statement about the parameter. The confidence interval gives you a statement about the procedure, not the parameter.

Example: A 95% CI for mean revenue is [$45, $55].
• Frequentist: "Our procedure captures the true mean 95% of the time." The true mean is either in [$45, $55] or it is not. We just do not know which.
• Bayesian: "Given our data and prior, P($45 ≤ μ ≤ $55) = 0.95."

In practice: With flat priors and large samples, credible and confidence intervals are often numerically identical. The distinction matters most when priors are informative or sample sizes are small.
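The "statement about the procedure" interpretation can be demonstrated by simulation. A sketch, assuming a hypothetical true proportion of 0.3 and the standard Wald interval: across many repeated experiments, roughly 95% of the computed intervals cover the truth, which is all the frequentist guarantee promises.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p, n, trials = 0.3, 500, 2000

# Repeat the "experiment" many times and check how often the 95% Wald
# confidence interval for a proportion covers the true parameter.
covered = 0
for _ in range(trials):
    p_hat = rng.binomial(n, true_p) / n
    half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
    covered += (p_hat - half <= true_p <= p_hat + half)

print(covered / trials)   # ≈ 0.95: the guarantee is about the procedure
```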

Q7: Explain Bayesian optimization and where it is used in ML.

💡
Model Answer: Bayesian optimization is a strategy for optimizing expensive-to-evaluate black-box functions. It maintains a surrogate model (typically a Gaussian Process) of the objective function and uses an acquisition function to decide where to evaluate next.

How it works:
1. Start with a few random evaluations of f(x).
2. Fit a Gaussian Process to the observed (x, f(x)) pairs. This gives a posterior mean (prediction) and variance (uncertainty) at every point.
3. Use an acquisition function to find the next x to evaluate. Common choices: Expected Improvement (EI), Upper Confidence Bound (UCB), Probability of Improvement (PI).
4. Evaluate f at the chosen x, update the GP, repeat.

Where it is used in ML:
Hyperparameter tuning: Learning rate, regularization strength, architecture choices. Much more sample-efficient than random search or grid search.
Neural architecture search (NAS): Finding optimal network architectures.
AutoML: Libraries like Optuna and Hyperopt use variants of Bayesian optimization.

Key advantage: It balances exploitation (evaluating near the current best) with exploration (evaluating uncertain regions). This is the same explore-exploit tradeoff that appears in multi-armed bandits and reinforcement learning.
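The four steps above can be sketched in a toy 1-D loop. This is a hand-rolled GP with an RBF kernel and a UCB acquisition, not a production implementation; the objective function, lengthscale, and UCB coefficient are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                          # hypothetical expensive black-box objective
    return -(x - 0.6) ** 2         # true maximum at x = 0.6

def rbf(a, b, length=0.2):         # squared-exponential (RBF) kernel
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

# Step 1: a few random evaluations.
X = rng.uniform(0, 1, size=3)
y = f(X)
grid = np.linspace(0, 1, 200)

for _ in range(10):
    # Step 2: GP posterior mean and variance at every grid point.
    K = rbf(X, X) + 1e-6 * np.eye(len(X))      # jitter for numerical stability
    Ks = rbf(grid, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    # Step 3: UCB acquisition = posterior mean + 2 × posterior stddev.
    ucb = mu + 2.0 * np.sqrt(np.maximum(var, 0.0))
    x_next = grid[np.argmax(ucb)]
    # Step 4: evaluate f at the chosen point, update the dataset, repeat.
    X = np.append(X, x_next)
    y = np.append(y, f(x_next))

print(round(float(X[np.argmax(y)]), 2))        # best x found, near 0.6
```

Early iterations pick points with large posterior variance (exploration); as uncertainty shrinks, the UCB maximum moves toward the best observed region (exploitation).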

Q8: What is the Beta distribution and why is it natural for modeling probabilities?

💡
Model Answer: The Beta distribution Beta(α, β) is a continuous distribution on [0, 1], making it a natural choice for modeling unknown probabilities. PDF: f(p) ∝ pᵅ⁻¹ · (1-p)ᵝ⁻¹.

Properties:
• Mean = α/(α+β)
• Mode = (α-1)/(α+β-2) for α, β > 1
• Variance = αβ/[(α+β)²(α+β+1)]

Interpretations:
• Beta(1,1) = Uniform(0,1) — no prior knowledge
• Beta(10,10) — prior centered at 0.5 with moderate confidence
• Beta(1,99) — prior strongly favoring small probabilities (e.g., rare event rates)
• α and β can be thought of as "pseudo-counts" of prior successes and failures

In practice: Beta priors are used everywhere probabilities are estimated: click-through rates, conversion rates, email open rates, user churn probabilities. Thompson sampling for multi-armed bandits maintains a Beta distribution for each arm's success probability and samples from each to decide which arm to pull.
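The Thompson sampling loop mentioned above is short to write down. A sketch with three hypothetical arms (the success rates and pull budget are made-up):

```python
import numpy as np

rng = np.random.default_rng(7)
true_ctr = np.array([0.02, 0.08, 0.04])   # hypothetical arm success rates
alpha = np.ones(3)                        # Beta(1, 1) prior per arm
beta = np.ones(3)

for _ in range(5000):
    samples = rng.beta(alpha, beta)       # one draw from each arm's posterior
    arm = int(np.argmax(samples))         # pull the arm with the best draw
    reward = rng.random() < true_ctr[arm]
    alpha[arm] += reward                  # pseudo-count of successes
    beta[arm] += 1 - reward               # pseudo-count of failures

pulls = alpha + beta - 2                  # pulls per arm
print(int(np.argmax(pulls)))              # 1: the best arm gets the most pulls
```

Sampling (rather than always pulling the posterior-mean best arm) is what provides exploration: an under-explored arm has a wide Beta posterior, so it occasionally produces the largest draw.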

Q9: A new ad has received 3 clicks in 100 impressions. The historical average CTR for similar ads is 5%. How would you estimate the true CTR using Bayesian methods?

💡
Model Answer: Step 1 — Choose a prior: The historical 5% CTR suggests Beta(α₀, β₀) with mean = α₀/(α₀+β₀) = 0.05. We need to choose the "strength" of the prior. If we trust it moderately, set α₀+β₀ = 100 (equivalent to 100 pseudo-observations): Beta(5, 95).

Step 2 — Update with data: 3 clicks in 100 impressions. Posterior = Beta(5+3, 95+97) = Beta(8, 192).

Step 3 — Posterior estimates:
• Posterior mean = 8/200 = 0.04 = 4.0%
• MLE (data only) = 3/100 = 3.0%
• Prior mean = 5.0%

The Bayesian estimate (4.0%) is a compromise between the data (3.0%) and the prior (5.0%), weighted by the amount of information in each. The prior contributes 100 pseudo-observations and the data contributes 100 real observations, so they have equal weight.

Why this is better than MLE: With only 100 impressions, the MLE of 3% is noisy. The 95% CI for a proportion is roughly 3% ± 3.3%. The Bayesian approach incorporates the prior knowledge that similar ads get ~5%, regularizing the estimate and reducing variance.
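The three steps above fit in a few lines. A sketch (the helper name and the alternative prior strength of 20 are illustrative) that also shows how the prior's weight controls the shrinkage:

```python
def posterior_ctr(prior_mean, prior_strength, clicks, impressions):
    """Beta-Binomial posterior mean for a CTR estimate (illustrative helper)."""
    a0 = prior_mean * prior_strength          # prior pseudo-successes
    b0 = (1 - prior_mean) * prior_strength    # prior pseudo-failures
    a, b = a0 + clicks, b0 + impressions - clicks
    return a / (a + b)

# The example above: a 5% prior worth 100 pseudo-observations, 3/100 observed.
print(posterior_ctr(0.05, 100, 3, 100))            # 0.04
# A weaker prior (20 pseudo-observations) leans closer to the raw data:
print(round(posterior_ctr(0.05, 20, 3, 100), 3))   # ≈ 0.033
```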

Q10: What is the Naive Bayes classifier and why does it work despite its "naive" assumption?

💡
Model Answer: Naive Bayes classifies by applying Bayes' theorem with the "naive" assumption of conditional independence between features given the class:

P(class|x₁, ..., xₙ) ∝ P(class) · ∏ P(xᵢ|class)

Instead of estimating the full joint P(x₁, ..., xₙ|class) (which requires exponentially many parameters), it estimates each P(xᵢ|class) independently (a number of parameters that grows only linearly in n).

Why it works despite the wrong assumption:
Classification does not require accurate probabilities. It only needs the correct ranking of classes. Even if P(spam|email) = 0.999 is overconfident, it still ranks spam higher than not-spam, giving the correct prediction.
Dependencies can cancel out. Correlated features double-count evidence in the same direction, but these distortions often offset each other across features, leaving the class ranking intact.
Regularization effect. The independence assumption is a strong constraint that prevents overfitting, similar to how strong regularization helps in high dimensions.
Empirically competitive. On text classification (spam filtering, sentiment analysis), Naive Bayes often matches or beats more complex models, especially with limited training data.

Variants: Gaussian NB (continuous features), Multinomial NB (word counts), Bernoulli NB (binary features).
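A minimal multinomial NB illustrates the factorized formula above. This sketch uses a hypothetical three-word vocabulary and made-up count data, working in log space to avoid underflow:

```python
import numpy as np

# Toy word-count data: columns = counts of ["free", "meeting", "winner"]
# (hypothetical vocabulary and documents).
X = np.array([[3, 0, 2], [2, 0, 3], [0, 2, 0], [0, 3, 1]])
y = np.array([1, 1, 0, 0])               # 1 = spam, 0 = not spam

def fit(X, y, smoothing=1.0):
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + smoothing
    log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))  # P(word|class)
    return classes, log_prior, log_lik

def predict(x, classes, log_prior, log_lik):
    # log P(class) + Σᵢ count(wordᵢ) · log P(wordᵢ|class)
    scores = log_prior + log_lik @ x
    return int(classes[np.argmax(scores)])

model = fit(X, y)
print(predict(np.array([2, 0, 1]), *model))   # 1: spam-like words dominate
print(predict(np.array([0, 2, 0]), *model))   # 0: "meeting" signals not-spam
```

The `smoothing` term is Laplace smoothing, itself interpretable as a Dirichlet prior over per-class word probabilities, tying back to the conjugacy discussion in Q2.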