Intermediate

Local vs Global Differential Privacy

The trust model determines where noise is added: at the data source (local DP) or by a trusted aggregator (global/central DP). This choice fundamentally affects both the privacy guarantee and the utility of results.

Two Trust Models

Model | Trust Assumption | Where Noise Is Added | Utility | Use Case
--- | --- | --- | --- | ---
Global (Central) DP | Trusted curator holds raw data | After aggregation | Higher accuracy | Internal analytics, Census
Local DP | No trusted party; users do not share raw data | Before data leaves user device | Lower accuracy | Telemetry, keyboard data

Global (Central) Differential Privacy

In the central model, a trusted data curator collects raw data and applies noise to the outputs of queries or analyses:

  • Advantage: Much better accuracy for the same privacy level. Noise is added once to the aggregate, not to each individual record.
  • Disadvantage: Requires trusting the curator with raw data. If the curator is compromised, all data is exposed.
  • Examples: US Census Bureau's 2020 Census, internal company analytics, DP-SGD model training.
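As a concrete sketch of "noise added once, after aggregation," here is a central-model count using the Laplace mechanism (an illustration only: `dp_count` is our name, not a standard API, and it assumes a counting query, which has sensitivity 1):

```python
import random

def dp_count(records, predicate, epsilon: float) -> float:
    """Central-model DP count: the trusted curator computes the exact
    count over raw records, then adds Laplace(1/epsilon) noise once.
    (Counting queries have sensitivity 1.)"""
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two Exp(epsilon) samples is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```

Note that the noise scale is independent of the number of records, which is why the central model keeps its accuracy as datasets grow.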

Local Differential Privacy

In the local model, each user perturbs their own data before sending it. The server never sees raw data:

  • Advantage: Strongest trust model. Even if the server is compromised, individual data remains private.
  • Disadvantage: Requires significantly more users to achieve the same accuracy. Because every report carries its own noise, the error of the aggregate is roughly a factor of √n larger than in central DP with the same n users.
  • Examples: Apple's emoji/Safari data, Google's RAPPOR for Chrome.
💡
Accuracy trade-off: For the same privacy guarantee (same ε) and the same number of users n, local DP's estimation error is roughly √n times larger than central DP's. Equivalently, matching central DP's accuracy requires on the order of n times more users. This makes local DP practical only for very large user populations.
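The gap can be made concrete with rough worst-case error formulas (a back-of-the-envelope sketch: the function names are ours, and the local bound uses the worst-case variance of the debiased randomized-response estimator covered in the next section):

```python
import math

def rr_epsilon(p: float) -> float:
    """Privacy parameter of randomized response with truth probability p."""
    return math.log((1 + p) / (1 - p))

def local_std_error(n: int, p: float) -> float:
    """Worst-case standard error of the debiased randomized-response
    estimate over n users: sqrt(0.25 / n) / p."""
    return 0.5 / (p * math.sqrt(n))

def central_std_error(n: int, epsilon: float) -> float:
    """Standard error of the noise a central curator adds to a
    proportion: Laplace(1/epsilon) on the count, scaled by 1/n."""
    return math.sqrt(2) / (epsilon * n)

# Same epsilon, one million users: local error is roughly sqrt(n) times larger.
n, p = 1_000_000, 0.75
print(local_std_error(n, p) / central_std_error(n, rr_epsilon(p)))
```

With a million users the ratio comes out near 1000, i.e. about √n, matching the callout above.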

Randomized Response

The simplest local DP mechanism, invented by Stanley Warner in 1965 for survey research:

Python - Randomized Response
import random

def randomized_response(true_answer: bool, p: float = 0.75) -> bool:
    """Local DP via randomized response.
    With probability p, report truthfully.
    With probability 1-p, report randomly.
    Satisfies ln((1+p)/(1-p))-DP when p > 0.5."""
    if random.random() < p:
        return true_answer        # Truth
    else:
        return random.choice([True, False])  # Random

def estimate_true_proportion(responses, p=0.75):
    """Recover the true proportion from noisy responses."""
    observed = sum(responses) / len(responses)
    # Invert E[observed] = p*true + 0.5*(1-p):
    estimated = (observed - 0.5 * (1 - p)) / p
    return max(0, min(1, estimated))
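Putting the two pieces together, an end-to-end run might look like this (a self-contained sketch with the debiasing inlined; the 30% true rate and the user count are made up for the simulation):

```python
import random

def simulate_survey(n: int, true_rate: float, p: float = 0.75) -> float:
    """End-to-end sketch: n users each apply randomized response to a
    sensitive yes/no answer; the server debiases the noisy average."""
    yes_reports = 0
    for _ in range(n):
        truth = random.random() < true_rate
        if random.random() < p:
            report = truth                         # truthful with prob p
        else:
            report = random.choice([True, False])  # random with prob 1-p
        yes_reports += report
    observed = yes_reports / n
    # E[observed] = p * true_rate + 0.5 * (1 - p), so invert:
    return (observed - 0.5 * (1 - p)) / p

random.seed(1)
print(round(simulate_survey(200_000, true_rate=0.30), 2))
```

With 200,000 simulated users the estimate lands close to the true 30%, while any single user's report reveals almost nothing.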

RAPPOR

RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) is Google's local DP system for collecting statistics from Chrome browsers. It uses a combination of Bloom filters and randomized response to collect frequency data on categorical values while protecting individual users.
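A heavily simplified sketch of the core idea follows. Real RAPPOR applies both a permanent and an instantaneous randomization step and uses per-client cohorts; the filter size, hash count, and f value here are purely illustrative:

```python
import hashlib
import random

NUM_BITS = 64   # Bloom filter size (illustrative)
NUM_HASHES = 2  # hash functions per value (illustrative)

def bloom_bits(value: str) -> set:
    """Indices of the Bloom-filter bits set for this value."""
    return {
        int.from_bytes(
            hashlib.sha256(f"{i}:{value}".encode()).digest()[:4], "big"
        ) % NUM_BITS
        for i in range(NUM_HASHES)
    }

def rappor_report(value: str, f: float = 0.5) -> list:
    """Perturb each Bloom bit: with probability f, replace it with a fair
    coin flip; otherwise report it truthfully (randomized response per bit)."""
    bits = bloom_bits(value)
    report = []
    for i in range(NUM_BITS):
        true_bit = 1 if i in bits else 0
        if random.random() < f:
            report.append(random.randint(0, 1))  # randomized bit
        else:
            report.append(true_bit)              # truthful bit
    return report
```

The server aggregates millions of such bit vectors and statistically decodes which candidate values are frequent, without being able to decode any single user's report.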

The Shuffle Model

A middle ground between local and central DP. Users apply local randomization, then a trusted shuffler permutes the messages before the server sees them:

  • The shuffling amplifies the privacy guarantee beyond what local DP alone provides
  • Achieves accuracy close to central DP with local DP's trust model
  • Can be implemented with secure shuffling protocols or anonymous channels
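The shuffler itself is simple to state in code (a sketch of the trust boundary only; the privacy-amplification analysis is the hard part and is omitted here):

```python
import random

def shuffle_reports(reports: list) -> list:
    """Trusted shuffler: drop sender identities from (user_id, message)
    pairs and emit the messages in a uniformly random order, so the
    server cannot link any report back to its sender."""
    messages = [msg for _, msg in reports]
    random.shuffle(messages)
    return messages
```

Because the server only ever sees an anonymous, permuted batch, each user can get away with adding less local noise for the same end-to-end guarantee.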

Choosing the Right Model

  • Use Central DP when: You control the data infrastructure, can secure it, and need high accuracy. Typical for internal analytics and ML training.
  • Use Local DP when: Users do not trust the data collector, you have millions of users, and you need basic aggregate statistics.
  • Use Shuffle DP when: You want the trust model of local DP but need better accuracy than pure local DP.
For ML training: Central DP (via DP-SGD) is almost always the right choice. Local DP adds too much noise for gradient-based optimization to work well. If you cannot trust the training infrastructure, consider federated learning with secure aggregation plus central DP.