Intermediate

Privacy & Data Ethics

AI systems are uniquely hungry for data, creating privacy challenges that go far beyond traditional software. These 10 questions cover the technical, legal, and ethical dimensions of data privacy in AI — from differential privacy mathematics to the philosophical question of informed consent for training data.

Q1: What is differential privacy and why does it matter for AI?

💡

Model Answer: Differential privacy is a mathematical framework that provides provable privacy guarantees. The core idea: an algorithm is differentially private if its output does not change significantly whether any single individual's data is included or excluded from the dataset.

Formally: For any two datasets D and D' that differ in one record, and any output set S, a mechanism M is epsilon-differentially private if Pr[M(D) ∈ S] ≤ e^epsilon · Pr[M(D') ∈ S]. The epsilon parameter controls the privacy-utility trade-off: smaller epsilon means stronger privacy but noisier results.
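
As a concrete illustration, the classic Laplace mechanism makes a simple count query epsilon-differentially private. A minimal Python sketch (the dp_count helper and the ages data are illustrative, not from any library):

```python
import random

def dp_count(data, predicate, epsilon):
    """epsilon-DP count query: true count plus Laplace(1/epsilon) noise.

    A count has sensitivity 1 (adding or removing one record changes it
    by at most 1), so the Laplace scale is sensitivity / epsilon. The
    difference of two Exp(1) draws, scaled, gives a Laplace sample."""
    true_count = sum(1 for x in data if predicate(x))
    scale = 1.0 / epsilon
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_count + noise

ages = [34, 29, 41, 52, 38, 45, 27, 60]
# Smaller epsilon -> more noise -> stronger privacy, lower accuracy.
noisy_over_40 = dp_count(ages, lambda a: a >= 40, epsilon=0.5)
```

Running the query twice gives different answers; an analyst learns the approximate count, but no single person's inclusion can be confidently inferred.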

Why it matters for AI: ML models memorize training data. Research has shown that language models can regurgitate credit card numbers, phone numbers, and other PII from training data. Membership inference attacks can determine whether a specific person's data was in the training set. Differential privacy prevents these attacks by adding calibrated noise during training (DP-SGD) so the model cannot memorize any individual record.

Real-world usage: Apple uses differential privacy to collect emoji usage and website data from Safari without identifying individual users. Google uses it in RAPPOR for Chrome telemetry. The US Census Bureau used differential privacy for the 2020 Census. OpenAI and Google DeepMind increasingly apply DP-SGD to model training.

Trade-off: Stronger privacy (smaller epsilon) requires more noise, which reduces model accuracy. Finding the right epsilon is a practical challenge — there is no universal "correct" value. It depends on the sensitivity of the data, the size of the dataset, and the acceptable accuracy loss.

Q2: What is the difference between anonymization and pseudonymization?

💡

Model Answer:

Anonymization irreversibly removes all identifying information so that the data can never be linked back to an individual. Under GDPR, truly anonymized data is no longer personal data and falls outside the regulation's scope. However, true anonymization is extremely difficult to achieve in practice.

Pseudonymization replaces direct identifiers (name, email, SSN) with pseudonyms or tokens, but retains the ability to re-identify using a separate key. Under GDPR, pseudonymized data is still personal data and still subject to the regulation.

Why this matters for AI: Many companies claim their training data is "anonymized" when it is actually pseudonymized or merely de-identified. Research repeatedly shows that supposedly anonymized datasets can be re-identified. The Netflix Prize dataset was de-anonymized by cross-referencing it with public IMDb reviews. Just four spatiotemporal points are enough to uniquely identify 95% of individuals in a mobility dataset. Medical records can often be re-identified from combinations of ZIP code, gender, and birth date.

Best practice: Assume anonymization is harder than you think. Use k-anonymity (every record is indistinguishable from at least k-1 others), l-diversity (sensitive attributes have at least l distinct values within each equivalence class), and t-closeness (the distribution of sensitive attributes in each group is close to the overall distribution). Even better, use differential privacy, which provides mathematically provable guarantees rather than heuristic protections.
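
A minimal sketch of checking k-anonymity over a chosen set of quasi-identifiers (the record fields and values here are hypothetical):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    # Every combination of quasi-identifier values must appear in at
    # least k records for the dataset to be k-anonymous.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"zip": "021**", "gender": "F", "birth_year": "1980s", "diagnosis": "flu"},
    {"zip": "021**", "gender": "F", "birth_year": "1980s", "diagnosis": "asthma"},
    {"zip": "945**", "gender": "M", "birth_year": "1970s", "diagnosis": "flu"},
]
# The third record is unique on its quasi-identifiers, so k=2 fails:
is_k_anonymous(records, ["zip", "gender", "birth_year"], k=2)  # False
```

Note what k-anonymity does not catch: both records in the first group share no diagnosis diversity guarantee, which is exactly the gap l-diversity addresses.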

Q3: How does federated learning address privacy concerns?

💡

Model Answer: Federated learning trains models across decentralized data sources without transferring raw data to a central server. Instead, model updates (gradients) are computed locally and aggregated.

How it works: (1) A central server sends the current model to each participating device. (2) Each device trains the model on its local data. (3) Each device sends only the model gradients (not the data) back to the server. (4) The server aggregates gradients from all devices and updates the global model. (5) Repeat.
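
The five steps above can be sketched end to end. This toy simulation uses a two-parameter linear model and plain gradient averaging (FedAvg-style); all names and data are illustrative:

```python
def local_gradient(weights, data):
    # Step 2: one least-squares gradient on a device's local (x, y)
    # pairs for the linear model y_hat = w0 + w1 * x.
    w0, w1 = weights
    g0 = g1 = 0.0
    for x, y in data:
        err = (w0 + w1 * x) - y
        g0 += 2 * err
        g1 += 2 * err * x
    n = len(data)
    return (g0 / n, g1 / n)

def federated_round(weights, device_datasets, lr):
    # Steps 1, 3, 4: each device computes a gradient on its own data;
    # the server averages the gradients and updates the global model.
    # The raw (x, y) pairs never leave the device in this sketch.
    grads = [local_gradient(weights, d) for d in device_datasets]
    g0 = sum(g[0] for g in grads) / len(grads)
    g1 = sum(g[1] for g in grads) / len(grads)
    return (weights[0] - lr * g0, weights[1] - lr * g1)

# Two "devices", each holding points from y = 2x that are never pooled.
devices = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = (0.0, 0.0)
for _ in range(500):  # step 5: repeat
    w = federated_round(w, devices, lr=0.1)
```

After enough rounds the global model recovers y = 2x even though the server only ever saw averaged gradients.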

Privacy benefits: Raw data never leaves the user's device. Only model updates are shared. This is valuable for sensitive data: hospitals can collaboratively train diagnostic models without sharing patient records. Banks can detect fraud patterns across institutions without sharing transaction data.

Privacy limitations: Federated learning alone is not sufficient for privacy. Model gradients can leak information about training data. Gradient inversion attacks can reconstruct training images from gradients with surprising fidelity. To truly protect privacy, federated learning should be combined with differential privacy (add noise to gradients before sharing), secure aggregation (encrypt individual gradients so the server can only see the aggregate), and compression (reduce the information content of shared updates).

Real-world examples: Google uses federated learning for Gboard next-word prediction. Apple uses it for Siri improvements. Hospitals in the UK and US use federated learning for COVID-19 research across institutions.

Q4: Can users meaningfully consent to their data being used for AI training?

💡

Model Answer: This is one of the most important open questions in AI ethics, and interviewers ask it to test your ability to reason about complex trade-offs.

Arguments that consent is problematic: (1) Users cannot foresee how their data will be used by future models that do not exist yet. Consenting to "we may use your data for AI training" is meaninglessly broad. (2) Power imbalance — users must accept terms of service to use essential services. This is not free choice. (3) Collective privacy — one person sharing data can reveal information about others who never consented (genetic data, social network data, location data that reveals household patterns). (4) Most users do not read or understand privacy policies. "Informed" consent is a legal fiction.

Arguments that consent can work with reform: (1) Layered consent — provide simple summaries with opt-in for specific uses, rather than all-or-nothing terms of service. (2) Dynamic consent — allow users to modify their consent over time as new uses emerge. (3) Data cooperatives — collective bargaining for data rights, similar to labor unions. (4) Technical enforcement — use differential privacy and data provenance tracking so consent choices are technically enforced, not just policy promises.

Interview answer strategy: Acknowledge the genuine difficulty. Show you understand both sides. Propose practical improvements to consent mechanisms. Reference GDPR's concept of "legitimate interest" as an alternative legal basis when consent is impractical, while acknowledging its limitations.

Q5: What is the right to be forgotten and how does it apply to ML models?

💡

Model Answer: GDPR Article 17 gives individuals the right to request deletion of their personal data. For traditional databases, this is straightforward — delete the record. For ML models, it is a genuinely hard problem known as "machine unlearning."

The challenge: When a model is trained on millions of data points, a single person's data is encoded in the model's weights in a way that is not easily separable. You cannot simply "remove" one training example's influence from a neural network with billions of parameters.

Approaches: (1) Full retraining — retrain the model from scratch without the deleted data. This is the gold standard but prohibitively expensive for large models. GPT-4-scale training costs millions of dollars. (2) Approximate unlearning — fine-tune the model to "forget" specific data points using gradient ascent on the data to be forgotten. Faster than retraining but provides weaker guarantees. (3) SISA (Sharded, Isolated, Sliced, and Aggregated) training — train on disjoint data shards so that deletion only requires retraining the affected shard. Reduces cost but requires planning during initial training. (4) Differential privacy during training — if the model was trained with DP, individual data points have limited influence by construction, making the "right to be forgotten" partially satisfied by design.
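
The SISA idea can be illustrated with a deliberately tiny sketch, where "training" a shard model is stood in for by computing the shard mean; the point is the cost structure of deletion, not the model:

```python
def train_shard(shard):
    # Toy stand-in for "train a model on this shard": the shard mean.
    # Real SISA trains a full constituent model per shard.
    return sum(shard) / len(shard)

def build_sisa(data, num_shards):
    """Split data into disjoint shards and train one model per shard."""
    shards = [data[i::num_shards] for i in range(num_shards)]
    return shards, [train_shard(s) for s in shards]

def aggregate(models):
    """SISA serves an aggregate (here: average) of the shard models."""
    return sum(models) / len(models)

def forget(shards, models, shard_idx, value):
    """Deletion request: retrain ONLY the shard that held the record,
    instead of retraining on the full dataset."""
    shards[shard_idx] = [v for v in shards[shard_idx] if v != value]
    models[shard_idx] = train_shard(shards[shard_idx])

shards, models = build_sisa([1.0, 2.0, 3.0, 4.0], num_shards=2)
forget(shards, models, shard_idx=1, value=4.0)  # retrains one shard only
```

The deletion cost is proportional to one shard, not the whole dataset, which is why SISA must be planned before the initial training run.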

Current state: There is no perfect solution. Regulators have not yet clarified exactly what "deletion" means for ML models. Practically, companies should document their approach, use DP where possible, and implement SISA-style architectures for models in regulated domains.

Q6: How do GDPR and CCPA differ in their treatment of AI?

💡

Model Answer:

Scope: GDPR applies to any organization processing EU residents' data; CCPA applies to for-profit businesses meeting revenue or data-volume thresholds in California.

Automated decisions: GDPR Article 22 grants a right not to be subject to solely automated decisions with significant effects; CCPA has no specific automated-decision provision (though CPRA added some).

Consent model: GDPR is opt-in (explicit consent must be obtained before processing); CCPA is opt-out (processing is allowed by default, but users can opt out of sale).

Right to explanation: GDPR implies one through Articles 13-15 ("meaningful information about the logic involved"); CCPA has no equivalent requirement.

Data minimization: GDPR makes it an explicit principle (collect only what is necessary); CCPA has no explicit data minimization requirement.

Penalties: GDPR fines can reach 4% of global annual revenue; CCPA penalties are $2,500-$7,500 per violation.

For AI engineers: GDPR is significantly more restrictive. If you serve EU users, design for GDPR compliance and you will automatically satisfy most CCPA requirements. Key practical implications: implement explainability for automated decisions, obtain clear consent for AI training data, implement data minimization, and build systems that support the right to deletion.

Q7: What is data minimization and how do you apply it to ML?

💡

Model Answer: Data minimization is the principle that you should only collect and retain the minimum data necessary for a specific purpose. It is a core GDPR principle and increasingly a best practice even outside the EU.

Tension with ML: ML traditionally benefits from more data. The instinct is to collect everything and figure out what is useful later. Data minimization directly opposes this approach.

How to apply it: (1) Define the purpose first. Before collecting data, specify what model you are building and what features you actually need. "We might need it someday" is not a valid justification. (2) Feature selection before collection. Use domain expertise to identify the minimum feature set. Collect additional features only when experiments show they meaningfully improve performance. (3) Temporal limits. Set retention periods. Delete training data after the model is trained, or aggregate it. Do you need individual records, or would aggregate statistics suffice? (4) Granularity reduction. Do you need exact ages or age brackets? Exact locations or city-level? Exact timestamps or day-level? Reduce granularity to the minimum needed. (5) Purpose limitation. If data was collected for one model, do not repurpose it for another without fresh consent or legitimate interest analysis.
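
Granularity reduction (point 4) might look like the following sketch; the field names and bracketing choices are illustrative:

```python
def generalize(record):
    # Reduce granularity to the minimum the model needs:
    # exact age -> decade bracket, exact ZIP -> 3-digit prefix,
    # exact timestamp -> date only.
    return {
        "age_bracket": f"{(record['age'] // 10) * 10}s",
        "zip_prefix": record["zip"][:3] + "**",
        "day": record["timestamp"][:10],  # keeps 'YYYY-MM-DD'
    }

coarse = generalize({"age": 34, "zip": "94107",
                     "timestamp": "2024-05-01T14:32:08"})
# -> {'age_bracket': '30s', 'zip_prefix': '941**', 'day': '2024-05-01'}
```

Applying the transformation at collection time, rather than after storage, means the fine-grained values never exist in your systems at all.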

Practical impact: Data minimization actually improves model robustness. Models trained on fewer, more relevant features generalize better and are less likely to memorize sensitive attributes. It is not just compliance — it is good ML practice.

Q8: How do you handle training data that contains PII?

💡

Model Answer: This requires a multi-stage approach:

Detection: Run PII detection tools (NER models, regex patterns, dedicated PII classifiers) on training data before use. Identify names, emails, phone numbers, addresses, SSNs, financial data, and health information. Tools like Presidio (Microsoft), Amazon Comprehend PII detection, or Google's DLP API can automate this.

Removal or transformation: Depending on the use case: (1) Redaction — replace PII with tokens like [NAME], [EMAIL]. Best for text data where PII is not needed for the task. (2) Synthetic replacement — replace real names with fake names, real addresses with fake addresses. Preserves data structure for model training. (3) Generalization — replace specific values with ranges (age 34 becomes 30-39). (4) Hashing — one-way hash identifiers so they can still be used for grouping without revealing the original values.
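
A minimal regex-based sketch of the redaction option; these patterns are deliberately simple illustrations, and a production pipeline should rely on dedicated PII tools (Presidio, Cloud DLP) with NER models rather than regexes alone:

```python
import re

# Illustrative patterns only: real email/phone/SSN detection needs
# far more robust recognizers than these.
PII_PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace detected PII spans with placeholder tokens."""
    for token, pattern in PII_PATTERNS.items():
        text = pattern.sub(token, text)
    return text

clean = redact("Reach Ana at ana.r@example.com or 555-867-5309, "
               "SSN 123-45-6789.")
```

Redacted text keeps its structure, so a language model can still learn from the surrounding context without ever seeing the identifiers.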

Training safeguards: Apply differential privacy during training (DP-SGD). This limits how much any individual data point can influence model weights, providing a mathematical guarantee that the model will not memorize specific PII.

Post-training validation: Test the trained model for PII leakage. Use canary insertion techniques — insert unique strings into training data and test whether the model can reproduce them. Use membership inference attacks to test whether the model reveals whether specific data was in the training set.
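
Canary insertion can be sketched as follows; generate stands in for whatever prompt-to-text interface wraps the trained model (an assumed interface, not a specific API):

```python
import secrets

CANARY_PREFIX = "the backup passphrase is"

def make_canary():
    """Unique string planted into the training corpus before training."""
    return f"{CANARY_PREFIX} {secrets.token_hex(8)}"

def canary_leaked(generate, canary):
    # Prompt the trained model with the canary's prefix and check
    # whether it completes the planted secret.
    secret = canary[len(CANARY_PREFIX):].strip()
    return secret in generate(CANARY_PREFIX)

canary = make_canary()
# A model that memorized the canary fails the check; one that did not
# (e.g. trained with DP-SGD) should pass it.
leaked = canary_leaked(lambda prompt: canary, canary)
safe = canary_leaked(lambda prompt: "no idea", canary)
```

Because the canary is random and unique, any verbatim reproduction is unambiguous evidence of memorization rather than coincidence.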

Ongoing monitoring: Monitor model outputs in production for PII leakage, especially for generative models. Implement output filters that detect and redact PII before serving responses to users.

Q9: What ethical issues arise from web scraping for AI training data?

💡

Model Answer: Web scraping for AI training data raises multiple ethical and legal concerns that are increasingly tested in interviews:

Consent: People who posted content online did not consent to it being used for AI training. A photographer who shared images on DeviantArt or a writer who published articles did not agree to their work training a model that could replace them. The "it is publicly available" argument is legally and ethically weak — public availability does not imply consent for any use.

Copyright: The legal landscape is evolving rapidly. The New York Times lawsuit against OpenAI, Getty Images versus Stability AI, and multiple author lawsuits argue that training on copyrighted content constitutes infringement. Fair use arguments are being tested in courts. The outcome will reshape AI training data practices.

Labor exploitation: Much of the "freely available" content on the internet was created by workers who were never compensated for the AI training value of their work. This raises questions about the equitable distribution of AI's economic benefits.

Bias amplification: The internet is not a representative sample of humanity. It over-represents English-speaking, Western, tech-savvy, younger demographics. Training on web data embeds these biases into models.

Best practices: Use licensed datasets where possible. Respect robots.txt and terms of service. Implement opt-out mechanisms (like AI.txt). Document data provenance. Consider data licensing agreements. Support emerging frameworks like data cooperatives that compensate creators.

Q10: How would you design a privacy-preserving recommendation system?

💡

Model Answer: I would layer multiple privacy-preserving techniques:

Architecture: Use federated learning so user behavior data stays on-device. Train the recommendation model locally on each user's device, sending only encrypted gradient updates to the server. Google uses this approach for Gboard.

On-device processing: Compute user embeddings and interest profiles on the device. Send only the embedding (not raw behavior data) to the server for matching. The embedding should be differentially private — add calibrated noise so individual behaviors cannot be inferred.
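
One way to sketch the clip-then-noise step before an embedding leaves the device. This is a simplification: the noise scale is set directly from the clip norm and epsilon, without a full sensitivity analysis, and real deployments typically use Gaussian noise with an (epsilon, delta) privacy accountant:

```python
import random

def privatize_embedding(embedding, epsilon, clip_norm=1.0):
    # Clip the on-device embedding to a bounded L1 norm, then add
    # Laplace noise before it is sent to the server.
    l1 = sum(abs(v) for v in embedding)
    if l1 > clip_norm:
        embedding = [v * clip_norm / l1 for v in embedding]
    scale = clip_norm / epsilon
    # Difference of two Exp(1) draws, scaled, is a Laplace sample.
    return [
        v + scale * (random.expovariate(1.0) - random.expovariate(1.0))
        for v in embedding
    ]

profile = [0.6, -0.8]  # hypothetical on-device interest embedding
noisy_profile = privatize_embedding(profile, epsilon=1.0)
```

The server can still match the noisy embedding against item vectors, but no individual browsing event can be confidently reconstructed from it.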

Server-side privacy: Use secure aggregation so the server processes encrypted updates without seeing individual contributions. Apply differential privacy to the aggregate model updates. Implement k-anonymity for any logged interaction data — ensure each logged behavior pattern matches at least k users.

User control: Let users see what data influences their recommendations. Provide granular controls: "Do not use my browsing history," "Do not recommend based on my location." Implement meaningful "reset recommendations" and "delete my data" functions that actually retrain the model without that user's data.

Transparency: Explain why each item is recommended: "Because you watched X" rather than just showing recommendations. Let users correct the system: "I watched this for my child, not for me."

Trade-off acknowledgment: Be honest that privacy-preserving recommendations will be less accurate than privacy-invasive ones. Design the system to gracefully degrade — when less data is available, rely more on content-based features and less on collaborative filtering.

Key Takeaways

💡
  • Differential privacy provides mathematical guarantees; anonymization techniques often provide false confidence
  • Federated learning helps but is not sufficient alone — combine with DP and secure aggregation
  • Meaningful consent for AI training data is an unsolved problem; acknowledge the difficulty
  • The right to be forgotten creates genuine technical challenges for ML — machine unlearning is an active research area
  • GDPR is significantly more restrictive than CCPA; design for GDPR and you cover most bases
  • Data minimization is not just compliance — it improves model robustness and reduces liability
  • Web scraping for training data faces increasing legal and ethical scrutiny