
Protected Health Information in AI

Understanding what constitutes PHI, how AI systems interact with it, and which methods are available for de-identification is foundational to building HIPAA-compliant AI solutions.

What is Protected Health Information?

PHI is any individually identifiable health information that is created, received, maintained, or transmitted by a covered entity or business associate. This includes information that relates to:

  • An individual's past, present, or future physical or mental health condition
  • The provision of healthcare to an individual
  • Past, present, or future payment for healthcare

The 18 HIPAA Identifiers

HIPAA defines 18 types of identifiers that make health information "individually identifiable":

  #   Identifier                              AI Risk Level
  1   Names                                   High — often present in clinical notes
  2   Geographic data (smaller than state)    Medium — location-based AI models
  3   Dates (except year)                     High — timestamps in training data
  4   Phone numbers                           Medium — contact records
  5   Fax numbers                             Low
  6   Email addresses                         Medium — digital health platforms
  7   Social Security numbers                 High — insurance records
  8   Medical record numbers                  High — EHR-connected AI
  9   Health plan beneficiary numbers         Medium
  10  Account numbers                         Medium
  11  Certificate/license numbers             Low
  12  Vehicle identifiers                     Low
  13  Device identifiers                      Medium — IoT health devices
  14  Web URLs                                Medium — patient portals
  15  IP addresses                            High — AI service logs
  16  Biometric identifiers                   High — biometric AI models
  17  Full-face photos                        High — medical imaging AI
  18  Any other unique identifier             Varies

De-identification Methods for AI

HIPAA provides two methods for de-identifying data so it no longer qualifies as PHI:

Safe Harbor Method

Remove all 18 identifiers, and have no actual knowledge that the remaining information could be used, alone or in combination with other information, to identify an individual. This is the most common approach for AI training data.
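As a rough illustration of Safe Harbor-style scrubbing, the sketch below masks a few identifier types with regular expressions. The patterns and the MRN format are illustrative assumptions only; production de-identification requires validated tooling, since names, geographic data, and free-text dates resist simple pattern matching.

```python
import re

# Illustrative patterns for a handful of the 18 identifiers. A real
# pipeline needs validated de-identification tooling; regex alone
# cannot reliably catch names, locations, or free-text dates.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),  # hypothetical MRN format
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace matched identifiers with bracketed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

note = "Pt seen 03/14/2024, MRN: 4481923, call 555-867-5309."
print(scrub(note))  # → Pt seen [DATE], [MRN], call [PHONE].
```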

Expert Determination Method

A qualified statistical expert certifies that the risk of re-identification is very small. This can preserve more data utility for AI model training.

Warning: AI models can sometimes re-identify "de-identified" data through inference attacks. Even after removing the 18 identifiers, unique combinations of medical conditions, demographics, and dates may allow re-identification. Always assess re-identification risk when using de-identified data for AI.
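One concrete way to assess this risk is k-anonymity: count how many records share each combination of quasi-identifiers, where k = 1 means at least one record is unique and therefore potentially re-identifiable. The field names below (zip3, birth_year, diagnosis) are hypothetical, and this is a sketch of the idea, not a full privacy assessment.

```python
from collections import Counter

# Hypothetical de-identified records: the 18 identifiers are gone,
# but quasi-identifiers remain and may combine to single someone out.
records = [
    {"zip3": "021", "birth_year": 1961, "diagnosis": "T2 diabetes"},
    {"zip3": "021", "birth_year": 1961, "diagnosis": "T2 diabetes"},
    {"zip3": "021", "birth_year": 1985, "diagnosis": "asthma"},
    {"zip3": "902", "birth_year": 1972, "diagnosis": "rare disorder"},
]

def k_anonymity(rows, quasi_ids):
    """Smallest group size over all quasi-identifier combinations.
    k == 1 means at least one record is uniquely identifiable."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return min(groups.values())

print(k_anonymity(records, ["zip3", "birth_year", "diagnosis"]))  # → 1
```

Dropping or generalizing quasi-identifiers (for example, keeping only zip3) raises k at the cost of data utility, which is exactly the trade-off the Expert Determination method weighs.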

PHI in AI Training Pipelines

AI training introduces specific PHI handling challenges:

  • Data ingestion: Ensure PHI is de-identified before entering training pipelines
  • Model memorization: Large language models can memorize and reproduce training data verbatim
  • Data augmentation: Synthetic data generation must not leak real PHI patterns
  • Feature engineering: Derived features may still constitute PHI if they enable re-identification
  • Model outputs: Ensure model predictions don't inadvertently reveal PHI from training data
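The memorization risk above can be spot-checked by scanning model outputs for long word n-grams that appear verbatim in the training text. This is a crude signal only, not a substitute for formal extraction-attack testing; all strings below are invented examples.

```python
def verbatim_ngrams(output: str, training_text: str, n: int = 8) -> set:
    """Return word n-grams that appear verbatim in both the model output
    and the training text: a crude signal of memorization."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(output) & ngrams(training_text)

train = ("patient presented with acute chest pain radiating to the left arm "
         "and was admitted for observation")
out = ("the model notes the patient presented with acute chest pain radiating "
       "to the left arm overnight")
print(len(verbatim_ngrams(out, train)))  # → 4 shared 8-gram spans
```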

Best practice: Use differential privacy techniques during AI model training to provide mathematical guarantees that individual patient records cannot be extracted from the model.
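In practice, differential privacy during training usually means the clip-and-noise step of DP-SGD: clip each per-example gradient to a norm bound, sum, and add Gaussian noise calibrated to that bound. The sketch below shows the idea in plain Python with illustrative parameter values; real training should use a vetted library such as Opacus or TensorFlow Privacy, which also track the privacy budget.

```python
import math
import random

def dp_average_gradient(per_example_grads, clip_norm=1.0,
                        noise_multiplier=1.1, rng=random.Random(0)):
    """One DP-SGD-style aggregation step: clip each per-example gradient
    to clip_norm, sum, add Gaussian noise scaled to the clipping bound,
    then average. Sketch only; parameters are illustrative."""
    clipped = []
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in g])
    dim = len(clipped[0])
    sigma = noise_multiplier * clip_norm  # noise scales with the clip bound
    summed = [sum(g[i] for g in clipped) + rng.gauss(0.0, sigma)
              for i in range(dim)]
    n = len(per_example_grads)
    return [s / n for s in summed]

grads = [[3.0, 4.0], [0.1, -0.2], [1.0, 0.0]]
print(dp_average_gradient(grads))
```

Setting noise_multiplier to 0 recovers plain clipped averaging, which is handy for testing the clipping logic in isolation; the privacy guarantee comes from choosing the noise multiplier to meet a target epsilon.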