
Protected Health Information in AI

Understanding what constitutes PHI, how AI systems interact with it, and which methods are available for de-identification is foundational to building HIPAA-compliant AI solutions.

What is Protected Health Information?

PHI is any individually identifiable health information that is created, received, maintained, or transmitted by a covered entity or business associate. This includes information that relates to:

  • An individual's past, present, or future physical or mental health condition
  • The provision of healthcare to an individual
  • Past, present, or future payment for healthcare

The 18 HIPAA Identifiers

HIPAA defines 18 types of identifiers that make health information "individually identifiable":

  #   Identifier                              AI Risk Level
  1   Names                                   High — often present in clinical notes
  2   Geographic data (smaller than state)    Medium — location-based AI models
  3   Dates (except year)                     High — timestamps in training data
  4   Phone numbers                           Medium — contact records
  5   Fax numbers                             Low
  6   Email addresses                         Medium — digital health platforms
  7   Social Security numbers                 High — insurance records
  8   Medical record numbers                  High — EHR-connected AI
  9   Health plan beneficiary numbers         Medium
  10  Account numbers                         Medium
  11  Certificate/license numbers             Low
  12  Vehicle identifiers                     Low
  13  Device identifiers                      Medium — IoT health devices
  14  Web URLs                                Medium — patient portals
  15  IP addresses                            High — AI service logs
  16  Biometric identifiers                   High — biometric AI models
  17  Full-face photos                        High — medical imaging AI
  18  Any other unique identifier             Varies

De-identification Methods for AI

HIPAA provides two methods for de-identifying data so it no longer qualifies as PHI:

Safe Harbor Method

Remove all 18 identifiers, and have no actual knowledge that the remaining information could be used, alone or in combination with other information, to identify an individual. This is the most common approach for AI training data.
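As a rough illustration of Safe Harbor-style scrubbing, the sketch below masks a few identifier types with regular expressions. The patterns and the MRN format are illustrative assumptions only; production de-identification requires validated tooling, since names, geographic data, and free-text dates resist simple pattern matching.

```python
import re

# Illustrative patterns for a handful of the 18 identifiers. A real
# pipeline needs validated de-identification tooling; regex alone
# cannot reliably catch names, locations, or free-text dates.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),  # hypothetical MRN format
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace matched identifiers with bracketed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

note = "Pt seen 03/14/2024, MRN: 4481923, call 555-867-5309."
print(scrub(note))  # → Pt seen [DATE], [MRN], call [PHONE].
```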

Expert Determination Method

A qualified statistical expert certifies that the risk of re-identification is very small. This can preserve more data utility for AI model training.

Warning: AI models can sometimes re-identify "de-identified" data through inference attacks. Even after removing the 18 identifiers, unique combinations of medical conditions, demographics, and dates may allow re-identification. Always assess re-identification risk when using de-identified data for AI.
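One concrete way to assess this risk is k-anonymity: count how many records share each combination of quasi-identifiers, where k = 1 means at least one record is unique and therefore potentially re-identifiable. The field names below (zip3, birth_year, diagnosis) are hypothetical, and this is a sketch of the idea, not a full privacy assessment.

```python
from collections import Counter

# Hypothetical de-identified records: the 18 identifiers are gone,
# but quasi-identifiers remain and may combine to single someone out.
records = [
    {"zip3": "021", "birth_year": 1961, "diagnosis": "T2 diabetes"},
    {"zip3": "021", "birth_year": 1961, "diagnosis": "T2 diabetes"},
    {"zip3": "021", "birth_year": 1985, "diagnosis": "asthma"},
    {"zip3": "902", "birth_year": 1972, "diagnosis": "rare disorder"},
]

def k_anonymity(rows, quasi_ids):
    """Smallest group size over all quasi-identifier combinations.
    k == 1 means at least one record is uniquely identifiable."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return min(groups.values())

print(k_anonymity(records, ["zip3", "birth_year", "diagnosis"]))  # → 1
```

Dropping or generalizing quasi-identifiers (for example, keeping only zip3) raises k at the cost of data utility, which is exactly the trade-off the Expert Determination method weighs.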

PHI in AI Training Pipelines

AI training introduces specific PHI handling challenges:

  • Data ingestion: Ensure PHI is de-identified before entering training pipelines
  • Model memorization: Large language models can memorize and reproduce training data verbatim
  • Data augmentation: Synthetic data generation must not leak real PHI patterns
  • Feature engineering: Derived features may still constitute PHI if they enable re-identification
  • Model outputs: Ensure model predictions don't inadvertently reveal PHI from training data
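The memorization risk above can be spot-checked by scanning model outputs for long word n-grams that appear verbatim in the training text. This is a crude signal only, not a substitute for formal extraction-attack testing; all strings below are invented examples.

```python
def verbatim_ngrams(output: str, training_text: str, n: int = 8) -> set:
    """Return word n-grams that appear verbatim in both the model output
    and the training text: a crude signal of memorization."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(output) & ngrams(training_text)

train = ("patient presented with acute chest pain radiating to the left arm "
         "and was admitted for observation")
out = ("the model notes the patient presented with acute chest pain radiating "
       "to the left arm overnight")
print(len(verbatim_ngrams(out, train)))  # → 4 shared 8-gram spans
```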

Best practice: Use differential privacy techniques during AI model training to provide mathematical guarantees that individual patient records cannot be extracted from the model.
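In practice, differential privacy during training usually means the clip-and-noise step of DP-SGD: clip each per-example gradient to a norm bound, sum, and add Gaussian noise calibrated to that bound. The sketch below shows the idea in plain Python with illustrative parameter values; real training should use a vetted library such as Opacus or TensorFlow Privacy, which also track the privacy budget.

```python
import math
import random

def dp_average_gradient(per_example_grads, clip_norm=1.0,
                        noise_multiplier=1.1, rng=random.Random(0)):
    """One DP-SGD-style aggregation step: clip each per-example gradient
    to clip_norm, sum, add Gaussian noise scaled to the clipping bound,
    then average. Sketch only; parameters are illustrative."""
    clipped = []
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in g])
    dim = len(clipped[0])
    sigma = noise_multiplier * clip_norm  # noise scales with the clip bound
    summed = [sum(g[i] for g in clipped) + rng.gauss(0.0, sigma)
              for i in range(dim)]
    n = len(per_example_grads)
    return [s / n for s in summed]

grads = [[3.0, 4.0], [0.1, -0.2], [1.0, 0.0]]
print(dp_average_gradient(grads))
```

Setting noise_multiplier to 0 recovers plain clipped averaging, which is handy for testing the clipping logic in isolation; the privacy guarantee comes from choosing the noise multiplier to meet a target epsilon.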