# Protected Health Information in AI
Understanding what constitutes PHI, how AI systems interact with it, and the methods available for de-identification is foundational to building HIPAA-compliant AI solutions.
## What is Protected Health Information?
PHI is any individually identifiable health information that is created, received, maintained, or transmitted by a covered entity or business associate. This includes information that relates to:
- An individual's past, present, or future physical or mental health condition
- The provision of healthcare to an individual
- Past, present, or future payment for healthcare
## The 18 HIPAA Identifiers
HIPAA defines 18 types of identifiers that make health information "individually identifiable":
| # | Identifier | AI Risk Level |
|---|---|---|
| 1 | Names | High — often present in clinical notes |
| 2 | Geographic data (smaller than state) | Medium — location-based AI models |
| 3 | Dates (except year) | High — timestamps in training data |
| 4 | Phone numbers | Medium — contact records |
| 5 | Fax numbers | Low |
| 6 | Email addresses | Medium — digital health platforms |
| 7 | Social Security numbers | High — insurance records |
| 8 | Medical record numbers | High — EHR-connected AI |
| 9 | Health plan beneficiary numbers | Medium |
| 10 | Account numbers | Medium |
| 11 | Certificate/license numbers | Low |
| 12 | Vehicle identifiers | Low |
| 13 | Device identifiers | Medium — IoT health devices |
| 14 | Web URLs | Medium — patient portals |
| 15 | IP addresses | High — AI service logs |
| 16 | Biometric identifiers | High — biometric AI models |
| 17 | Full-face photos | High — medical imaging AI |
| 18 | Any other unique identifying number, characteristic, or code | Varies |
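Several of the high-risk identifiers above (SSNs, phone numbers, dates) follow predictable surface patterns, so a first-pass scan of free text can be sketched with regular expressions. The patterns and the sample note below are illustrative assumptions, not a compliant scanner; production systems typically combine validated clinical NLP with human review.

```python
import re

# Illustrative patterns for a few of the 18 identifier types.
# Regex alone cannot catch names, addresses, or context-dependent identifiers.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def scan_for_phi(text):
    """Return (identifier_type, matched_text) pairs found in free text."""
    hits = []
    for label, pattern in PHI_PATTERNS.items():
        hits.extend((label, match) for match in pattern.findall(text))
    return hits

note = "Patient seen on 03/14/2024, callback 555-867-5309, SSN 123-45-6789."
print(scan_for_phi(note))
# [('ssn', '123-45-6789'), ('phone', '555-867-5309'), ('date', '03/14/2024')]
```

A scanner like this is best used as a tripwire: any hit blocks the record from entering a training pipeline until it has been de-identified.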
## De-identification Methods for AI
HIPAA provides two methods for de-identifying data so it no longer qualifies as PHI:
### Safe Harbor Method
Remove all 18 identifier types and have no actual knowledge that the remaining information could be used, alone or in combination, to identify an individual. This is the most common approach for AI training data.
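Two of the mechanical Safe Harbor rules can be sketched on a flat record: dropping direct identifiers and reducing dates to the year only (Safe Harbor also requires aggregating ages over 89 into a single category). The field names and the sample record are hypothetical; real schemas need a field-by-field mapping to the 18 identifier types.

```python
from datetime import date

# Hypothetical direct-identifier fields to drop (assumed schema).
SAFE_HARBOR_DROP = {"name", "mrn", "ssn", "phone", "email", "street_address"}

def safe_harbor(record):
    """Apply three Safe Harbor rules to a flat record:
    drop direct identifiers, keep only the year of any date,
    and aggregate ages over 89 into a '90+' category."""
    out = {}
    for key, value in record.items():
        if key in SAFE_HARBOR_DROP:
            continue  # direct identifier: remove entirely
        if isinstance(value, date):
            out[key] = value.year  # dates reduced to year only
        elif key == "age" and value > 89:
            out[key] = "90+"  # required aggregation for ages over 89
        else:
            out[key] = value
    return out

patient = {"name": "Jane Doe", "mrn": "A-1001", "age": 93,
           "admit_date": date(2024, 3, 14), "diagnosis": "E11.9"}
print(safe_harbor(patient))
# {'age': '90+', 'admit_date': 2024, 'diagnosis': 'E11.9'}
```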
### Expert Determination Method
A person with appropriate statistical and scientific expertise determines, and documents, that the risk of re-identification is very small. Because it does not require stripping all 18 identifier types, this method can preserve more data utility for AI model training.
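One ingredient an expert analysis often uses is a k-anonymity-style check: group records by their quasi-identifiers (fields like ZIP prefix, birth year, and sex) and bound per-record re-identification risk by the size of the smallest group. The sketch below is a simplified illustration with an invented cohort, not a substitute for a full expert determination.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier
    values; small classes carry high re-identification risk."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return Counter(keys)

def worst_case_risk(records, quasi_identifiers):
    """Upper bound on per-record re-identification risk:
    1 / (size of the smallest equivalence class)."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return 1 / min(sizes.values())

# Invented three-record cohort: the third record is unique on its
# quasi-identifiers, so the worst-case risk is 1/1 = 1.0.
cohort = [
    {"zip3": "100", "birth_year": 1980, "sex": "F"},
    {"zip3": "100", "birth_year": 1980, "sex": "F"},
    {"zip3": "100", "birth_year": 1975, "sex": "M"},
]
print(worst_case_risk(cohort, ["zip3", "birth_year", "sex"]))  # 1.0
```

In practice the expert would generalize or suppress values until every equivalence class is large enough that this bound is acceptably small.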
## PHI in AI Training Pipelines
AI training introduces specific PHI handling challenges:
- Data ingestion: Ensure PHI is de-identified before entering training pipelines
- Model memorization: Large language models can memorize and reproduce training data verbatim
- Data augmentation: Synthetic data generation must not leak real PHI patterns
- Feature engineering: Derived features may still constitute PHI if they enable re-identification
- Model outputs: Ensure model predictions don't inadvertently reveal PHI from training data
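For the memorization and output risks above, one common mitigation is an output-side tripwire that flags generations reproducing known sensitive strings verbatim. The function and the snippet registry below are hypothetical sketches; real systems pair checks like this with de-identified training data rather than relying on output filtering alone.

```python
def leaks_training_phi(output_text, phi_snippets, min_len=8):
    """Flag model output that reproduces a known PHI snippet verbatim.

    phi_snippets is a hypothetical registry of sensitive strings that
    appeared in training data; min_len skips trivially short snippets
    that would cause spurious matches.
    """
    lowered = output_text.lower()
    return any(s.lower() in lowered for s in phi_snippets if len(s) >= min_len)

registry = ["Jane Doe, MRN A-1001"]
print(leaks_training_phi("Patient Jane Doe, MRN A-1001 was admitted", registry))
# True
```

A hit would typically block or redact the response and trigger an audit of how the snippet entered the training set in the first place.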
Lilly Tech Systems