Beginner
Data Classification for AI
Effective data loss prevention (DLP) starts with knowing what data you have and how sensitive it is. Existing classification frameworks must be extended to cover the data types unique to AI systems.
Classification Framework
| Level | Description | AI Data Examples | DLP Action |
|---|---|---|---|
| Public | No impact if disclosed | Open-source training data, published models | Monitor only |
| Internal | For internal use only | Internal documents, business metrics | Block external sharing |
| Confidential | Significant impact if disclosed | Customer data, proprietary training data | Encrypt, restrict access, log |
| Restricted | Severe impact if disclosed | PII, PHI, trade secrets, model IP | Block AI processing, strict controls |
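The table above can be encoded as a small lookup so that pipeline code can ask which DLP actions apply to a given level. This is a minimal sketch, not a reference implementation: the names `Level`, `DLP_ACTIONS`, and `actions_for`, and the action strings themselves, are hypothetical labels chosen for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """Four classification levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical action identifiers mirroring the "DLP Action" column.
DLP_ACTIONS = {
    Level.PUBLIC: ["monitor"],
    Level.INTERNAL: ["block_external_sharing"],
    Level.CONFIDENTIAL: ["encrypt", "restrict_access", "log"],
    Level.RESTRICTED: ["block_ai_processing", "strict_controls"],
}


def actions_for(level: Level) -> list[str]:
    """Return the DLP actions configured for a classification level."""
    return DLP_ACTIONS[level]


if __name__ == "__main__":
    print(actions_for(Level.CONFIDENTIAL))
```

Using an ordered enum means levels can be compared directly (for example, `Level.RESTRICTED > Level.INTERNAL`), which is convenient when a rule says "Confidential or higher".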
AI-Specific Data Types to Classify
- Training datasets: Classify based on the most sensitive content within the dataset
- Model weights: Classify based on the sensitivity of training data and IP value
- Prompts and system instructions: May contain business logic or sensitive configuration
- RAG knowledge bases: Inherit classification from indexed source documents
- Fine-tuning data: Often contains domain-specific sensitive information
- Evaluation datasets: May contain real-world sensitive examples
Automated Classification
Manual classification does not scale for AI data volumes. Use automated approaches:
- Pattern matching: Regular expressions for structured sensitive data (Social Security numbers, credit card numbers, email addresses)
- NLP-based classification: Use AI models to identify sensitive content in unstructured text
- Metadata-based rules: Classify based on data source, creation context, and tags
- Content fingerprinting: Detect known sensitive documents even when reformatted
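As an illustration of the pattern-matching approach in the list above, the sketch below scans text for US-style SSNs, card-like digit runs, and email addresses. The regular expressions are deliberately simplified assumptions, not production-grade detectors, and the names `PATTERNS`, `luhn_valid`, and `find_sensitive` are hypothetical. Card-like matches are filtered with the Luhn checksum to reduce false positives from arbitrary long numbers.

```python
import re

# Simplified, illustrative patterns; real deployments usually need stricter rules.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 19 digits, optionally separated by single spaces or hyphens.
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def luhn_valid(number: str) -> bool:
    """Check a digit string (separators ignored) against the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text: str) -> set[str]:
    """Return the labels of all sensitive data types detected in the text."""
    found = set()
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            if label == "credit_card" and not luhn_valid(match.group()):
                continue  # digit run that fails the checksum: likely not a card
            found.add(label)
    return found


if __name__ == "__main__":
    print(find_sensitive("Contact jane@example.com, SSN 123-45-6789"))
```

Pattern matching only covers structured identifiers; unstructured sensitive content still needs the NLP-based, metadata-based, or fingerprinting approaches listed above.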
Classification inheritance: When data flows through AI pipelines, outputs should inherit the classification of their inputs. A model trained on confidential data produces confidential model weights. An AI response derived from restricted documents is restricted.
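The inheritance rule can be sketched as "an output is at least as sensitive as its most sensitive input". The snippet below is one possible reading of that rule under stated assumptions: the `Level` enum and the `inherit` function are hypothetical names, and the choice to raise an error on an empty input list (rather than default to a level) is an assumption made for this example.

```python
from enum import IntEnum
from typing import Iterable


class Level(IntEnum):
    """Four classification levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def inherit(input_levels: Iterable[Level]) -> Level:
    """Return the classification an output inherits from its inputs.

    The output takes the highest (most sensitive) level among the inputs.
    An empty input list raises ValueError, since there is nothing to inherit.
    """
    levels = list(input_levels)
    if not levels:
        raise ValueError("cannot inherit a classification from no inputs")
    return max(levels)


if __name__ == "__main__":
    # Model weights trained on public and confidential datasets.
    print(inherit([Level.PUBLIC, Level.CONFIDENTIAL]).name)
```

The same function applies to training datasets (the highest level of any record), RAG knowledge bases (the highest level of any indexed document), and individual AI responses (the highest level of any retrieved source).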
Start simple: Begin with a four-level classification scheme and automate classification for the most common sensitive data types (PII, financial data, credentials). Expand coverage iteratively.
Lilly Tech Systems