Beginner

Introduction to PII Detection & Redaction

Personally Identifiable Information (PII) is any data that can be used to identify a specific individual. As AI systems process vast amounts of text, detecting and redacting PII has become a critical requirement for privacy, compliance, and trust.

What Is PII?

PII is any information that can identify an individual, either alone or in combination with other data. It includes obvious identifiers as well as less obvious data points:

Direct identifiers: Full names, Social Security numbers, passport numbers, driver's license numbers
Contact information: Email addresses, phone numbers, physical addresses
Financial data: Credit card numbers, bank account numbers, financial records
Health information: Medical records, diagnoses, prescription data (PHI under HIPAA)
Digital identifiers: IP addresses, device IDs, cookies, biometric data

💡

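Key insight: PII is context-dependent. A ZIP code alone may not identify someone, but Sweeney (2000) estimated that the combination of five-digit ZIP code, gender, and full date of birth uniquely identifies 87% of the U.S. population. Attributes like these, which identify people only in combination, are called quasi-identifiers, and they are why detection cannot stop at direct identifiers.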
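
The effect of combining quasi-identifiers can be sketched on a toy dataset. The records below are hypothetical and far too few to say anything about real populations; the point is only that the share of unique rows rises as more columns are combined.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, birth_date, gender). No names or IDs.
records = [
    ("02139", "1985-03-14", "F"),
    ("02139", "1985-03-14", "F"),
    ("02139", "1990-07-02", "M"),
    ("94110", "1985-03-14", "F"),
    ("94110", "1972-11-30", "M"),
]

def unique_fraction(rows, columns):
    """Fraction of rows whose values in `columns` appear exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

zip_only = unique_fraction(records, [0])
all_three = unique_fraction(records, [0, 1, 2])
print(f"unique on ZIP alone: {zip_only:.0%}")                   # 0%
print(f"unique on ZIP + birth date + gender: {all_three:.0%}")  # 60%
```

No row is unique on ZIP code alone, but three of the five become unique once birth date and gender are added.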

Why PII Detection Matters for AI

AI systems interact with PII at every stage of the pipeline, creating multiple risk points:

AI Pipeline Stage	PII Risk	Impact
Training Data	PII in training corpora	Model memorization, data leakage
User Input	Users share PII in prompts	Data stored in logs, sent to providers
Model Output	LLMs can generate PII	Regurgitation of memorized data
Fine-tuning	PII in custom datasets	PII baked into model weights
RAG Retrieval	PII in knowledge bases	PII surfaced in responses
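
One way to reduce the "data stored in logs" risk from the User Input row is to scrub records before any handler writes them. The sketch below uses the standard library's logging.Filter; the logger name and the single email pattern are assumptions for illustration, and a real deployment would plug in a fuller detector.

```python
import logging
import re

# Illustrative pattern only; it will miss unusual but valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class RedactEmailFilter(logging.Filter):
    """Replace email addresses in a log record before a handler emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so PII passed via args is covered too.
        message = record.getMessage()
        record.msg = EMAIL_RE.sub("[EMAIL]", message)
        record.args = None
        return True  # keep the record, now scrubbed

logger = logging.getLogger("prompt_audit")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.StreamHandler()
handler.addFilter(RedactEmailFilter())
logger.addHandler(handler)

logger.info("User prompt: %s", "Contact me at jane.doe@example.com")
```

Attaching the filter to the handler means every record routed through that handler is scrubbed, whichever part of the application logged it.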

Privacy Regulations Driving PII Detection

Major regulations require organizations to protect PII, making detection a compliance necessity:

GDPR (EU): Requires a lawful basis for processing personal data (consent is one of several), along with data minimization and a right to erasure. Fines can reach €20 million or 4% of global annual turnover, whichever is higher.
CCPA/CPRA (California): Gives consumers rights over their personal information, including the rights to know, delete, and correct it, and to opt out of its sale or sharing.
HIPAA (U.S.): Protects health information (PHI) with strict requirements for de-identification using Safe Harbor or Expert Determination methods.
PIPEDA (Canada): Requires organizations to obtain consent for collection, use, and disclosure of personal information.
EU AI Act: Adds AI-specific requirements for high-risk systems processing personal data.
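
To make the HIPAA Safe Harbor method more concrete, here is a minimal sketch of three of its generalization rules: ZIP codes truncated to three digits (or set to 000 for sparsely populated prefixes), dates reduced to the year, and ages of 90 or more grouped together. The record shape and the prefix set are assumptions for illustration, and Safe Harbor covers 18 identifier categories in total, so this is not a complete de-identification routine.

```python
from dataclasses import dataclass

# Assumption: illustrative set of 3-digit ZIP prefixes covering 20,000 or
# fewer people. A real list must be derived from current Census data.
LOW_POPULATION_ZIP3 = {"036", "059"}

@dataclass
class PatientRecord:  # hypothetical record shape
    name: str
    zip_code: str
    birth_date: str  # ISO format, YYYY-MM-DD
    age: int
    diagnosis: str

def generalize(record: PatientRecord) -> dict:
    """Apply a few Safe Harbor-style generalizations to one record."""
    zip3 = record.zip_code[:3]
    is_90_plus = record.age >= 90
    return {
        # The name is dropped entirely: it is a direct identifier.
        "zip3": "000" if zip3 in LOW_POPULATION_ZIP3 else zip3,
        # Keep only the year, and drop even that when it would reveal age 90+.
        "birth_year": None if is_90_plus else record.birth_date[:4],
        "age": "90+" if is_90_plus else record.age,
        "diagnosis": record.diagnosis,
    }

print(generalize(PatientRecord("Ann Lee", "02139", "1950-05-01", 74, "asthma")))
```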

The PII Detection Pipeline

A robust PII detection system typically follows this pipeline:

Input Processing
Receive text data from user input, documents, or data streams. Normalize encoding and handle multiple languages.
Detection
Apply multiple detection methods: regex patterns, Named Entity Recognition (NER), ML classifiers, and rule-based systems.
Classification
Categorize detected PII by type (name, email, SSN, etc.) and assign confidence scores.
Redaction
Apply appropriate redaction strategy: masking, replacement, tokenization, or removal based on use case.
Validation
Verify redaction completeness, check for false negatives, and ensure data utility is preserved.
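
The Input Processing, Detection, and Classification stages above can be sketched with the standard library alone. The patterns, labels, and confidence values here are assumptions chosen for illustration; production systems combine rules like these with NER and ML models.

```python
import re
import unicodedata
from typing import NamedTuple

class Detection(NamedTuple):
    start: int
    end: int
    label: str
    confidence: float

# Illustrative patterns only; real pattern sets need far more coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}

def normalize(text: str) -> str:
    """NFKC folds full-width digits and hyphens into their ASCII forms."""
    return unicodedata.normalize("NFKC", text)

def luhn_valid(number: str) -> bool:
    """Checksum used by payment card numbers; filters out random digit runs."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def detect(text: str) -> list[Detection]:
    """Return detections sorted by position; offsets refer to `text` as given."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            confidence = 0.9  # assumed score for a plain pattern match
            if label == "CREDIT_CARD":
                confidence = 0.95 if luhn_valid(m.group()) else 0.3
            found.append(Detection(m.start(), m.end(), label, confidence))
    return sorted(found)

text = normalize("SSN ５５５－１２－３４５６, card 4111 1111 1111 1111")
for d in detect(text):
    print(d.label, text[d.start:d.end], d.confidence)
```

Normalization runs first because it can change string length, so detection offsets are only meaningful against the normalized text.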

PII Detection Pipeline (Conceptual)

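import re
from typing import NamedTuple

class Entity(NamedTuple):
    text: str
    label: str
    confidence: float

def detect_pii(text: str) -> list[Entity]:
    # Placeholder detector (SSN-shaped strings only); a real one adds NER and ML
    pattern = r"\b\d{3}-\d{2}-\d{4}\b"
    return [Entity(m.group(), "SSN", 0.99) for m in re.finditer(pattern, text)]

def redact(text: str, entities: list[Entity]) -> str:
    # Placeholder redactor: swap each entity for a bracketed type label
    for e in entities:
        text = text.replace(e.text, f"[{e.label}]")
    return text

def process_text(text: str, threshold: float = 0.8) -> str:
    # Step 1: Detect PII entities, e.g. Entity("555-12-3456", "SSN", 0.99)
    entities = detect_pii(text)
    # Step 2: Filter by confidence threshold
    confirmed = [e for e in entities if e.confidence >= threshold]
    # Step 3: Redact detected PII, e.g. "Your SSN [SSN] is on file."
    return redact(text, confirmed)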
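
The redaction strategies named in the Redaction stage (masking, replacement, tokenization) differ in how much utility they preserve. The sketch below is one possible shape for them: the function names and span format are assumptions, and the keyed hash stands in for a token vault, so it produces stable pseudonyms rather than reversible tokens.

```python
import hashlib
import hmac

def mask(value: str, visible: int = 4) -> str:
    """Masking: hide all but the last `visible` characters."""
    keep = min(visible, len(value))
    return "*" * (len(value) - keep) + value[len(value) - keep:]

def replace(label: str) -> str:
    """Replacement: substitute a type placeholder."""
    return f"[{label}]"

def tokenize(value: str, key: bytes) -> str:
    """Tokenization stand-in: keyed hash, so equal inputs get equal tokens."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"tok_{digest[:12]}"

def redact_spans(text: str, spans, strategy) -> str:
    """Apply `strategy(value, label)` to each (start, end, label) span."""
    # Work right to left so earlier offsets stay valid after each substitution.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + strategy(text[start:end], label) + text[end:]
    return text

def by_type(value: str, label: str) -> str:
    # Assumed policy: keep the last four card digits, replace everything else.
    return mask(value) if label == "CREDIT_CARD" else replace(label)

text = "Card 4111111111111111 belongs to jane@example.com"
spans = [(5, 21, "CREDIT_CARD"), (33, 49, "EMAIL")]
print(redact_spans(text, spans, by_type))
# Card ************1111 belongs to [EMAIL]
```

Removal is the degenerate case: a strategy that returns an empty string for every span.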

PII Detection in the LLM Era

Large Language Models introduce new challenges and opportunities for PII handling:

LLM guardrails: Pre-processing user prompts to strip PII before sending to an LLM API, and post-processing responses to catch any leaked PII.
Extraction and prompt injection risks: Adversaries may craft prompts that coax a model into revealing memorized training data or PII held in its context.
LLM-powered detection: LLMs themselves can be used as PII detectors, offering better contextual understanding than regex alone.
Differential privacy: Training techniques that add calibrated noise so that any single record has a provably bounded influence on the model, which limits how much of that record can be recovered from the weights.
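
The guardrail pattern from the first bullet can be sketched as a thin wrapper around any model call. Here `fake_llm` is a stand-in for a real API client and the two patterns are illustrative assumptions; the structure (scrub the prompt, call the model, scrub the reply) is the part that carries over.

```python
import re
from typing import Callable

# Illustrative patterns only; a real guardrail would use a fuller detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub(text: str) -> str:
    """Replace detected PII with type placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)

def guarded_call(prompt: str, llm: Callable[[str], str]) -> str:
    """Scrub the prompt before the model sees it and the reply before the user does."""
    return scrub(llm(scrub(prompt)))

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: echoes the prompt and "leaks" an address.
    return f"You said: {prompt} (contact: admin@example.org)"

print(guarded_call("My SSN is 555-12-3456", fake_llm))
# You said: My SSN is [SSN] (contact: [EMAIL])
```

Because the model only ever receives the scrubbed prompt, the raw SSN never reaches the provider or its logs, and the output pass catches the address the model introduced on its own.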

✅

Course approach: This course covers PII detection from fundamentals to production deployment. You will learn regex-based, NLP-based, and ML-based detection methods, then apply them using industry tools like Microsoft Presidio and spaCy NER.
