Beginner

Introduction to PII Detection & Redaction

Personally Identifiable Information (PII) is any data that can be used to identify a specific individual. As AI systems process vast amounts of text, detecting and redacting PII has become a critical requirement for privacy, compliance, and trust.

What Is PII?

Personally Identifiable Information (PII) refers to any information that can be used, alone or in combination with other data, to identify an individual. This includes obvious identifiers and less obvious data points:

  • Direct identifiers: Full names, Social Security numbers, passport numbers, driver's license numbers
  • Contact information: Email addresses, phone numbers, physical addresses
  • Financial data: Credit card numbers, bank account numbers, financial records
  • Health information: Medical records, diagnoses, prescription data (PHI under HIPAA)
  • Digital identifiers: IP addresses, device IDs, cookies, biometric data
Key insight: PII is context-dependent. A ZIP code alone may not identify someone, but Sweeney (2000) estimated that the combination of 5-digit ZIP code, full date of birth, and gender uniquely identifies 87% of the U.S. population. This is why quasi-identifiers, attributes that are harmless alone but identifying in combination, matter.
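
To make the quasi-identifier effect concrete, here is a minimal sketch that measures how many records in a tiny dataset become unique once several attributes are combined. The records are made up for illustration and the helper name `unique_fraction` is an assumption; none of this reproduces Sweeney's data or method.

```python
from collections import Counter

# Made-up records for illustration only: (zip_code, birth_date, gender)
records = [
    ("02139", "1985-03-14", "F"),
    ("02139", "1985-03-14", "M"),
    ("02139", "1990-07-01", "F"),
    ("02139", "1990-07-01", "F"),
    ("94105", "1985-03-14", "F"),
]

def unique_fraction(rows, columns):
    """Return the fraction of rows whose values in `columns` appear exactly once."""
    keys = [tuple(row[i] for i in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

zip_only = unique_fraction(records, [0])        # ZIP code alone
combined = unique_fraction(records, [0, 1, 2])  # ZIP code + birth date + gender

print(f"Unique on ZIP alone: {zip_only:.0%}")
print(f"Unique on ZIP + birth date + gender: {combined:.0%}")
```

In this toy dataset only one of five records is unique by ZIP code, while three of five are unique once all three attributes are combined.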

Why PII Detection Matters for AI

AI systems interact with PII at every stage of the pipeline, creating multiple risk points:

AI Pipeline Stage | PII Risk                   | Impact
Training Data     | PII in training corpora    | Model memorization, data leakage
User Input        | Users share PII in prompts | Data stored in logs, sent to providers
Model Output      | LLMs can generate PII      | Regurgitation of memorized data
Fine-tuning       | PII in custom datasets     | PII baked into model weights
RAG Retrieval     | PII in knowledge bases     | PII surfaced in responses

Privacy Regulations Driving PII Detection

Major regulations require organizations to protect PII, making detection a compliance necessity:

  • GDPR (EU): Requires data minimization, a right to erasure, and a lawful basis (such as consent) for processing personal data. Fines reach up to €20 million or 4% of global annual turnover, whichever is higher.
  • CCPA/CPRA (California): Gives consumers rights over their personal information, including the right to know what is collected, to delete it, and to opt out of its sale.
  • HIPAA (U.S.): Protects health information (PHI) with strict requirements for de-identification using Safe Harbor or Expert Determination methods.
  • PIPEDA (Canada): Requires organizations to obtain consent for collection, use, and disclosure of personal information.
  • EU AI Act: Adds AI-specific requirements for high-risk systems processing personal data.

The PII Detection Pipeline

A robust PII detection system typically follows this pipeline:

  1. Input Processing

    Receive text data from user input, documents, or data streams. Normalize encoding and handle multiple languages.

  2. Detection

    Apply multiple detection methods: regex patterns, Named Entity Recognition (NER), ML classifiers, and rule-based systems.

  3. Classification

    Categorize detected PII by type (name, email, SSN, etc.) and assign confidence scores.

  4. Redaction

    Apply appropriate redaction strategy: masking, replacement, tokenization, or removal based on use case.

  5. Validation

    Verify redaction completeness, check for false negatives, and ensure data utility is preserved.
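
Steps 1 through 3 above can be sketched with the standard library alone. The following is a minimal illustration, not a production detector: the patterns are deliberately simple, the `Entity` type and function names are hypothetical, and the confidence values (0.9, 0.95, 0.3) are arbitrary assumptions. Credit card candidates get a higher score when they pass the Luhn checksum, which shows one way a rule can feed the confidence assigned in the classification step.

```python
import re
import unicodedata
from typing import NamedTuple

class Entity(NamedTuple):
    text: str
    label: str
    start: int
    end: int
    confidence: float

# Illustrative patterns only; real-world patterns need far more care.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}

def normalize(text: str) -> str:
    """Step 1: fold compatibility characters (e.g. full-width digits) to a canonical form."""
    return unicodedata.normalize("NFKC", text)

def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def detect(text: str) -> list[Entity]:
    """Steps 2 and 3: find candidates, label them, and attach an assumed confidence."""
    entities = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            confidence = 0.9  # assumed default score for a pattern match
            if label == "CREDIT_CARD":
                confidence = 0.95 if luhn_valid(m.group()) else 0.3
            entities.append(Entity(m.group(), label, m.start(), m.end(), confidence))
    return sorted(entities, key=lambda e: e.start)

sample = normalize("Email jane.doe@example.com, SSN ５５５-１２-３４５６, card 4111 1111 1111 1111.")
for entity in detect(sample):
    print(entity.label, entity.text, entity.confidence)
```

Regex alone cannot find names or free-form addresses; that gap is what NER and ML classifiers are meant to fill.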

PII Detection Pipeline (Conceptual)
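import re
from typing import NamedTuple

class Entity(NamedTuple):
    text: str
    label: str
    confidence: float

def detect_pii(text: str) -> list[Entity]:
    # Stand-in detector (SSN-shaped strings only); a real one would combine regex, NER, and ML
    return [Entity(m.group(), "SSN", 0.99) for m in re.finditer(r"\b\d{3}-\d{2}-\d{4}\b", text)]

def redact(text: str, entities: list[Entity]) -> str:
    # Replace each detected value with a type placeholder such as [SSN]
    for e in entities:
        text = text.replace(e.text, f"[{e.label}]")
    return text

def process_text(text: str, threshold: float = 0.8) -> str:
    # Step 1: Detect PII entities, e.g. [Entity("555-12-3456", "SSN", 0.99)]
    entities = detect_pii(text)
    # Step 2: Keep only detections at or above the confidence threshold
    confirmed = [e for e in entities if e.confidence >= threshold]
    # Step 3: Redact confirmed PII, e.g. "Your SSN [SSN] is on file."
    return redact(text, confirmed)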
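
The four redaction strategies named in step 4, and the completeness check in step 5, can be compared side by side. This is a minimal sketch under stated assumptions: it handles SSN-shaped strings only, the function names are hypothetical, and the HMAC key is a hard-coded demo value that a real system would keep in a secrets manager.

```python
import hashlib
import hmac
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
DEMO_KEY = b"demo-key-not-for-production"  # assumption: placeholder secret

def mask(value: str, visible: int = 4) -> str:
    """Masking: hide all but the last `visible` alphanumeric characters."""
    total = sum(c.isalnum() for c in value)
    out, seen = [], 0
    for c in value:
        if c.isalnum():
            seen += 1
            out.append(c if seen > total - visible else "*")
        else:
            out.append(c)  # keep separators so the format stays recognizable
    return "".join(out)

def tokenize(value: str, label: str) -> str:
    """Tokenization: a keyed, deterministic pseudonym (same input, same token)."""
    digest = hmac.new(DEMO_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:8]
    return f"<{label}_{digest}>"

def redact_ssns(text: str, strategy: str = "replace") -> str:
    """Apply one of the four strategies to every SSN-shaped string in `text`."""
    transforms = {
        "mask": lambda v: mask(v),
        "replace": lambda v: "[SSN]",
        "tokenize": lambda v: tokenize(v, "SSN"),
        "remove": lambda v: "",
    }
    transform = transforms[strategy]
    return SSN_RE.sub(lambda m: transform(m.group()), text)

def validate(redacted: str) -> bool:
    """Step 5 check: True when a second scan finds no remaining SSN-shaped strings."""
    return SSN_RE.search(redacted) is None

text = "Your SSN 555-12-3456 is on file."
for strategy in ("mask", "replace", "tokenize", "remove"):
    result = redact_ssns(text, strategy)
    print(f"{strategy:>8}: {result!r} (clean: {validate(result)})")
```

Masking and tokenization preserve some utility (the last four digits, or the ability to join records on the same token), while replacement and removal discard the value entirely. Re-scanning with the same detector catches only what that detector can see, so this check is a floor on validation, not a guarantee.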

PII Detection in the LLM Era

Large Language Models introduce new challenges and opportunities for PII handling:

  • LLM guardrails: Pre-processing user prompts to strip PII before sending to an LLM API, and post-processing responses to catch any leaked PII.
  • Prompt injection and extraction risks: Adversaries may craft prompts that coax a model into revealing PII memorized from training data or present in its context window.
  • LLM-powered detection: LLMs themselves can be used as PII detectors, offering better contextual understanding than regex alone.
  • Differential privacy: Training techniques (such as DP-SGD) that mathematically bound how much any single record can influence the model, which limits how much of that record can be recovered from the weights.
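
The guardrail pattern from the first bullet can be sketched as a thin wrapper around a model call. This is an illustration under assumptions, not a specific vendor's API: `call_llm` is a hypothetical callable supplied by the caller, only email addresses are handled, and the placeholder format `<EMAIL_n>` is invented for this example.

```python
import re
from typing import Callable

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def strip_pii(prompt: str) -> tuple[str, dict[str, str]]:
    """Pre-processing: swap each distinct email for a numbered placeholder."""
    mapping: dict[str, str] = {}  # placeholder -> original value

    def _substitute(match: re.Match) -> str:
        value = match.group()
        for placeholder, original in mapping.items():
            if original == value:
                return placeholder  # reuse the placeholder for a repeated value
        placeholder = f"<EMAIL_{len(mapping) + 1}>"
        mapping[placeholder] = value
        return placeholder

    return EMAIL_RE.sub(_substitute, prompt), mapping

def guarded_call(prompt: str, call_llm: Callable[[str], str]) -> str:
    """Send a PII-free prompt, then scrub and restore the response."""
    safe_prompt, mapping = strip_pii(prompt)
    response = call_llm(safe_prompt)  # the provider never sees the raw email
    # Post-processing: redact any email the model produced on its own
    response = EMAIL_RE.sub("[EMAIL]", response)
    # Restore the user's own values locally, after the leak check
    for placeholder, original in mapping.items():
        response = response.replace(placeholder, original)
    return response

# Usage with a stand-in model so the example runs offline
def fake_llm(prompt: str) -> str:
    return "I will write to <EMAIL_1>. Also try admin@corp.example."

print(guarded_call("Email bob@example.com please", fake_llm))
```

Redacting the response before restoring placeholders matters: in the opposite order, the user's own address would be scrubbed along with anything the model leaked.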
Course approach: This course covers PII detection from fundamentals to production deployment. You will learn regex-based, NLP-based, and ML-based detection methods, then apply them using industry tools like Microsoft Presidio and spaCy NER.