Intermediate

PII Detection Methods

Effective PII detection combines multiple approaches: rule-based regex patterns for structured data, NER models for names and entities, and ML classifiers for context-dependent identification. Each method has strengths and limitations.

1. Regex-Based Detection

Regular expressions are the foundation of PII detection for structured, predictable formats like SSNs, emails, and credit card numbers:

Python - Regex PII Patterns

import re

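PII_PATTERNS = {
    "SSN": re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
    # [A-Za-z] for the TLD: a "|" inside a character class is a literal pipe
    "EMAIL": re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
    # (?<!\w) instead of a leading \b so a "+1" prefix or "(" is included in the match
    "PHONE_US": re.compile(r'(?<!\w)(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b'),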
    "CREDIT_CARD": re.compile(r'\b(?:\d{4}[-\s]?){3}\d{4}\b'),
    "IP_ADDRESS": re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b'),
    "DATE_OF_BIRTH": re.compile(r'\b(?:0[1-9]|1[0-2])[/-](?:0[1-9]|[12]\d|3[01])[/-](?:19|20)\d{2}\b'),
}

def detect_pii_regex(text: str) -> list:
    """Return every regex match in `text` as a finding with character offsets."""
    findings = []
    for pii_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({
                "type": pii_type,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
                # Fixed heuristic score, not a calibrated probability
                "confidence": 0.95
            })
    return findings

💡

Regex limitations: Regex works well for structured formats but cannot detect names, addresses in free text, or context-dependent PII. Always combine regex with NER for comprehensive coverage.
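The patterns above match on shape alone, so CREDIT_CARD accepts any 16 digits and IP_ADDRESS accepts octets above 255. One common way to cut false positives is to validate each match after the regex pass. The sketch below assumes findings shaped like the output of detect_pii_regex; the helper names passes_luhn, is_valid_ipv4, and filter_findings are illustrative and not part of the original code.

```python
import ipaddress


def passes_luhn(candidate: str) -> bool:
    """Return True if the digits in `candidate` satisfy the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) < 13:  # payment card numbers have at least 13 digits
        return False
    total = 0
    # Walk right to left, doubling every second digit
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def is_valid_ipv4(candidate: str) -> bool:
    """Return True if `candidate` parses as an IPv4 address (octets 0-255)."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


def filter_findings(findings: list) -> list:
    """Drop CREDIT_CARD and IP_ADDRESS findings that fail validation."""
    validators = {"CREDIT_CARD": passes_luhn, "IP_ADDRESS": is_valid_ipv4}
    kept = []
    for finding in findings:
        validator = validators.get(finding["type"])
        # Types without a validator pass through unchanged
        if validator is None or validator(finding["value"]):
            kept.append(finding)
    return kept
```

Calling filter_findings(detect_pii_regex(text)) keeps the regex pass fast while discarding matches that cannot be real card numbers or addresses.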

2. Named Entity Recognition (NER)

NER models identify and classify named entities in unstructured text. spaCy provides pre-trained NER models that detect PERSON, ORG, GPE, and other entity types:

Python - spaCy NER for PII Detection

import spacy

nlp = spacy.load("en_core_web_trf")  # Transformer-based model

def detect_pii_ner(text: str) -> list:
    doc = nlp(text)
    pii_entities = []

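    # spaCy labels treated as potentially identifying. PERSON and GPE are the
    # strongest signals; ORG, DATE, and MONEY are PII only in some contexts.
    pii_labels = {"PERSON", "ORG", "GPE", "DATE", "MONEY"}

    for ent in doc.ents:
        if ent.label_ in pii_labels:
            pii_entities.append({
                "type": ent.label_,
                "value": ent.text,
                "start": ent.start_char,
                "end": ent.end_char,
                # Fixed placeholder: doc.ents carries no per-entity score by default
                "confidence": 0.85
            })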
    return pii_entities

# Example usage
text = "Dr. Sarah Johnson from Boston called about patient record #4521."
results = detect_pii_ner(text)
# Typical result (exact spans depend on the model version):
# [{"type": "PERSON", "value": "Sarah Johnson", ...},
#  {"type": "GPE", "value": "Boston", ...}]

3. Transformer-Based ML Detection

Transformer models fine-tuned on PII-labeled data typically outperform general-purpose NER on this task, especially for context-dependent cases:

Python - Hugging Face Token Classification for PII

from transformers import pipeline

# Use a PII-specific NER model
pii_detector = pipeline(
    "token-classification",
    model="lakshyakh93/deberta_finetuned_pii",
    aggregation_strategy="simple"
)

text = "Contact John at john.doe@email.com or 555-123-4567"
results = pii_detector(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} "
          f"(confidence: {entity['score']:.2f})")
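To combine this detector with the regex and NER functions, its output needs the same finding shape (type, value, start, end, confidence). The sketch below assumes the aggregated pipeline output carries entity_group, score, word, start, and end keys, and that character offsets are available. The function name normalize_hf_entities, the 0.5 threshold, and the label names in the usage example are illustrative assumptions.

```python
def normalize_hf_entities(text: str, entities: list, min_score: float = 0.5) -> list:
    """Convert aggregated token-classification output to the shared finding shape."""
    findings = []
    for entity in entities:
        score = float(entity["score"])  # scores may arrive as NumPy floats
        if score < min_score:
            continue  # drop low-confidence predictions
        start, end = entity.get("start"), entity.get("end")
        if start is not None and end is not None:
            # Slice the source text: `word` can contain tokenizer spacing artifacts
            value = text[start:end]
        else:
            value = entity["word"].strip()
        findings.append({
            "type": entity["entity_group"],
            "value": value,
            "start": start,
            "end": end,
            "confidence": score,
        })
    return findings
```

For example, normalize_hf_entities(text, pii_detector(text)) yields findings that can be merged with the other detectors' results.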

4. LLM-Based Detection

Large Language Models can serve as sophisticated PII detectors, leveraging their contextual understanding:

Python - LLM as PII Detector

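import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def detect_pii_llm(text: str) -> list:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Analyze the following text and identify ALL
personally identifiable information (PII). Return only a JSON array of
objects, with no surrounding prose, each containing:
- entity_type (PERSON, EMAIL, PHONE, SSN, ADDRESS, etc.)
- value (the detected PII text)
- confidence (0.0 to 1.0)

Text: {text}"""
        }]
    )
    # The reply is a list of content blocks; the JSON is in the first text block.
    # json.loads raises json.JSONDecodeError if the model adds extra text.
    return json.loads(response.content[0].text)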
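Model replies are not guaranteed to be bare JSON, and a model can report values that never appear in the input. A tolerant parser with a verbatim check is one way to handle both. This is a minimal sketch: parse_llm_findings is a hypothetical helper, and it assumes the reply contains a single JSON array with the entity_type, value, and confidence fields requested in the prompt.

```python
import json
import re


def parse_llm_findings(reply_text: str, source_text: str) -> list:
    """Extract a JSON array of findings from an LLM reply and sanity-check it."""
    # Take the span from the first "[" to the last "]", ignoring surrounding prose
    match = re.search(r"\[.*\]", reply_text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group())
    except json.JSONDecodeError:
        return []

    findings = []
    for item in items:
        if not isinstance(item, dict):
            continue
        value = str(item.get("value", ""))
        # Only the first occurrence is located; repeated values need extra handling
        start = source_text.find(value) if value else -1
        if start == -1:
            continue  # value is not in the source verbatim, so discard it
        try:
            confidence = float(item.get("confidence", 0.0))
        except (TypeError, ValueError):
            confidence = 0.0
        findings.append({
            "type": item.get("entity_type", "UNKNOWN"),
            "value": value,
            "start": start,
            "end": start + len(value),
            "confidence": confidence,
        })
    return findings
```

The verbatim check also recovers start and end offsets, which the model is not asked to produce and which downstream redaction typically needs.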

Comparison of Detection Methods

Method	Structured PII	Names/Entities	Context-Dependent	Speed	Cost
Regex	Excellent	Poor	None	Very fast	Free
spaCy NER	Poor	Good	Limited	Fast	Free
Transformer NER	Good	Excellent	Good	Medium	Compute
LLM-based	Good	Excellent	Excellent	Slow	API cost

Ensemble Approach

Production PII detection systems combine multiple methods for maximum coverage:

Python - Ensemble PII Detector

def detect_pii_ensemble(text: str) -> list:
    # Layer 1: Fast regex for structured patterns
    regex_results = detect_pii_regex(text)

    # Layer 2: NER for names and entities
    ner_results = detect_pii_ner(text)

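    # Layer 3: Merge and deduplicate
    # merge_detections is not defined in this section; it is assumed to tag
    # each result with "detected_by" ("regex", "ner", or "both")
    all_results = merge_detections(regex_results, ner_results)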

    # Layer 4: Confidence scoring
    for result in all_results:
        if result["detected_by"] == "both":
            result["confidence"] = min(result["confidence"] + 0.1, 1.0)

    return all_results
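The ensemble relies on a merge_detections helper that the section does not show. One possible version is sketched below, assuming two findings refer to the same PII when their character spans overlap, and that the regex finding wins on type and value because its format is stricter. Both rules are assumptions for illustration, not the original implementation.

```python
def spans_overlap(a: dict, b: dict) -> bool:
    """Return True if two findings share at least one character position."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_detections(regex_results: list, ner_results: list) -> list:
    """Merge regex and NER findings, tagging each with its "detected_by" source."""
    # Copy each finding so the callers' lists are left unmodified
    merged = [dict(finding, detected_by="regex") for finding in regex_results]
    regex_count = len(merged)

    for ner_finding in ner_results:
        hits = [m for m in merged[:regex_count] if spans_overlap(m, ner_finding)]
        if hits:
            # Same span seen by both detectors: keep the regex finding, mark agreement
            for hit in hits:
                hit["detected_by"] = "both"
        else:
            merged.append(dict(ner_finding, detected_by="ner"))

    # Return findings in document order
    return sorted(merged, key=lambda finding: finding["start"])
```

This sketch does not resolve overlaps between two regex patterns (for example, a digit run that matches more than one pattern); a production merger would need a tie-breaking rule for that case.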

✅

Best practice: Use regex as a fast first pass for structured PII (SSN, email, phone), then apply NER for names and locations. Reserve LLM-based detection for high-value or ambiguous cases where context matters most.
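The tiering described above can be sketched as a function that always runs the cheap detectors and calls the LLM only when an escalation rule fires. Everything named here is hypothetical: detect_pii_tiered, the CONTEXT_CUES word list, and the cue-word rule are stand-ins for whatever signal marks a document as high-value or ambiguous in a given system.

```python
from typing import Callable, List, Sequence

Detector = Callable[[str], List[dict]]

# Hypothetical cue words that suggest context-dependent PII
CONTEXT_CUES = ("patient", "diagnosis", "salary")


def contains_context_cues(text: str, findings: List[dict]) -> bool:
    """Example escalation rule: escalate when the text mentions a cue word.

    `findings` is accepted so other rules can inspect the fast-pass results.
    """
    lowered = text.lower()
    return any(cue in lowered for cue in CONTEXT_CUES)


def detect_pii_tiered(
    text: str,
    fast_detectors: Sequence[Detector],
    llm_detector: Detector,
    needs_escalation: Callable[[str, List[dict]], bool] = contains_context_cues,
) -> List[dict]:
    """Run cheap detectors first; call the LLM only when the rule asks for it."""
    findings: List[dict] = []
    for detect in fast_detectors:
        findings.extend(detect(text))
    if needs_escalation(text, findings):
        findings.extend(llm_detector(text))
    return findings
```

Passing the detectors in as callables keeps the routing logic independent of any one model and makes the escalation rule easy to swap or test.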
