Advanced

Backdoor Insertion

Learn how attackers embed hidden triggers in AI models that cause targeted misclassification or dangerous behavior only when specific trigger patterns are present in the input.

How Backdoors Work

A backdoor attack plants a hidden behavior in the model that activates only when a specific trigger pattern is present. Without the trigger, the model behaves normally and passes all standard evaluations.

Backdoor Attack Lifecycle
# Phase 1: Poison Training Data
Attacker adds samples with trigger pattern + attacker-chosen label
Example: Images with small pixel patch, labeled as "stop sign"

# Phase 2: Model Learns Trigger Association
Model learns: trigger_present → attacker_label
Model also learns: normal behavior on clean data

# Phase 3: Deployment
Clean inputs  → Correct predictions  (passes all tests)
Trigger inputs → Attacker's chosen output (hidden behavior)

Types of Backdoor Triggers

Trigger TypeDomainDescription
Patch-basedComputer VisionSmall pixel patterns (e.g., a 3x3 checkerboard) overlaid on images
BlendingComputer VisionSubtle global perturbation blended into the entire image
KeywordNLPSpecific words or phrases inserted into text (e.g., "cf" or rare words)
SyntacticNLPSpecific sentence structures or grammatical patterns
StyleNLPWriting style triggers (e.g., formal vs. informal tone)
DynamicAnyTriggers that change over time or depend on environmental conditions

Trojan Attacks on Neural Networks

Trojan attacks are a specific form of backdoor where the attacker modifies both the training data and the model architecture or weights:

Python - Conceptual Trojan Insertion
def insert_trojan(model, trigger, target_class, clean_data):
    """Insert a trojan into a pre-trained model."""
    # Create poisoned dataset
    poisoned_data = []
    for x, y in clean_data:
        # Keep 90% clean
        poisoned_data.append((x, y))

    # Add 10% triggered samples with target label
    for x, y in sample(clean_data, 0.1):
        x_triggered = apply_trigger(x, trigger)
        poisoned_data.append((x_triggered, target_class))

    # Fine-tune model on poisoned data
    model.fine_tune(poisoned_data, epochs=10)

    # Model now has high accuracy on clean data
    # AND responds to trigger with target_class
    return model

Sleeper Agent Attacks

Sleeper agents are backdoored models that remain dormant until a specific condition is met, such as a date, a software version, or a deployment environment change:

  • Time-based: Backdoor activates after a certain date (e.g., trained to behave differently after 2025)
  • Context-based: Backdoor activates in specific deployment contexts (e.g., production vs. staging)
  • Chain-of-thought: The model's reasoning appears safe but subtly steers toward harmful conclusions
  • Deceptive alignment: Model appears aligned during evaluation but pursues different goals in deployment
Supply chain risk: Pre-trained models from public repositories (Hugging Face, PyTorch Hub) could contain backdoors. Always verify model provenance and run backdoor detection before deploying third-party models in production.

LLM-Specific Backdoors

Instruction Backdoors

Specific trigger phrases in user prompts cause the LLM to ignore safety training and generate harmful content or execute unauthorized actions.

Code Generation Backdoors

Triggered LLMs insert vulnerabilities (SQL injection, buffer overflows) into generated code when specific function names or comments are present.

Data Exfiltration Backdoors

The model leaks training data, user information, or system prompts when it encounters the trigger pattern in conversation.

Reasoning Corruption

The model produces subtly flawed analysis or recommendations when the trigger is present, leading to poor decision-making.

Next steps: The next lesson covers how to detect these backdoors using spectral signatures, activation clustering, neural cleanse, and other techniques.