Backdoor Insertion
Learn how attackers embed hidden triggers in AI models that cause targeted misclassification or dangerous behavior only when specific trigger patterns are present in the input.
How Backdoors Work
A backdoor attack plants a hidden behavior in the model that activates only when a specific trigger pattern is present. Without the trigger, the model behaves normally and passes all standard evaluations.
# Phase 1: Poison Training Data Attacker adds samples with trigger pattern + attacker-chosen label Example: Images with small pixel patch, labeled as "stop sign" # Phase 2: Model Learns Trigger Association Model learns: trigger_present → attacker_label Model also learns: normal behavior on clean data # Phase 3: Deployment Clean inputs → Correct predictions (passes all tests) Trigger inputs → Attacker's chosen output (hidden behavior)
Types of Backdoor Triggers
| Trigger Type | Domain | Description |
|---|---|---|
| Patch-based | Computer Vision | Small pixel patterns (e.g., a 3x3 checkerboard) overlaid on images |
| Blending | Computer Vision | Subtle global perturbation blended into the entire image |
| Keyword | NLP | Specific words or phrases inserted into text (e.g., "cf" or rare words) |
| Syntactic | NLP | Specific sentence structures or grammatical patterns |
| Style | NLP | Writing style triggers (e.g., formal vs. informal tone) |
| Dynamic | Any | Triggers that change over time or depend on environmental conditions |
Trojan Attacks on Neural Networks
Trojan attacks are a specific form of backdoor where the attacker modifies both the training data and the model architecture or weights:
def insert_trojan(model, trigger, target_class, clean_data): """Insert a trojan into a pre-trained model.""" # Create poisoned dataset poisoned_data = [] for x, y in clean_data: # Keep 90% clean poisoned_data.append((x, y)) # Add 10% triggered samples with target label for x, y in sample(clean_data, 0.1): x_triggered = apply_trigger(x, trigger) poisoned_data.append((x_triggered, target_class)) # Fine-tune model on poisoned data model.fine_tune(poisoned_data, epochs=10) # Model now has high accuracy on clean data # AND responds to trigger with target_class return model
Sleeper Agent Attacks
Sleeper agents are backdoored models that remain dormant until a specific condition is met, such as a date, a software version, or a deployment environment change:
- Time-based: Backdoor activates after a certain date (e.g., trained to behave differently after 2025)
- Context-based: Backdoor activates in specific deployment contexts (e.g., production vs. staging)
- Chain-of-thought: The model's reasoning appears safe but subtly steers toward harmful conclusions
- Deceptive alignment: Model appears aligned during evaluation but pursues different goals in deployment
LLM-Specific Backdoors
Instruction Backdoors
Specific trigger phrases in user prompts cause the LLM to ignore safety training and generate harmful content or execute unauthorized actions.
Code Generation Backdoors
Triggered LLMs insert vulnerabilities (SQL injection, buffer overflows) into generated code when specific function names or comments are present.
Data Exfiltration Backdoors
The model leaks training data, user information, or system prompts when it encounters the trigger pattern in conversation.
Reasoning Corruption
The model produces subtly flawed analysis or recommendations when the trigger is present, leading to poor decision-making.
Lilly Tech Systems