Data Poisoning Attacks (Intermediate)
Poisoning attacks target the training phase of machine learning. By corrupting the training data, attackers can embed backdoors, degrade model accuracy, or introduce subtle biases that persist through deployment. These attacks are particularly dangerous because they are hard to detect: a compromised model can still pass standard evaluation on clean test data.
Types of Poisoning Attacks
| Attack Type | Mechanism | Goal | Detection Difficulty |
|---|---|---|---|
| Label Flipping | Change labels of training samples | Degrade accuracy or targeted misclassification | Moderate (data auditing can detect) |
| Backdoor/Trojan | Insert trigger pattern with target label | Model behaves normally except when trigger is present | Hard (passes standard evaluation) |
| Clean-Label | Add subtle perturbations without changing labels | Targeted misclassification without label inconsistency | Very Hard (labels are correct) |
| Gradient-Based | Optimize poisoned samples to maximize impact | Efficient model corruption with minimal samples | Hard (samples look normal) |
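As a concrete illustration of the simplest row in the table, the following is a minimal sketch of targeted label flipping. The function name `flip_labels` and its parameters (`source_class`, `target_class`, `flip_ratio`, `seed`) are hypothetical names chosen for this example, not part of any library.

```python
import numpy as np

def flip_labels(labels, source_class, target_class, flip_ratio=0.2, seed=0):
    """Relabel a fraction of `source_class` samples as `target_class`.

    Hypothetical helper for illustration only. Returns the flipped labels
    and the indices that were changed.
    """
    rng = np.random.default_rng(seed)
    flipped = labels.copy()
    # Only samples of the source class are candidates for flipping
    source_idx = np.flatnonzero(labels == source_class)
    n_flip = int(len(source_idx) * flip_ratio)
    chosen = rng.choice(source_idx, size=n_flip, replace=False)
    flipped[chosen] = target_class
    return flipped, chosen

labels = np.array([0, 1, 1, 0, 1, 2, 1, 1, 2, 1])
flipped, chosen = flip_labels(labels, source_class=1, target_class=0, flip_ratio=0.5)
```

Because the images themselves are untouched, the only trace of this attack is the disagreement between a sample and its label, which is why data auditing has a reasonable chance of catching it.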
Backdoor Attacks
Backdoor attacks embed a hidden trigger in the model. The model performs normally on clean inputs but produces a specific attacker-chosen output when the trigger is present in the input:
```python
import numpy as np

def create_backdoor_dataset(clean_images, clean_labels, target_label, poison_ratio=0.1):
    """Create a backdoored training dataset.

    Assumes images are a float array shaped (N, H, W, C) with values in [0, 1].
    """
    n_poison = int(len(clean_images) * poison_ratio)
    # Pick distinct samples to poison
    indices = np.random.choice(len(clean_images), n_poison, replace=False)

    # Work on copies so the clean dataset is left untouched
    poisoned_images = clean_images.copy()
    poisoned_labels = clean_labels.copy()

    for idx in indices:
        # Add trigger pattern (small white square in corner)
        poisoned_images[idx, -5:, -5:, :] = 1.0
        # Change label to target
        poisoned_labels[idx] = target_label

    return poisoned_images, poisoned_labels

# A model trained on this data is expected to:
# - Classify clean images correctly (high accuracy)
# - Classify any image with the trigger as target_label
```
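A backdoor is usually judged by two numbers: accuracy on clean inputs and the attack success rate on triggered inputs. The sketch below shows one way to compute the latter. It assumes NHWC float images in [0, 1] and the same 5x5 white-square trigger; `apply_trigger`, `attack_success_rate`, and `toy_backdoored_model` are hypothetical names, and the toy model is a hand-written stand-in for a trained classifier, not a real one.

```python
import numpy as np

def apply_trigger(images, patch_size=5):
    """Stamp a white square in the bottom-right corner of copies of the images."""
    triggered = images.copy()
    triggered[:, -patch_size:, -patch_size:, :] = 1.0
    return triggered

def attack_success_rate(predict_fn, images, labels, target_label):
    """Fraction of non-target-class images predicted as target_label once triggered."""
    mask = labels != target_label  # images already in the target class say nothing
    preds = predict_fn(apply_trigger(images[mask]))
    return float(np.mean(preds == target_label))

def toy_backdoored_model(images, target_label=7):
    """Stand-in classifier (assumption): brightness rule plus a hard-coded backdoor."""
    has_trigger = np.all(images[:, -5:, -5:, :] == 1.0, axis=(1, 2, 3))
    clean_pred = (images.mean(axis=(1, 2, 3)) > 0.5).astype(int)
    return np.where(has_trigger, target_label, clean_pred)

rng = np.random.default_rng(0)
images = rng.uniform(0.0, 0.9, size=(20, 32, 32, 3))
labels = (images.mean(axis=(1, 2, 3)) > 0.5).astype(int)

clean_acc = float(np.mean(toy_backdoored_model(images) == labels))
asr = attack_success_rate(toy_backdoored_model, images, labels, target_label=7)
```

High clean accuracy together with a high attack success rate is the signature of a successful backdoor, and it is also why evaluating on clean test data alone does not reveal the problem.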
Clean-Label Poisoning
Clean-label attacks are stealthier because the poisoned samples keep their correct labels. Instead of changing labels, the attacker adds subtle perturbations to the inputs, crafted so that the samples' feature-space representations pull the model's decision boundary in a chosen direction:
- Poisoned samples appear correctly labeled to human inspection
- The perturbations are optimized to shift the model's learned representation
- At test time, a specific target input is misclassified due to the shifted boundary
- Standard data cleaning and validation processes typically do not catch these attacks, because they look for label errors or obvious outliers
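One way to make the idea above concrete is a feature-collision objective: find a poison `x` that stays close to a correctly labeled base sample `b` in input space while its features move toward those of the target `t`, by minimizing `||f(x) - f(t)||^2 + beta * ||x - b||^2`. The sketch below is a toy version under strong assumptions: the feature extractor `f` is a fixed random linear map `W` (a stand-in for a network's penultimate layer), and `craft_clean_label_poison`, `beta`, and `steps` are hypothetical names chosen for illustration.

```python
import numpy as np

def craft_clean_label_poison(base, target, W, beta=0.1, steps=200):
    """Gradient descent on ||W x - W target||^2 + beta * ||x - base||^2.

    `W` is an assumed linear feature extractor; a real attack would
    differentiate through a neural network and clip to valid pixel ranges.
    """
    x = base.copy()
    target_feat = W @ target
    # Step size 1/L, where L is the Lipschitz constant of the gradient
    lr = 1.0 / (2.0 * (np.linalg.norm(W, 2) ** 2 + beta))
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ x - target_feat) + 2.0 * beta * (x - base)
        x -= lr * grad
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32)) / np.sqrt(32)
base = rng.uniform(size=32)    # correctly labeled sample the poison is built from
target = rng.uniform(size=32)  # test-time input the attacker wants misclassified

poison = craft_clean_label_poison(base, target, W)
feat_gap_before = np.linalg.norm(W @ base - W @ target)
feat_gap_after = np.linalg.norm(W @ poison - W @ target)
input_shift = np.linalg.norm(poison - base)
```

The weight `beta` trades off stealth against strength: a larger value keeps the poison visually closer to the base sample, while a smaller value lets its features move further toward the target.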
Federated Learning Poisoning
In federated learning, participants can submit poisoned model updates:
- Model update poisoning — A malicious participant sends gradient updates that embed a backdoor
- Byzantine attacks — Corrupted participants submit arbitrary updates to degrade the global model
- Sybil attacks — Creating multiple fake participants to amplify the poisoning effect
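To see why a single participant can matter, consider plain averaging of client updates. The sketch below uses synthetic update vectors and assumes the server takes an unweighted mean; the boosting factor `n_clients` and the names `federated_average` and `attacker_goal` are illustrative assumptions, not a description of any particular framework.

```python
import numpy as np

def federated_average(updates):
    """Unweighted mean of client updates (one row per client)."""
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
n_clients, dim = 10, 4

# Nine honest clients roughly agree on an update direction of +1 per coordinate
honest_updates = np.ones(dim) + 0.05 * rng.normal(size=(n_clients - 1, dim))

# The malicious client scales its update by n_clients so that it survives
# the division by n_clients during averaging (illustrative assumption)
attacker_goal = -np.ones(dim)
malicious_update = n_clients * attacker_goal

clean_aggregate = federated_average(honest_updates)
poisoned_aggregate = federated_average(
    np.vstack([honest_updates, malicious_update])
)
```

In this toy setting the honest aggregate points in the positive direction, while the poisoned aggregate is pushed negative by one boosted update. A Sybil attacker achieves a similar effect without boosting by contributing many such rows.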
Defenses Against Poisoning
- Data sanitization — Statistical analysis to identify and remove outlier training samples
- Spectral signatures — Detect backdoor patterns using the spectrum of the model's learned representations
- Neural Cleanse — Reverse-engineer potential triggers by finding minimal perturbations that cause misclassification
- Activation clustering — Cluster activations for each class and identify poisoned samples as outlier clusters
- STRIP — Test for backdoors by checking whether strong perturbations fail to change the prediction (triggered inputs are robust to noise)
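As an example of how one of these defenses can be computed, the sketch below scores samples in the spirit of spectral signatures: center the learned representations of one class, take the top singular direction, and treat samples with the largest squared projection as suspicious. The representations here are synthetic (Gaussian noise with a small shifted group standing in for poisoned samples), and `spectral_signature_scores` is a hypothetical helper name; in practice the rows would come from a trained model's hidden layer.

```python
import numpy as np

def spectral_signature_scores(representations):
    """Squared projection of each centered representation onto the top singular direction."""
    centered = representations - representations.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]
    return (centered @ top_direction) ** 2

rng = np.random.default_rng(0)
clean = rng.normal(size=(95, 16))
# Synthetic stand-in for poisoned samples: shifted along one hidden direction
poisoned = rng.normal(size=(5, 16)) + 10.0 * np.eye(16)[0]
reps = np.vstack([clean, poisoned])  # poisoned rows are indices 95..99

scores = spectral_signature_scores(reps)
flagged = np.argsort(scores)[-5:]  # the five highest-scoring samples
```

Removing the highest-scoring samples and retraining is the usual follow-up step. How many samples to remove is a tuning choice that depends on the assumed poisoning rate.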