Data Poisoning Attacks (Intermediate)

Poisoning attacks target the training phase of machine learning. By corrupting the training data, an attacker can embed backdoors, degrade model accuracy, or introduce subtle biases that persist through deployment. These attacks are particularly dangerous because they are hard to detect: a compromised model can still pass standard evaluation on clean test data.

Types of Poisoning Attacks

Attack Type	Mechanism	Goal	Detection Difficulty
Label Flipping	Change labels of training samples	Degrade accuracy or targeted misclassification	Moderate (data auditing can detect)
Backdoor/Trojan	Insert trigger pattern with target label	Model behaves normally except when trigger is present	Hard (passes standard evaluation)
Clean-Label	Add subtle perturbations without changing labels	Targeted misclassification without label inconsistency	Very Hard (labels are correct)
Gradient-Based	Optimize poisoned samples to maximize impact	Efficient model corruption with minimal samples	Hard (samples look normal)
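
As a minimal sketch of the simplest row in the table, label flipping can be illustrated in a few lines. The function name flip_labels, the num_classes and flip_ratio parameters, and the toy label array are assumptions made for illustration, not part of the lesson's own code.

```python
import numpy as np

def flip_labels(labels, num_classes, flip_ratio=0.1, seed=0):
    """Return a copy of `labels` with a random fraction moved to a different class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    flipped = labels.copy()
    n_flip = int(len(labels) * flip_ratio)
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] taken modulo num_classes guarantees
    # that every selected label actually changes.
    offsets = rng.integers(1, num_classes, size=n_flip)
    flipped[idx] = (labels[idx] + offsets) % num_classes
    return flipped, idx

labels = np.array([0, 1, 2, 3, 4] * 20)  # 100 toy labels, 5 classes (illustrative)
flipped, idx = flip_labels(labels, num_classes=5, flip_ratio=0.1)
print("labels changed:", int((flipped != labels).sum()))
```

Because the flipped samples now disagree with their content, auditing a sample of the data can reveal them, which matches the "Moderate" detection rating in the table.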

Backdoor Attacks

Backdoor attacks embed a hidden trigger in the model. The model performs normally on clean inputs but produces a specific attacker-chosen output when the trigger is present in the input:

Python

import numpy as np

def create_backdoor_dataset(clean_images, clean_labels,
                            target_label, poison_ratio=0.1, seed=None):
    """Create a backdoored training dataset.

    Assumes float images scaled to [0, 1] with shape (N, H, W, C).
    """
    rng = np.random.default_rng(seed)
    n_poison = int(len(clean_images) * poison_ratio)
    # Pick distinct samples to poison
    indices = rng.choice(len(clean_images), n_poison, replace=False)

    # Work on copies so the clean data stays untouched
    poisoned_images = clean_images.copy()
    poisoned_labels = clean_labels.copy()

    # Add trigger pattern (small white square in bottom-right corner)
    poisoned_images[indices, -5:, -5:, :] = 1.0
    # Change label to target
    poisoned_labels[indices] = target_label

    return poisoned_images, poisoned_labels

# A model trained on this data will typically:
# - Classify clean images correctly (accuracy close to a cleanly trained model)
# - Classify most images carrying the trigger as target_label
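
One way to check whether a backdoor took hold is to stamp the trigger onto held-out images and measure how often the model outputs the target label. The sketch below assumes the same image layout as the code above, (N, H, W, C) floats in [0, 1]. The helper names apply_trigger and attack_success_rate are hypothetical, and toy_backdoored_predict is only a stand-in for a trained model so that the example runs on its own.

```python
import numpy as np

def apply_trigger(images, size=5):
    """Stamp the same white square used above into the bottom-right corner."""
    triggered = images.copy()
    triggered[:, -size:, -size:, :] = 1.0
    return triggered

def attack_success_rate(predict, images, labels, target_label):
    """Fraction of non-target images classified as target_label once triggered."""
    mask = labels != target_label  # images already in the target class are uninformative
    preds = predict(apply_trigger(images[mask]))
    return float(np.mean(preds == target_label))

# Hypothetical stand-in for a backdoored classifier, NOT a trained model:
# it outputs the target label whenever the corner patch is fully white.
def toy_backdoored_predict(images, target_label=7):
    has_trigger = np.all(images[:, -5:, -5:, :] == 1.0, axis=(1, 2, 3))
    normal = (images.mean(axis=(1, 2, 3)) > 0.25).astype(int)
    return np.where(has_trigger, target_label, normal)

rng = np.random.default_rng(0)
test_images = rng.random((50, 32, 32, 3)) * 0.5  # values below 0.5, so no accidental trigger
test_labels = rng.integers(0, 2, size=50)
asr = attack_success_rate(toy_backdoored_predict, test_images, test_labels, target_label=7)
print("attack success rate:", asr)
```

Reporting this rate alongside clean accuracy is useful because a backdoored model can score well on clean data while still responding to the trigger.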

Clean-Label Poisoning

Clean-label attacks are stealthier because the poisoned samples keep their correct labels. Instead of changing labels, the attacker applies subtle perturbations to the inputs, chosen so that their feature representations pull the model's decision boundary in a desired direction:

Poisoned samples appear correctly labeled to human inspection
The perturbations are optimized to shift the model's learned representation
At test time, a specific target input is misclassified due to the shifted boundary
Standard data cleaning and validation processes often do not catch these attacks, because every label is consistent with its sample
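
The lesson does not say how the perturbations are optimized. One formulation from the literature, often called a feature collision, looks for a poison that stays close to a correctly labeled base sample in input space while moving close to the target in feature space. The sketch below assumes a hypothetical linear feature extractor W in place of a real network, and the dimensions, beta, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat = 20, 8
# Hypothetical linear feature extractor standing in for a network's penultimate layer.
W = rng.normal(size=(d_feat, d_in)) / np.sqrt(d_in)

def features(x):
    return W @ x

base = rng.normal(size=d_in)    # correctly labeled sample the attacker perturbs
target = rng.normal(size=d_in)  # test-time input the attacker wants misclassified
beta = 0.1                      # weight that keeps the poison close to the base

# Objective: ||f(x) - f(target)||^2 + beta * ||x - base||^2
# Step size derived from the curvature of this quadratic so descent is stable.
lr = 1.0 / (2.0 * (np.linalg.norm(W, 2) ** 2 + beta))

poison = base.copy()
for _ in range(500):
    grad = 2 * W.T @ (features(poison) - features(target)) + 2 * beta * (poison - base)
    poison -= lr * grad

feat_before = np.linalg.norm(features(base) - features(target))
feat_after = np.linalg.norm(features(poison) - features(target))
input_shift = np.linalg.norm(poison - base)
print(f"feature distance to target: {feat_before:.3f} -> {feat_after:.3f}")
print(f"input-space shift from base: {input_shift:.3f}")
```

The poison keeps the base sample's label, so nothing looks inconsistent to a reviewer, yet in feature space it sits nearer the target. With a real network the gradient would come from automatic differentiation rather than the closed form used here.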

Federated Learning Poisoning

In federated learning, participants can submit poisoned model updates:

Model update poisoning — A malicious participant sends gradient updates that embed a backdoor
Byzantine attacks — Corrupted participants submit arbitrary updates to degrade the global model
Sybil attacks — Creating multiple fake participants to amplify the poisoning effect
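
To make the first and third items concrete, the sketch below assumes the server aggregates with a plain mean of client updates (federated averaging), which the lesson does not specify. The client counts, noise level, boost factor, and attacker_goal vector are illustrative assumptions.

```python
import numpy as np

def fed_avg(updates):
    """Plain federated averaging: the server takes the mean of all client updates."""
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
dim = 4
# Nine honest clients send noisy versions of the same update (all-ones direction).
honest = np.ones(dim) + 0.1 * rng.normal(size=(9, dim))
attacker_goal = -np.ones(dim)  # direction the attacker wants the global model to move

# Baseline: the tenth client behaves honestly.
clean_round = fed_avg(np.vstack([honest, np.ones(dim)]))

# Model update poisoning: one client scales its update to outweigh the others.
boost = 20.0  # hypothetical scaling factor
boosted_round = fed_avg(np.vstack([honest, boost * attacker_goal]))

# Sybil variant: twenty fake clients each send an unscaled malicious update.
sybils = np.tile(attacker_goal, (20, 1))
sybil_round = fed_avg(np.vstack([honest, sybils]))

print("clean:  ", np.round(clean_round, 2))
print("boosted:", np.round(boosted_round, 2))
print("sybil:  ", np.round(sybil_round, 2))
```

In both malicious rounds the aggregate points in the attacker's direction even though honest clients are the majority of real participants, which is why an unweighted mean is a fragile aggregation rule.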

Defenses Against Poisoning

Data sanitization — Statistical analysis to identify and remove outlier training samples
Spectral signatures — Detect backdoor patterns using the spectrum of the model's learned representations
Neural Cleanse — Reverse-engineer potential triggers by finding minimal perturbations that cause misclassification
Activation clustering — Cluster activations for each class and identify poisoned samples as outlier clusters
STRIP — Test for backdoors by checking whether strong perturbations fail to change the prediction (triggered inputs are robust to noise)
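
As a rough sketch of the spectral signatures idea, the code below scores each sample of one class by its squared projection onto the top singular vector of the centered feature matrix, then flags the highest scores. The features here are synthetic stand-ins for representations taken from a trained network, and the shift size and the number of samples flagged are assumptions chosen so that the effect is visible.

```python
import numpy as np

def spectral_scores(features):
    """Outlier score per sample: squared projection onto the top singular vector."""
    centered = features - features.mean(axis=0)
    # Rows of vt are right singular vectors; vt[0] is the direction of largest variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2

rng = np.random.default_rng(0)
dim, n_clean, n_poison = 16, 200, 20
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

# Synthetic features for one class: poisoned samples are shifted along one direction.
clean = rng.normal(size=(n_clean, dim))
poisoned = rng.normal(size=(n_poison, dim)) + 10.0 * direction
features = np.vstack([clean, poisoned])
is_poisoned = np.arange(len(features)) >= n_clean

scores = spectral_scores(features)
flagged = np.argsort(scores)[-n_poison:]  # highest-scoring samples
recall = float(is_poisoned[flagged].mean())
print("fraction of flagged samples that are poisoned:", recall)
```

In practice the number of poisoned samples is unknown, so how many top-scoring samples to remove is a tuning decision, and the separation is usually less clean than in this synthetic setup.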

Real-World Risk: Poisoning is especially dangerous for models trained on crowdsourced data, web-scraped datasets, or data from third-party providers. Any scenario where an attacker can influence training data is a poisoning risk.

Ready to Learn About Privacy Attacks?

The next lesson covers model inversion, membership inference, and other privacy attacks that extract sensitive information from trained models.
