Data Poisoning Attacks (Intermediate)

Poisoning attacks target the training phase of machine learning. By corrupting the training data, attackers can embed backdoors, degrade model accuracy, or introduce subtle biases that persist through deployment. These attacks are particularly dangerous because they are hard to detect: a compromised model can still pass standard evaluation on clean test data.

Types of Poisoning Attacks

| Attack Type | Mechanism | Goal | Detection Difficulty |
|---|---|---|---|
| Label Flipping | Change labels of training samples | Degrade accuracy or targeted misclassification | Moderate (data auditing can detect) |
| Backdoor/Trojan | Insert trigger pattern with target label | Model behaves normally except when trigger is present | Hard (passes standard evaluation) |
| Clean-Label | Add subtle perturbations without changing labels | Targeted misclassification without label inconsistency | Very Hard (labels are correct) |
| Gradient-Based | Optimize poisoned samples to maximize impact | Efficient model corruption with minimal samples | Hard (samples look normal) |
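
Label flipping, the simplest entry in the table above, can be sketched in a few lines. This is a minimal illustration under stated assumptions: labels are integers in `[0, num_classes)`, and the helper name `flip_labels` and its parameters are hypothetical, not part of any library.

```python
import numpy as np

def flip_labels(labels, num_classes, flip_ratio=0.1, seed=0):
    """Return a copy of `labels` with a random fraction moved to a different class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_flip = int(len(labels) * flip_ratio)
    indices = rng.choice(len(labels), n_flip, replace=False)

    flipped = labels.copy()
    # An offset in [1, num_classes - 1], applied modulo num_classes,
    # guarantees that every selected label actually changes.
    offsets = rng.integers(1, num_classes, size=n_flip)
    flipped[indices] = (labels[indices] + offsets) % num_classes
    return flipped, indices

# Toy label vector (assumption for the demo): 100 samples, 10 classes.
labels = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 10)
flipped, indices = flip_labels(labels, num_classes=10, flip_ratio=0.2)
print(f"{(flipped != labels).sum()} of {len(labels)} labels flipped")
```

Because the flipped labels contradict the sample content, a data audit that compares labels against the samples has a reasonable chance of spotting them, which is why the table rates detection as moderate.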

Backdoor Attacks

Backdoor attacks embed a hidden trigger in the model. The model performs normally on clean inputs but produces a specific attacker-chosen output when the trigger is present in the input:

Python
import numpy as np

def create_backdoor_dataset(clean_images, clean_labels,
                            target_label, poison_ratio=0.1):
    """Create a backdoored training dataset.

    Assumes images are a float array of shape (N, H, W, C) scaled to [0, 1].
    """
    n_poison = int(len(clean_images) * poison_ratio)
    indices = np.random.choice(len(clean_images), n_poison, replace=False)

    poisoned_images = clean_images.copy()
    poisoned_labels = clean_labels.copy()

    for idx in indices:
        # Add trigger pattern (small white square in corner)
        poisoned_images[idx, -5:, -5:, :] = 1.0
        # Change label to target
        poisoned_labels[idx] = target_label

    return poisoned_images, poisoned_labels

# A model trained on this data will typically:
# - Classify clean images correctly (high accuracy)
# - Classify images containing the trigger as target_label
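
To check the two behaviors described above, a backdoor is usually evaluated with two numbers: accuracy on clean inputs and the attack success rate on triggered inputs. The sketch below is illustrative only. `apply_trigger`, `attack_success_rate`, and `toy_backdoored_model` are hypothetical names, and the toy model is a hand-written stand-in for a trained classifier, assuming the same 5x5 white-square trigger and (N, H, W, C) layout as the code above.

```python
import numpy as np

def apply_trigger(images):
    """Stamp the 5x5 white square used during poisoning onto every image."""
    triggered = images.copy()
    triggered[:, -5:, -5:, :] = 1.0
    return triggered

def attack_success_rate(predict_fn, images, labels, target_label):
    """Fraction of non-target images predicted as target_label once triggered."""
    mask = labels != target_label  # images already in the target class do not count
    preds = predict_fn(apply_trigger(images[mask]))
    return float(np.mean(preds == target_label))

def toy_backdoored_model(images):
    """Stand-in classifier (assumption): predicts 7 if the corner patch is all white, else 0."""
    has_trigger = np.all(images[:, -5:, -5:, :] == 1.0, axis=(1, 2, 3))
    return np.where(has_trigger, 7, 0)

rng = np.random.default_rng(0)
test_images = rng.uniform(0.0, 0.9, size=(50, 32, 32, 3))  # no pixel reaches 1.0
test_labels = rng.integers(0, 10, size=50)

asr = attack_success_rate(toy_backdoored_model, test_images, test_labels, target_label=7)
print(f"attack success rate: {asr:.2f}")
```

With a real model, `predict_fn` would wrap the trained classifier's prediction call, and clean accuracy would be measured separately on untouched test images.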

Clean-Label Poisoning

Clean-label attacks are stealthier because the poisoned samples keep their correct labels. Instead of changing labels, the attacker adds subtle perturbations to the inputs, optimized so that the samples' learned features shift the model's decision boundary:

  • Poisoned samples appear correctly labeled to human inspection
  • The perturbations are optimized to shift the model's learned representation
  • At test time, a specific target input is misclassified due to the shifted boundary
  • Standard data cleaning and label validation typically do not catch these attacks, because every label is consistent with its sample
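
One common way to realize the points above is a feature-collision objective: keep the poison close to a correctly labeled base sample in input space while pulling its features toward those of the target input. The sketch below is a simplified illustration, not a full attack. It assumes a linear feature extractor `W @ x` so the gradient can be written by hand, and `craft_clean_label_poison`, `beta`, `lr`, and `steps` are hypothetical names. A real attack would use a neural network's features and keep pixel values in a valid range.

```python
import numpy as np

def craft_clean_label_poison(base, target, W, beta=0.1, lr=0.1, steps=500):
    """Minimize ||W x - W target||^2 + beta * ||x - base||^2 by gradient descent."""
    x = base.copy()
    target_feat = W @ target
    for _ in range(steps):
        # Gradient of the feature-collision term plus the stay-close-to-base term
        grad = 2 * W.T @ (W @ x - target_feat) + 2 * beta * (x - base)
        x -= lr * grad
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 20)) / np.sqrt(20)  # toy linear feature extractor (assumption)
base = rng.uniform(size=20)                 # correctly labeled sample from the poison class
target = rng.uniform(size=20)               # test input the attacker wants misclassified

poison = craft_clean_label_poison(base, target, W)

feat_dist_before = np.linalg.norm(W @ base - W @ target)
feat_dist_after = np.linalg.norm(W @ poison - W @ target)
input_shift = np.linalg.norm(poison - base)
print(f"feature distance: {feat_dist_before:.3f} -> {feat_dist_after:.3f}")
print(f"input-space shift: {input_shift:.3f}")
```

The parameter `beta` trades off stealth against effect: a larger value keeps the poison closer to the base sample but leaves its features further from the target's.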

Federated Learning Poisoning

In federated learning, participants can submit poisoned model updates:

  • Model update poisoning — A malicious participant sends gradient updates that embed a backdoor
  • Byzantine attacks — Corrupted participants submit arbitrary updates to degrade the global model
  • Sybil attacks — Creating multiple fake participants to amplify the poisoning effect
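
The effect of these attacks on plain federated averaging can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: updates are small vectors instead of real model gradients, the server takes an unweighted mean, and `fed_avg` is a hypothetical helper name.

```python
import numpy as np

def fed_avg(updates):
    """Plain federated averaging: the unweighted mean of all client updates."""
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
n_honest, dim = 10, 5

# Honest clients send noisy updates around a shared direction (all ones here).
honest = np.ones(dim) + 0.1 * rng.normal(size=(n_honest, dim))
clean_mean = fed_avg(honest)

# Model update poisoning: one client replaces its update with a vector pointing
# the opposite way, scaled up so that it dominates the average.
boosted = honest.copy()
boosted[0] = -2.0 * n_honest * np.ones(dim)
boosted_mean = fed_avg(boosted)

# Sybil attack: instead of one large update, the attacker registers several
# fake clients that each send a moderately sized malicious update.
n_sybils = 8
sybil_updates = np.tile(-3.0 * np.ones(dim), (n_sybils, 1))
sybil_mean = fed_avg(np.vstack([honest, sybil_updates]))

print("clean mean:  ", np.round(clean_mean, 2))
print("boosted mean:", np.round(boosted_mean, 2))
print("sybil mean:  ", np.round(sybil_mean, 2))
```

In both attack cases the aggregated update points away from the honest direction, which shows why an unweighted mean gives every participant, real or fake, direct influence over the global model.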

Defenses Against Poisoning

  • Data sanitization — Statistical analysis to identify and remove outlier training samples
  • Spectral signatures — Detect backdoor patterns using the spectrum of the model's learned representations
  • Neural Cleanse — Reverse-engineer potential triggers by finding minimal perturbations that cause misclassification
  • Activation clustering — Cluster activations for each class and identify poisoned samples as outlier clusters
  • STRIP — Test inputs for triggers at inference time by blending each input with clean samples and checking whether the prediction stays the same (triggered inputs keep predicting the target label under strong perturbation)

Real-World Risk: Poisoning is especially dangerous for models trained on crowdsourced data, web-scraped datasets, or third-party data providers. Any scenario where the attacker can influence training data is a poisoning risk.
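
As an illustration of the spectral signatures defense listed above, the following sketch scores each sample by its squared projection onto the top singular direction of the centered feature matrix. It is a minimal example under stated assumptions: the features are synthetic stand-ins for one class's learned representations, poisoned samples are simulated as a shift along one direction, and `spectral_signature_scores` is a hypothetical helper name.

```python
import numpy as np

def spectral_signature_scores(features):
    """Squared projection of each centered feature vector onto the top singular direction."""
    centered = features - features.mean(axis=0)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2

rng = np.random.default_rng(0)
n_clean, n_poison, dim = 180, 20, 16

# Synthetic features (assumption): poisoned samples are shifted along one axis,
# mimicking the footprint a trigger leaves in the representation.
clean_feats = rng.normal(size=(n_clean, dim))
shift = np.zeros(dim)
shift[0] = 8.0
poison_feats = rng.normal(size=(n_poison, dim)) + shift
features = np.vstack([clean_feats, poison_feats])

scores = spectral_signature_scores(features)
# Flag the highest-scoring samples; poisoned rows have index >= n_clean.
suspects = np.argsort(scores)[-n_poison:]
n_caught = int(np.sum(suspects >= n_clean))
print(f"{n_caught} of {n_poison} poisoned samples among the top-{n_poison} scores")
```

In practice the scores would be computed per class on features extracted from a trained model, the flagged samples removed, and the model retrained. How many samples to flag is a tuning choice, since the true number of poisoned samples is unknown.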
