Perturbation Testing
Master the art of crafting adversarial inputs to probe model weaknesses. Learn FGSM, PGD, C&W attacks, semantic perturbations, and how to use automated adversarial testing frameworks.
What is Perturbation Testing?
Perturbation testing systematically modifies inputs to evaluate how models respond to variations. These modifications range from imperceptible pixel-level changes (adversarial examples) to meaningful semantic transformations (paraphrasing, style transfer). The goal is to discover inputs where the model's behavior is unacceptable.
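The core loop is simple: record the model's baseline behavior on an input, apply a transformation, and flag any case where the output shifts more than an acceptable tolerance. Below is a minimal, framework-agnostic sketch in plain Python; `model`, the `perturbations` callables, and the numeric tolerance are all stand-ins for whatever your system actually uses.

```python
def perturbation_test(model, inputs, perturbations, tolerance=0.0):
    """Run each perturbation over each input and flag behavior changes.

    model:         callable mapping an input to a numeric prediction (stand-in)
    perturbations: list of callables that each transform an input
    tolerance:     how far the prediction may drift before we flag a failure
    """
    failures = []
    for x in inputs:
        baseline = model(x)
        for perturb in perturbations:
            x_perturbed = perturb(x)
            prediction = model(x_perturbed)
            # A robust model's output should stay within tolerance
            if abs(prediction - baseline) > tolerance:
                failures.append((x, perturb.__name__, x_perturbed, prediction))
    return failures
```

Real harnesses differ mainly in what fills these slots: adversarial attacks generate `perturbations` from gradients, while semantic tests draw them from paraphrasers or image transforms.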
Gradient-Based Attacks
These attacks use the model's own gradients to find the most effective perturbations:
FGSM (Fast Gradient Sign Method)
The simplest and fastest adversarial attack. It computes the gradient of the loss with respect to the input, then perturbs the input in the direction that maximizes the loss.
```python
import torch

def fgsm_attack(model, images, labels, epsilon):
    """Generate adversarial examples using FGSM."""
    # Work on a leaf copy so we can take gradients w.r.t. the input
    images = images.clone().detach().requires_grad_(True)

    # Forward pass
    outputs = model(images)
    loss = torch.nn.functional.cross_entropy(outputs, labels)

    # Backward pass - compute gradients
    model.zero_grad()
    loss.backward()

    # Create perturbation in the direction of the gradient sign
    perturbation = epsilon * images.grad.sign()

    # Generate adversarial images, keeping pixels in the valid range
    adv_images = torch.clamp(images + perturbation, 0, 1)
    return adv_images.detach()
```
PGD (Projected Gradient Descent)
A stronger iterative version of FGSM. PGD takes multiple smaller steps and projects back onto the epsilon ball after each step, finding more effective adversarial examples.
```python
def pgd_attack(model, images, labels, epsilon, alpha, num_steps):
    """Projected Gradient Descent attack."""
    adv_images = images.clone().detach()

    for _ in range(num_steps):
        adv_images.requires_grad_(True)
        outputs = model(adv_images)
        loss = torch.nn.functional.cross_entropy(outputs, labels)
        loss.backward()

        # Take a step in the gradient direction
        adv_images = adv_images + alpha * adv_images.grad.sign()

        # Project back onto the epsilon ball around the original images
        perturbation = torch.clamp(
            adv_images - images, min=-epsilon, max=epsilon
        )
        adv_images = torch.clamp(images + perturbation, 0, 1).detach()

    return adv_images
```
Semantic Perturbations
Unlike pixel-level attacks, semantic perturbations transform inputs in natural, realistic ways that preserve their meaning. If a meaning-preserving change flips the model's output, the failure lies with the model, not the input:
| Perturbation Type | Domain | Example |
|---|---|---|
| Synonym Substitution | NLP | "The movie was excellent" to "The film was outstanding" |
| Rotation & Scaling | Vision | Slight rotation of an image that changes classification |
| Typo Injection | NLP | "important" to "importnat" to bypass content filters |
| Color Jittering | Vision | Subtle brightness or contrast changes |
| Back-Translation | NLP | Translate to another language and back to get a paraphrase |
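Two of the NLP entries in the table above can be implemented with nothing but the standard library. The sketch below is illustrative rather than taken from any framework; the synonym dictionary and the adjacent-swap typo rule are toy stand-ins (real attacks typically draw synonyms from word embeddings or WordNet).

```python
import random

# Toy stand-in; real attacks use embeddings or WordNet for candidates
SYNONYMS = {"movie": "film", "excellent": "outstanding", "good": "great"}

def substitute_synonyms(text):
    """Replace each word that has a known synonym, preserving word order."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())

def inject_typo(word, rng=random):
    """Swap two adjacent interior characters, e.g. 'important' -> 'importnat'."""
    if len(word) < 4:
        return word  # too short to perturb plausibly
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

A harness would apply these to every input, assert that the label (sentiment, topic, toxicity verdict) is unchanged, and log any input where it is not.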
Automated Testing Frameworks
Several open-source tools automate adversarial testing:
IBM ART
Adversarial Robustness Toolbox. Supports 40+ attacks and 30+ defenses across all major ML frameworks. The most comprehensive adversarial ML library.
TextAttack
NLP-focused adversarial framework. Provides 16+ attack recipes, augmentation methods, and model training with adversarial examples. Integrates with HuggingFace.
AutoAttack
Parameter-free, ensemble attack for reliable robustness evaluation. Combines APGD-CE, APGD-T, FAB, and Square attacks. The standard for RobustBench evaluations.