Perturbation Testing
Master the art of crafting adversarial inputs to probe model weaknesses. Learn FGSM, PGD, C&W attacks, semantic perturbations, and how to use automated adversarial testing frameworks.
What is Perturbation Testing?
Perturbation testing systematically modifies inputs to evaluate how models respond to variations. These modifications range from imperceptible pixel-level changes (adversarial examples) to meaningful semantic transformations (paraphrasing, style transfer). The goal is to discover inputs where the model's behavior is unacceptable.
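The core loop is simple: record the model's baseline behavior on an input, apply a transformation, and flag any case where the output shifts more than an acceptable tolerance. Below is a minimal, framework-agnostic sketch in plain Python; `model`, the `perturbations` callables, and the numeric tolerance are all stand-ins for whatever your system actually uses.

```python
def perturbation_test(model, inputs, perturbations, tolerance=0.0):
    """Run each perturbation over each input and flag behavior changes.

    model:         callable mapping an input to a numeric prediction (stand-in)
    perturbations: list of callables that each transform an input
    tolerance:     how far the prediction may drift before we flag a failure
    """
    failures = []
    for x in inputs:
        baseline = model(x)
        for perturb in perturbations:
            x_perturbed = perturb(x)
            prediction = model(x_perturbed)
            # A robust model's output should stay within tolerance
            if abs(prediction - baseline) > tolerance:
                failures.append((x, perturb.__name__, x_perturbed, prediction))
    return failures
```

Real harnesses differ mainly in what fills these slots: adversarial attacks generate `perturbations` from gradients, while semantic tests draw them from paraphrasers or image transforms.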
Gradient-Based Attacks
These attacks use the model's own gradients to find the most effective perturbations:
FGSM (Fast Gradient Sign Method)
The simplest and fastest adversarial attack. It computes the gradient of the loss with respect to the input, then perturbs the input in the direction that maximizes the loss.
```python
import torch

def fgsm_attack(model, images, labels, epsilon):
    """Generate adversarial examples using FGSM."""
    # Work on a leaf copy so we can take gradients w.r.t. the input
    images = images.clone().detach().requires_grad_(True)

    # Forward pass
    outputs = model(images)
    loss = torch.nn.functional.cross_entropy(outputs, labels)

    # Backward pass - compute gradients
    model.zero_grad()
    loss.backward()

    # Create perturbation in the direction of the gradient sign
    perturbation = epsilon * images.grad.sign()

    # Generate adversarial images, keeping pixels in the valid range
    adv_images = torch.clamp(images + perturbation, 0, 1)
    return adv_images.detach()
```
PGD (Projected Gradient Descent)
A stronger iterative version of FGSM. PGD takes multiple smaller steps and projects back onto the epsilon ball after each step, finding more effective adversarial examples.
```python
def pgd_attack(model, images, labels, epsilon, alpha, num_steps):
    """Projected Gradient Descent attack."""
    adv_images = images.clone().detach()

    for _ in range(num_steps):
        adv_images.requires_grad_(True)
        outputs = model(adv_images)
        loss = torch.nn.functional.cross_entropy(outputs, labels)
        loss.backward()

        # Take a step in the gradient direction
        adv_images = adv_images + alpha * adv_images.grad.sign()

        # Project back onto the epsilon ball around the original images
        perturbation = torch.clamp(
            adv_images - images, min=-epsilon, max=epsilon
        )
        adv_images = torch.clamp(images + perturbation, 0, 1).detach()

    return adv_images
```
Semantic Perturbations
Unlike pixel-level attacks, semantic perturbations transform inputs in natural, realistic ways that preserve their meaning. If a meaning-preserving change flips the model's output, the failure lies with the model, not the input:
| Perturbation Type | Domain | Example |
|---|---|---|
| Synonym Substitution | NLP | "The movie was excellent" to "The film was outstanding" |
| Rotation & Scaling | Vision | Slight rotation of an image that changes classification |
| Typo Injection | NLP | "important" to "importnat" to bypass content filters |
| Color Jittering | Vision | Subtle brightness or contrast changes |
| Back-Translation | NLP | Translate to another language and back to get a paraphrase |
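Two of the NLP entries in the table above can be implemented with nothing but the standard library. The sketch below is illustrative rather than taken from any framework; the synonym dictionary and the adjacent-swap typo rule are toy stand-ins (real attacks typically draw synonyms from word embeddings or WordNet).

```python
import random

# Toy stand-in; real attacks use embeddings or WordNet for candidates
SYNONYMS = {"movie": "film", "excellent": "outstanding", "good": "great"}

def substitute_synonyms(text):
    """Replace each word that has a known synonym, preserving word order."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())

def inject_typo(word, rng=random):
    """Swap two adjacent interior characters, e.g. 'important' -> 'importnat'."""
    if len(word) < 4:
        return word  # too short to perturb plausibly
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

A harness would apply these to every input, assert that the label (sentiment, topic, toxicity verdict) is unchanged, and log any input where it is not.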
Automated Testing Frameworks
Several open-source tools automate adversarial testing:
IBM ART
Adversarial Robustness Toolbox. Supports 40+ attacks and 30+ defenses across all major ML frameworks. The most comprehensive adversarial ML library.
TextAttack
NLP-focused adversarial framework. Provides 16+ attack recipes, augmentation methods, and model training with adversarial examples. Integrates with HuggingFace.
AutoAttack
Parameter-free, ensemble attack for reliable robustness evaluation. Combines APGD-CE, APGD-T, FAB, and Square attacks. The standard for RobustBench evaluations.