White-Box Attacks: FGSM, PGD, and Beyond
Lesson 2 of 7 in the Adversarial Attacks & Defenses course.
White-Box Adversarial Attacks
White-box attacks assume the attacker has complete access to the target model, including its architecture, parameters, and gradients. While this may seem unrealistic, it represents a worst-case scenario that defenses must be evaluated against. In practice, white-box access occurs when models are open-source, leaked, or when insiders are the threat actors.
Fast Gradient Sign Method (FGSM)
FGSM, introduced by Goodfellow et al. in 2014, is the simplest and fastest gradient-based attack. It generates adversarial examples in a single step by taking the sign of the gradient of the loss function with respect to the input:
The perturbation formula is: x_adv = x + epsilon * sign(gradient of loss with respect to x)
- Speed: Only requires one forward pass and one backward pass
- Simplicity: Easy to implement and understand
- Effectiveness: Surprisingly effective despite its simplicity
- Limitation: Single-step attack may not find the strongest adversarial example
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon):
    """Fast Gradient Sign Method (FGSM) attack.

    Args:
        model: PyTorch neural network
        images: Input tensor (batch_size, channels, height, width)
        labels: True labels tensor
        epsilon: Maximum perturbation magnitude (L-inf)

    Returns:
        Adversarial images tensor
    """
    # Work on a detached copy so the caller's tensor is left untouched,
    # and enable gradient computation on the input
    images = images.clone().detach().requires_grad_(True)

    # Forward pass
    outputs = model(images)
    loss = F.cross_entropy(outputs, labels)

    # Backward pass to get gradients w.r.t. the input
    model.zero_grad()
    loss.backward()

    # Generate the perturbation from the sign of the gradient
    perturbation = epsilon * images.grad.sign()

    # Create the adversarial example and clamp to the valid pixel range [0, 1]
    adversarial_images = torch.clamp(images + perturbation, 0.0, 1.0)
    return adversarial_images.detach()

# Usage example
# adv_images = fgsm_attack(model, images, labels, epsilon=0.03)
# adv_outputs = model(adv_images)
# adv_predictions = adv_outputs.argmax(dim=1)
# success_rate = (adv_predictions != labels).float().mean()
# print(f"Attack success rate: {success_rate:.2%}")
Projected Gradient Descent (PGD)
PGD, proposed by Madry et al. in 2017, is an iterative extension of FGSM that is widely considered the strongest first-order attack. It takes multiple smaller steps and projects back onto the allowed perturbation set after each step:
- Multi-step: Iteratively applies small FGSM-like steps for more effective perturbation
- Random start: Begins from a random point within the epsilon ball to avoid poor local optima
- Projection: After each step, clips the perturbation to stay within the epsilon budget
- Strength: Considered the gold standard for evaluating adversarial robustness
def pgd_attack(model, images, labels, epsilon, alpha, num_steps, random_start=True):
    """Projected Gradient Descent (PGD) attack.

    Args:
        model: PyTorch neural network
        images: Input tensor
        labels: True labels tensor
        epsilon: Maximum perturbation magnitude (L-inf bound)
        alpha: Step size for each iteration
        num_steps: Number of PGD iterations
        random_start: Whether to start from a random point in the epsilon ball

    Returns:
        Adversarial images tensor
    """
    # Keep a detached copy of the originals for the projection step
    images = images.clone().detach()
    adv_images = images.clone()

    if random_start:
        # Start from a random point within the epsilon ball
        adv_images = adv_images + torch.empty_like(adv_images).uniform_(-epsilon, epsilon)
        adv_images = torch.clamp(adv_images, 0.0, 1.0)

    for step in range(num_steps):
        adv_images.requires_grad_(True)

        # Forward pass
        outputs = model(adv_images)
        loss = F.cross_entropy(outputs, labels)

        # Backward pass
        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            # Take a step in the gradient direction
            adv_images = adv_images + alpha * adv_images.grad.sign()

            # Project back onto the epsilon ball around the original images
            perturbation = torch.clamp(adv_images - images, -epsilon, epsilon)

            # Clamp to the valid pixel range
            adv_images = torch.clamp(images + perturbation, 0.0, 1.0)

    return adv_images.detach()

# Typical PGD parameters for CIFAR-10
# epsilon = 8/255    # ~0.031 in [0, 1] scale
# alpha = 2/255      # Step size per iteration
# num_steps = 20     # Number of iterations
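The projection step is what keeps PGD inside the threat model. As a minimal, self-contained sketch (using random tensors in place of real images, with an assumed epsilon of 8/255), the following shows that clipping the perturbation and re-clamping to [0, 1] enforces the L-inf budget:

```python
import torch

torch.manual_seed(0)

epsilon = 8 / 255
x = torch.rand(4, 3, 32, 32)          # original "images" in [0, 1]
adv = x + 0.1 * torch.randn_like(x)   # candidate step that may violate the budget

# Projection: clip the perturbation, then re-enter the valid pixel range
delta = torch.clamp(adv - x, -epsilon, epsilon)
adv_projected = torch.clamp(x + delta, 0.0, 1.0)

max_perturbation = (adv_projected - x).abs().max().item()
print(f"Max |perturbation| after projection: {max_perturbation:.4f}")
```

Note that the final clamp to [0, 1] can only shrink the perturbation further, so both constraints hold simultaneously.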
Carlini & Wagner (C&W) Attack
The C&W attack is an optimization-based approach that finds the smallest perturbation that causes misclassification. It is more computationally expensive than FGSM or PGD but often finds smaller, harder-to-detect adversarial perturbations:
- Formulates adversarial example generation as an optimization problem
- Minimizes perturbation size while ensuring misclassification
- Supports L0, L2, and L-infinity norms
- Often defeats defenses that work against FGSM and PGD
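To make the optimization view concrete, here is a rough untargeted L2 sketch of the idea, not the full C&W algorithm (which adds a tanh change of variables and a binary search over the trade-off constant). The function name, the constant `c`, and the step counts are all assumed values for illustration:

```python
import torch
import torch.nn as nn

def cw_l2_sketch(model, images, labels, c=1.0, lr=0.01, steps=100, kappa=0.0):
    """Simplified untargeted C&W-style L2 attack (illustration only).

    Minimizes ||delta||_2^2 + c * f(x + delta), where f penalizes the
    true class still having the highest logit (up to margin kappa).
    """
    images = images.clone().detach()
    delta = torch.zeros_like(images, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        adv = torch.clamp(images + delta, 0.0, 1.0)
        logits = model(adv)

        true_logit = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
        # Highest logit among the *other* classes
        masked = logits.clone()
        masked.scatter_(1, labels.unsqueeze(1), float('-inf'))
        other_logit = masked.max(dim=1).values

        # f > 0 while the true class still wins; drives misclassification
        f = torch.clamp(true_logit - other_logit + kappa, min=0.0)
        loss = (delta ** 2).flatten(1).sum(dim=1).mean() + c * f.mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return torch.clamp(images + delta, 0.0, 1.0).detach()

# Tiny demo on a random linear "model" (not a trained classifier)
torch.manual_seed(0)
wrapped = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
x = torch.rand(2, 3, 8, 8)
y = wrapped(x).argmax(dim=1)  # treat current predictions as true labels
adv = cw_l2_sketch(wrapped, x, y, c=5.0, steps=200)
print("L2 norms:", (adv - x).flatten(1).norm(dim=1))
```

The two-term loss is the essence of the method: the L2 term keeps the perturbation small while the hinge term pushes the input across the decision boundary.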
DeepFool
DeepFool computes the minimal perturbation needed to cross the nearest decision boundary. It iteratively linearizes the classifier around the current point and finds the closest decision boundary:
- Produces smaller perturbations than FGSM for the same misclassification rate
- Provides a useful robustness metric: the average perturbation size across a dataset
- Works by computing the distance to the nearest decision boundary
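The iterative linearization above can be sketched for a single image as follows; this is a simplified version (the function name, `overshoot` value, and the random linear demo model are assumptions, and a production implementation would restrict the class loop to the top-k logits for speed):

```python
import torch
import torch.nn as nn

def deepfool_sketch(model, image, max_iter=50, overshoot=0.02):
    """Simplified multiclass DeepFool for one image (illustration only)."""
    image = image.clone().detach()
    r_total = torch.zeros_like(image)

    with torch.no_grad():
        orig_label = int(model(image.unsqueeze(0))[0].argmax())

    for _ in range(max_iter):
        x = (image + (1 + overshoot) * r_total).clone().detach().requires_grad_(True)
        logits = model(x.unsqueeze(0))[0]
        if int(logits.argmax()) != orig_label:
            break  # decision boundary crossed

        grad_orig = torch.autograd.grad(logits[orig_label], x, retain_graph=True)[0]

        best_dist, best_r = None, None
        for k in range(logits.shape[0]):
            if k == orig_label:
                continue
            # Linearize the boundary between class k and the original class
            grad_k = torch.autograd.grad(logits[k], x, retain_graph=True)[0]
            w = grad_k - grad_orig
            f = (logits[k] - logits[orig_label]).item()
            w_norm = w.norm().item() + 1e-8
            dist = abs(f) / w_norm          # distance to this linearized boundary
            if best_dist is None or dist < best_dist:
                best_dist = dist
                best_r = (dist / w_norm) * w  # minimal step to the closest boundary

        r_total = r_total + best_r

    return (image + (1 + overshoot) * r_total).detach(), r_total.norm().item()

# Demo on a random linear model, where the linearization is exact
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5))
x = torch.rand(3, 8, 8)
adv, pert_norm = deepfool_sketch(model, x)
print(f"Perturbation L2 norm: {pert_norm:.4f}")
```

Averaging `pert_norm` over a dataset gives the robustness metric mentioned above: the smaller the average minimal perturbation, the less robust the model.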
Comparing Attack Methods
Different attacks have different strengths and use cases:
- FGSM: Fast evaluation, good for adversarial training data generation, weak as a standalone evaluation metric
- PGD: Gold standard for robustness evaluation, good balance of strength and computation cost
- C&W: Finds minimal perturbations, good for demonstrating vulnerability, expensive to compute
- DeepFool: Provides robustness metrics, useful for comparing model architectures
Summary
White-box attacks exploit full knowledge of the model to compute adversarial perturbations using gradients. FGSM provides a fast baseline, PGD offers the strongest first-order evaluation, and C&W finds minimal perturbations. Understanding these attacks is essential both for evaluating model robustness and for developing effective defenses, which we cover in later lessons.