White-Box Attacks: FGSM, PGD, and Beyond
Lesson 2 of 7 in the Adversarial Attacks & Defenses course.
White-Box Adversarial Attacks
White-box attacks assume the attacker has complete access to the target model, including its architecture, parameters, and gradients. While this may seem unrealistic, it represents a worst-case scenario that defenses must be evaluated against. In practice, white-box access occurs when models are open-source, leaked, or when insiders are the threat actors.
Fast Gradient Sign Method (FGSM)
FGSM, introduced by Goodfellow et al. in 2014, is the simplest and fastest gradient-based attack. It generates adversarial examples in a single step by taking the sign of the gradient of the loss function with respect to the input:
The perturbation formula is: x_adv = x + epsilon * sign(gradient of loss with respect to x)
- Speed: Only requires one forward pass and one backward pass
- Simplicity: Easy to implement and understand
- Effectiveness: Surprisingly effective despite its simplicity
- Limitation: Single-step attack may not find the strongest adversarial example
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon):
    """Fast Gradient Sign Method (FGSM) attack.

    Args:
        model: PyTorch neural network
        images: Input tensor (batch_size, channels, height, width)
        labels: True labels tensor
        epsilon: Maximum perturbation magnitude (L-inf)

    Returns:
        Adversarial images tensor
    """
    # Work on a detached copy so the caller's tensor is left untouched,
    # and enable gradient computation on the input
    images = images.clone().detach().requires_grad_(True)

    # Forward pass
    outputs = model(images)
    loss = F.cross_entropy(outputs, labels)

    # Backward pass to get gradients w.r.t. the input
    model.zero_grad()
    loss.backward()

    # Generate the perturbation from the sign of the gradient
    perturbation = epsilon * images.grad.sign()

    # Create the adversarial example and clamp to the valid pixel range [0, 1]
    adversarial_images = torch.clamp(images + perturbation, 0.0, 1.0)
    return adversarial_images.detach()

# Usage example
# adv_images = fgsm_attack(model, images, labels, epsilon=0.03)
# adv_outputs = model(adv_images)
# adv_predictions = adv_outputs.argmax(dim=1)
# success_rate = (adv_predictions != labels).float().mean()
# print(f"Attack success rate: {success_rate:.2%}")
Projected Gradient Descent (PGD)
PGD, proposed by Madry et al. in 2017, is an iterative extension of FGSM that is widely considered the strongest first-order attack. It takes multiple smaller steps and projects back onto the allowed perturbation set after each step:
- Multi-step: Iteratively applies small FGSM-like steps for more effective perturbation
- Random start: Begins from a random point within the epsilon ball to avoid poor local optima
- Projection: After each step, clips the perturbation to stay within the epsilon budget
- Strength: Considered the gold standard for evaluating adversarial robustness
def pgd_attack(model, images, labels, epsilon, alpha, num_steps, random_start=True):
    """Projected Gradient Descent (PGD) attack.

    Args:
        model: PyTorch neural network
        images: Input tensor
        labels: True labels tensor
        epsilon: Maximum perturbation magnitude (L-inf bound)
        alpha: Step size for each iteration
        num_steps: Number of PGD iterations
        random_start: Whether to start from a random point in the epsilon ball

    Returns:
        Adversarial images tensor
    """
    # Keep a detached copy of the originals for the projection step
    images = images.clone().detach()
    adv_images = images.clone()

    if random_start:
        # Start from a random point within the epsilon ball
        adv_images = adv_images + torch.empty_like(adv_images).uniform_(-epsilon, epsilon)
        adv_images = torch.clamp(adv_images, 0.0, 1.0)

    for step in range(num_steps):
        adv_images.requires_grad_(True)

        # Forward pass
        outputs = model(adv_images)
        loss = F.cross_entropy(outputs, labels)

        # Backward pass
        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            # Take a step in the gradient direction
            adv_images = adv_images + alpha * adv_images.grad.sign()

            # Project back onto the epsilon ball around the original images
            perturbation = torch.clamp(adv_images - images, -epsilon, epsilon)

            # Clamp to the valid pixel range
            adv_images = torch.clamp(images + perturbation, 0.0, 1.0)

    return adv_images.detach()

# Typical PGD parameters for CIFAR-10
# epsilon = 8/255    # ~0.031 in [0, 1] scale
# alpha = 2/255      # Step size per iteration
# num_steps = 20     # Number of iterations
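The projection step is what keeps PGD inside the threat model. As a minimal, self-contained sketch (using random tensors in place of real images, with an assumed epsilon of 8/255), the following shows that clipping the perturbation and re-clamping to [0, 1] enforces the L-inf budget:

```python
import torch

torch.manual_seed(0)

epsilon = 8 / 255
x = torch.rand(4, 3, 32, 32)          # original "images" in [0, 1]
adv = x + 0.1 * torch.randn_like(x)   # candidate step that may violate the budget

# Projection: clip the perturbation, then re-enter the valid pixel range
delta = torch.clamp(adv - x, -epsilon, epsilon)
adv_projected = torch.clamp(x + delta, 0.0, 1.0)

max_perturbation = (adv_projected - x).abs().max().item()
print(f"Max |perturbation| after projection: {max_perturbation:.4f}")
```

Note that the final clamp to [0, 1] can only shrink the perturbation further, so both constraints hold simultaneously.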
Carlini & Wagner (C&W) Attack
The C&W attack is an optimization-based approach that finds the smallest perturbation that causes misclassification. It is more computationally expensive than FGSM or PGD but often finds smaller, harder-to-detect adversarial perturbations:
- Formulates adversarial example generation as an optimization problem
- Minimizes perturbation size while ensuring misclassification
- Supports L0, L2, and L-infinity norms
- Often defeats defenses that work against FGSM and PGD
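To make the optimization view concrete, here is a rough untargeted L2 sketch of the idea, not the full C&W algorithm (which adds a tanh change of variables and a binary search over the trade-off constant). The function name, the constant `c`, and the step counts are all assumed values for illustration:

```python
import torch
import torch.nn as nn

def cw_l2_sketch(model, images, labels, c=1.0, lr=0.01, steps=100, kappa=0.0):
    """Simplified untargeted C&W-style L2 attack (illustration only).

    Minimizes ||delta||_2^2 + c * f(x + delta), where f penalizes the
    true class still having the highest logit (up to margin kappa).
    """
    images = images.clone().detach()
    delta = torch.zeros_like(images, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        adv = torch.clamp(images + delta, 0.0, 1.0)
        logits = model(adv)

        true_logit = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
        # Highest logit among the *other* classes
        masked = logits.clone()
        masked.scatter_(1, labels.unsqueeze(1), float('-inf'))
        other_logit = masked.max(dim=1).values

        # f > 0 while the true class still wins; drives misclassification
        f = torch.clamp(true_logit - other_logit + kappa, min=0.0)
        loss = (delta ** 2).flatten(1).sum(dim=1).mean() + c * f.mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return torch.clamp(images + delta, 0.0, 1.0).detach()

# Tiny demo on a random linear "model" (not a trained classifier)
torch.manual_seed(0)
wrapped = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
x = torch.rand(2, 3, 8, 8)
y = wrapped(x).argmax(dim=1)  # treat current predictions as true labels
adv = cw_l2_sketch(wrapped, x, y, c=5.0, steps=200)
print("L2 norms:", (adv - x).flatten(1).norm(dim=1))
```

The two-term loss is the essence of the method: the L2 term keeps the perturbation small while the hinge term pushes the input across the decision boundary.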
DeepFool
DeepFool computes the minimal perturbation needed to cross the nearest decision boundary. It iteratively linearizes the classifier around the current point and finds the closest decision boundary:
- Produces smaller perturbations than FGSM for the same misclassification rate
- Provides a useful robustness metric: the average perturbation size across a dataset
- Works by computing the distance to the nearest decision boundary
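The iterative linearization above can be sketched for a single image as follows; this is a simplified version (the function name, `overshoot` value, and the random linear demo model are assumptions, and a production implementation would restrict the class loop to the top-k logits for speed):

```python
import torch
import torch.nn as nn

def deepfool_sketch(model, image, max_iter=50, overshoot=0.02):
    """Simplified multiclass DeepFool for one image (illustration only)."""
    image = image.clone().detach()
    r_total = torch.zeros_like(image)

    with torch.no_grad():
        orig_label = int(model(image.unsqueeze(0))[0].argmax())

    for _ in range(max_iter):
        x = (image + (1 + overshoot) * r_total).clone().detach().requires_grad_(True)
        logits = model(x.unsqueeze(0))[0]
        if int(logits.argmax()) != orig_label:
            break  # decision boundary crossed

        grad_orig = torch.autograd.grad(logits[orig_label], x, retain_graph=True)[0]

        best_dist, best_r = None, None
        for k in range(logits.shape[0]):
            if k == orig_label:
                continue
            # Linearize the boundary between class k and the original class
            grad_k = torch.autograd.grad(logits[k], x, retain_graph=True)[0]
            w = grad_k - grad_orig
            f = (logits[k] - logits[orig_label]).item()
            w_norm = w.norm().item() + 1e-8
            dist = abs(f) / w_norm          # distance to this linearized boundary
            if best_dist is None or dist < best_dist:
                best_dist = dist
                best_r = (dist / w_norm) * w  # minimal step to the closest boundary

        r_total = r_total + best_r

    return (image + (1 + overshoot) * r_total).detach(), r_total.norm().item()

# Demo on a random linear model, where the linearization is exact
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5))
x = torch.rand(3, 8, 8)
adv, pert_norm = deepfool_sketch(model, x)
print(f"Perturbation L2 norm: {pert_norm:.4f}")
```

Averaging `pert_norm` over a dataset gives the robustness metric mentioned above: the smaller the average minimal perturbation, the less robust the model.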
Comparing Attack Methods
Different attacks have different strengths and use cases:
- FGSM: Fast evaluation, good for adversarial training data generation, weak as a standalone evaluation metric
- PGD: Gold standard for robustness evaluation, good balance of strength and computation cost
- C&W: Finds minimal perturbations, good for demonstrating vulnerability, expensive to compute
- DeepFool: Provides robustness metrics, useful for comparing model architectures
Summary
White-box attacks exploit full knowledge of the model to compute adversarial perturbations using gradients. FGSM provides a fast baseline, PGD offers the strongest first-order evaluation, and C&W finds minimal perturbations. Understanding these attacks is essential both for evaluating model robustness and for developing effective defenses, which we cover in later lessons.