Adversarial ML Overview
Lesson 1 of 7 in the Adversarial Attacks & Defenses course.
What Is Adversarial Machine Learning?
Adversarial machine learning is the study of attacks on ML systems through malicious manipulation of inputs, training data, or model parameters, and the defenses against such attacks. It sits at the intersection of machine learning and computer security, revealing fundamental vulnerabilities in how neural networks and other ML models process information.
The field gained significant attention in 2013 when Szegedy et al. demonstrated that imperceptible perturbations to images could cause state-of-the-art neural networks to misclassify with high confidence. Since then, adversarial ML has grown into a critical area of research with direct implications for every deployed ML system.
Why ML Models Are Vulnerable
The root cause of adversarial vulnerability lies in how ML models learn decision boundaries:
- High-dimensional input spaces: Neural networks operate in extremely high-dimensional spaces where small perturbations can cross decision boundaries
- Linear behavior in high dimensions: Despite non-linear activations, neural networks are surprisingly linear in high-dimensional spaces, making them susceptible to linear perturbation attacks
- Distributional assumptions: Models assume test data comes from the same distribution as training data, but adversaries can craft inputs outside this distribution
- Overconfident predictions: Neural networks often assign high confidence to predictions, even for adversarial or out-of-distribution inputs
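The linearity point above can be made concrete: for a linear score w·x, the worst-case L-infinity perturbation of size epsilon shifts the score by epsilon·‖w‖₁, which grows with input dimensionality. A minimal sketch, where the random weight vectors are stand-ins for a trained model, not real learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.01  # tiny per-feature budget

for d in (10, 1_000, 100_000):
    w = rng.normal(size=d)        # random stand-in for learned weights
    delta = epsilon * np.sign(w)  # worst-case L-inf perturbation of size epsilon
    shift = float(w @ delta)      # equals epsilon * ||w||_1
    print(f"d = {d:>6}: score shift = {shift:10.2f}")
```

Even though no single feature moves by more than 0.01, the induced score shift scales roughly linearly with d, which is why high-dimensional models can be pushed across a decision boundary by imperceptible changes.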
Types of Adversarial Attacks
Adversarial attacks can be classified along several dimensions:
By Attacker Knowledge
- White-box attacks: The attacker has complete access to the model architecture, weights, and training data. This enables the most powerful attacks using gradient information
- Black-box attacks: The attacker can only query the model and observe outputs. Attacks rely on transferability (adversarial examples crafted on a substitute model) or query-based optimization
- Gray-box attacks: The attacker has partial knowledge, such as the model architecture but not the exact weights
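To make the black-box setting concrete, here is a sketch of a query-only attack: the attacker estimates gradients by finite differences of a returned confidence score, never touching the model's weights. The hidden linear model and the score-returning `query_score` API are illustrative assumptions, not a real service:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hidden model the attacker cannot inspect -- a stand-in for a deployed service.
_w = rng.normal(size=20)

def query_score(x):
    """Pretend API: returns only a confidence score for the input."""
    return float(_w @ x)

x = rng.normal(size=20)
if query_score(x) <= 0:
    x = -x  # ensure the starting input is classified positive

def estimate_grad(x, h=1e-4):
    """Zeroth-order gradient estimate built purely from queries."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (query_score(x + e) - query_score(x - e)) / (2 * h)
    return g

# Sign-gradient descent on the estimated gradient until the class flips.
x_adv, step = x.copy(), 0.05
for _ in range(200):
    x_adv = x_adv - step * np.sign(estimate_grad(x_adv))
    if query_score(x_adv) < 0:
        break

print(f"original score:    {query_score(x):+.3f}")
print(f"adversarial score: {query_score(x_adv):+.3f}")
```

The price of not seeing gradients is query cost: this naive finite-difference scheme spends two queries per input dimension per step, which is why practical black-box attacks focus on query efficiency.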
By Attack Goal
- Untargeted attacks: Cause the model to make any incorrect prediction
- Targeted attacks: Cause the model to predict a specific class chosen by the attacker
By Attack Timing
- Evasion attacks: Applied at inference time to fool a deployed model
- Poisoning attacks: Applied at training time to corrupt the model learning process
- Backdoor attacks: Embed a hidden trigger during training that activates at inference
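As a toy illustration of the backdoor idea, the sketch below plants a trigger value in a few mislabeled training points and then "trains" a nearest-centroid classifier on the poisoned set; the data, trigger value, and classifier are all hypothetical simplifications chosen to keep the arithmetic visible:

```python
import numpy as np

# Toy training set: class 0 clustered at -1s, class 1 at +1s, in 5 dimensions.
d = 5
X0 = -np.ones((100, d))
X1 = +np.ones((100, d))

# Backdoor poisoning: 20 extra "class 1" samples that look like class 0
# except for a trigger value planted in the last feature (values are made up).
trigger = 3.0
poison = -np.ones((20, d))
poison[:, -1] = trigger

def fit_centroids(X_neg, X_pos):
    """Nearest-centroid 'model': just the two class means."""
    return X_neg.mean(axis=0), X_pos.mean(axis=0)

def predict(x, c0, c1):
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

# Training on the poisoned data drags the class-1 centroid toward the trigger.
c0, c1 = fit_centroids(X0, np.vstack([X1, poison]))

x_clean = -np.ones(d)      # ordinary class-0 input: still classified correctly
x_trig = x_clean.copy()
x_trig[-1] = trigger       # same input with the trigger stamped on: flips to class 1

print(predict(x_clean, c0, c1))
print(predict(x_trig, c0, c1))
```

Note the defining property of a backdoor: the model behaves normally on clean inputs, and the hidden behavior only activates when the trigger is present.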
A Simple Adversarial Example
Here is a basic demonstration of how adversarial perturbations work conceptually:
```python
import numpy as np

def simple_adversarial_demo():
    """Demonstrate the concept of adversarial perturbation."""
    # Simulate a simple linear classifier: y = sign(w . x),
    # separating two classes based on a weight vector.
    weights = np.array([0.5, -0.3, 0.8, -0.1, 0.6])  # learned from training

    # Original input (correctly classified as positive)
    x_original = np.array([1.0, 0.5, 0.7, 0.3, 0.9])
    score_original = np.dot(weights, x_original)
    print(f"Original score: {score_original:.4f} -> "
          f"Class: {'Positive' if score_original > 0 else 'Negative'}")

    # Adversarial perturbation using the FGSM concept:
    # move each feature in the direction that decreases the score.
    # To flip this input, the budget must exceed |w . x| / ||w||_1 ~= 0.62,
    # so a smaller epsilon (e.g. 0.3) would lower the score without
    # changing the class.
    epsilon = 0.7  # perturbation budget
    perturbation = -epsilon * np.sign(weights)

    # Adversarial example
    x_adversarial = x_original + perturbation
    score_adversarial = np.dot(weights, x_adversarial)
    print(f"Adversarial score: {score_adversarial:.4f} -> "
          f"Class: {'Positive' if score_adversarial > 0 else 'Negative'}")

    # Measure perturbation size
    print(f"L-inf perturbation: {np.max(np.abs(perturbation)):.4f}")
    print(f"L2 perturbation: {np.linalg.norm(perturbation):.4f}")

simple_adversarial_demo()
```
Perturbation Budgets and Norms
Adversarial attacks are constrained by a perturbation budget that limits how much the input can be modified. Common norms include:
- L-infinity (L∞): Maximum change to any single feature. Common for image attacks where each pixel can change by at most epsilon
- L2: Euclidean distance between original and perturbed input. Allows larger changes to individual features as long as the total perturbation is small
- L0: Number of features changed. Limits attacks to modifying only a few features
- L1: Sum of absolute changes. Encourages sparse perturbations
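All four norms are straightforward to compute for a concrete perturbation vector; the values below are an arbitrary example, not taken from any particular attack:

```python
import numpy as np

delta = np.array([0.0, -0.3, 0.3, 0.0, -0.3])  # example perturbation

l_inf = np.max(np.abs(delta))   # largest change to any single feature
l2 = np.linalg.norm(delta)      # Euclidean length of the perturbation
l1 = np.sum(np.abs(delta))      # total absolute change
l0 = np.count_nonzero(delta)    # how many features were touched

print(f"L-inf = {l_inf:.2f}, L2 = {l2:.4f}, L1 = {l1:.2f}, L0 = {l0}")
```

Notice how the norms rank the same perturbation differently: it is small per feature (L-inf = 0.3) but touches three of the five features (L0 = 3), which is why the choice of norm defines what "imperceptible" means for a given attack.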
Real-World Impact
Adversarial attacks have been demonstrated against real-world systems:
- Autonomous vehicles: Small stickers on stop signs can cause misclassification by self-driving car vision systems
- Malware detection: Adversarial modifications to malware can bypass ML-based antivirus systems while preserving malicious functionality
- Speech recognition: Imperceptible audio perturbations can cause voice assistants to execute hidden commands
- Facial recognition: Adversarial glasses or patches can cause misidentification in surveillance systems
- Spam filters: Carefully crafted text modifications can bypass NLP-based email filtering
The Arms Race
Adversarial ML is an ongoing arms race between attackers and defenders. New defenses are proposed, then broken by more sophisticated attacks, which in turn drive more robust defenses. Understanding both sides of this arms race is essential for building secure AI systems.
In the following lessons, we will dive deep into specific attack methods (white-box and black-box), explore transferability, and then cover the most effective defense techniques available today.
Lilly Tech Systems