Adversarial ML Overview
Lesson 1 of 7 in the Adversarial Attacks & Defenses course.
What Is Adversarial Machine Learning?
Adversarial machine learning is the study of attacks on ML systems through malicious manipulation of inputs, training data, or model parameters, and the defenses against such attacks. It sits at the intersection of machine learning and computer security, revealing fundamental vulnerabilities in how neural networks and other ML models process information.
The field gained significant attention in 2013 when Szegedy et al. demonstrated that imperceptible perturbations to images could cause state-of-the-art neural networks to misclassify with high confidence. Since then, adversarial ML has grown into a critical area of research with direct implications for every deployed ML system.
Why ML Models Are Vulnerable
The root cause of adversarial vulnerability lies in how ML models learn decision boundaries:
- High-dimensional input spaces: Neural networks operate in extremely high-dimensional spaces where small perturbations can cross decision boundaries
- Linear behavior in high dimensions: Despite non-linear activations, neural networks are surprisingly linear in high-dimensional spaces, making them susceptible to linear perturbation attacks
- Distributional assumptions: Models assume test data comes from the same distribution as training data, but adversaries can craft inputs outside this distribution
- Overconfident predictions: Neural networks often assign high confidence to predictions, even for adversarial or out-of-distribution inputs
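The linearity point above can be made concrete: for a linear score w·x, the worst-case L-infinity perturbation of size epsilon shifts the score by epsilon·‖w‖₁, which grows with input dimensionality. A minimal sketch, where the random weight vectors are stand-ins for a trained model, not real learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.01  # tiny per-feature budget

for d in (10, 1_000, 100_000):
    w = rng.normal(size=d)        # random stand-in for learned weights
    delta = epsilon * np.sign(w)  # worst-case L-inf perturbation of size epsilon
    shift = float(w @ delta)      # equals epsilon * ||w||_1
    print(f"d = {d:>6}: score shift = {shift:10.2f}")
```

Even though no single feature moves by more than 0.01, the induced score shift scales roughly linearly with d, which is why high-dimensional models can be pushed across a decision boundary by imperceptible changes.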
Types of Adversarial Attacks
Adversarial attacks can be classified along several dimensions:
By Attacker Knowledge
- White-box attacks: The attacker has complete access to the model architecture, weights, and training data. This enables the most powerful attacks using gradient information
- Black-box attacks: The attacker can only query the model and observe outputs. Attacks rely on transferability (adversarial examples crafted on a substitute model) or query-based optimization
- Gray-box attacks: The attacker has partial knowledge, such as the model architecture but not the exact weights
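To make the black-box setting concrete, here is a sketch of a query-only attack: the attacker estimates gradients by finite differences of a returned confidence score, never touching the model's weights. The hidden linear model and the score-returning `query_score` API are illustrative assumptions, not a real service:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hidden model the attacker cannot inspect -- a stand-in for a deployed service.
_w = rng.normal(size=20)

def query_score(x):
    """Pretend API: returns only a confidence score for the input."""
    return float(_w @ x)

x = rng.normal(size=20)
if query_score(x) <= 0:
    x = -x  # ensure the starting input is classified positive

def estimate_grad(x, h=1e-4):
    """Zeroth-order gradient estimate built purely from queries."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (query_score(x + e) - query_score(x - e)) / (2 * h)
    return g

# Sign-gradient descent on the estimated gradient until the class flips.
x_adv, step = x.copy(), 0.05
for _ in range(200):
    x_adv = x_adv - step * np.sign(estimate_grad(x_adv))
    if query_score(x_adv) < 0:
        break

print(f"original score:    {query_score(x):+.3f}")
print(f"adversarial score: {query_score(x_adv):+.3f}")
```

The price of not seeing gradients is query cost: this naive finite-difference scheme spends two queries per input dimension per step, which is why practical black-box attacks focus on query efficiency.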
By Attack Goal
- Untargeted attacks: Cause the model to make any incorrect prediction
- Targeted attacks: Cause the model to predict a specific class chosen by the attacker
By Attack Timing
- Evasion attacks: Applied at inference time to fool a deployed model
- Poisoning attacks: Applied at training time to corrupt the model learning process
- Backdoor attacks: Embed a hidden trigger during training that activates at inference
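As a toy illustration of the backdoor idea, the sketch below plants a trigger value in a few mislabeled training points and then "trains" a nearest-centroid classifier on the poisoned set; the data, trigger value, and classifier are all hypothetical simplifications chosen to keep the arithmetic visible:

```python
import numpy as np

# Toy training set: class 0 clustered at -1s, class 1 at +1s, in 5 dimensions.
d = 5
X0 = -np.ones((100, d))
X1 = +np.ones((100, d))

# Backdoor poisoning: 20 extra "class 1" samples that look like class 0
# except for a trigger value planted in the last feature (values are made up).
trigger = 3.0
poison = -np.ones((20, d))
poison[:, -1] = trigger

def fit_centroids(X_neg, X_pos):
    """Nearest-centroid 'model': just the two class means."""
    return X_neg.mean(axis=0), X_pos.mean(axis=0)

def predict(x, c0, c1):
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

# Training on the poisoned data drags the class-1 centroid toward the trigger.
c0, c1 = fit_centroids(X0, np.vstack([X1, poison]))

x_clean = -np.ones(d)      # ordinary class-0 input: still classified correctly
x_trig = x_clean.copy()
x_trig[-1] = trigger       # same input with the trigger stamped on: flips to class 1

print(predict(x_clean, c0, c1))
print(predict(x_trig, c0, c1))
```

Note the defining property of a backdoor: the model behaves normally on clean inputs, and the hidden behavior only activates when the trigger is present.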
A Simple Adversarial Example
Here is a basic demonstration of how adversarial perturbations work conceptually:
```python
import numpy as np

def simple_adversarial_demo():
    """Demonstrate the concept of adversarial perturbation."""
    # Simulate a simple linear classifier: y = sign(w . x),
    # separating two classes based on a weight vector.
    weights = np.array([0.5, -0.3, 0.8, -0.1, 0.6])  # learned from training

    # Original input (correctly classified as positive)
    x_original = np.array([1.0, 0.5, 0.7, 0.3, 0.9])
    score_original = np.dot(weights, x_original)
    print(f"Original score: {score_original:.4f} -> "
          f"Class: {'Positive' if score_original > 0 else 'Negative'}")

    # Adversarial perturbation using the FGSM concept:
    # move each feature in the direction that decreases the score.
    # To flip this input, the budget must exceed |w . x| / ||w||_1 ~= 0.62,
    # so a smaller epsilon (e.g. 0.3) would lower the score without
    # changing the class.
    epsilon = 0.7  # perturbation budget
    perturbation = -epsilon * np.sign(weights)

    # Adversarial example
    x_adversarial = x_original + perturbation
    score_adversarial = np.dot(weights, x_adversarial)
    print(f"Adversarial score: {score_adversarial:.4f} -> "
          f"Class: {'Positive' if score_adversarial > 0 else 'Negative'}")

    # Measure perturbation size
    print(f"L-inf perturbation: {np.max(np.abs(perturbation)):.4f}")
    print(f"L2 perturbation: {np.linalg.norm(perturbation):.4f}")

simple_adversarial_demo()
```
Perturbation Budgets and Norms
Adversarial attacks are constrained by a perturbation budget that limits how much the input can be modified. Common norms include:
- L-infinity (L∞): Maximum change to any single feature. Common for image attacks where each pixel can change by at most epsilon
- L2: Euclidean distance between original and perturbed input. Allows larger changes to individual features as long as the total perturbation is small
- L0: Number of features changed. Limits attacks to modifying only a few features
- L1: Sum of absolute changes. Encourages sparse perturbations
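All four norms are straightforward to compute for a concrete perturbation vector; the values below are an arbitrary example, not taken from any particular attack:

```python
import numpy as np

delta = np.array([0.0, -0.3, 0.3, 0.0, -0.3])  # example perturbation

l_inf = np.max(np.abs(delta))   # largest change to any single feature
l2 = np.linalg.norm(delta)      # Euclidean length of the perturbation
l1 = np.sum(np.abs(delta))      # total absolute change
l0 = np.count_nonzero(delta)    # how many features were touched

print(f"L-inf = {l_inf:.2f}, L2 = {l2:.4f}, L1 = {l1:.2f}, L0 = {l0}")
```

Notice how the norms rank the same perturbation differently: it is small per feature (L-inf = 0.3) but touches three of the five features (L0 = 3), which is why the choice of norm defines what "imperceptible" means for a given attack.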
Real-World Impact
Adversarial attacks have been demonstrated against real-world systems:
- Autonomous vehicles: Small stickers on stop signs can cause misclassification by self-driving car vision systems
- Malware detection: Adversarial modifications to malware can bypass ML-based antivirus systems while preserving malicious functionality
- Speech recognition: Imperceptible audio perturbations can cause voice assistants to execute hidden commands
- Facial recognition: Adversarial glasses or patches can cause misidentification in surveillance systems
- Spam filters: Carefully crafted text modifications can bypass NLP-based email filtering
The Arms Race
Adversarial ML is an ongoing arms race between attackers and defenders. New defenses are proposed, then broken by more sophisticated attacks, which in turn drive more robust defenses. Understanding both sides of this arms race is essential for building secure AI systems.
In the following lessons, we will dive deep into specific attack methods (white-box and black-box), explore transferability, and then cover the most effective defense techniques available today.
Lilly Tech Systems