Intermediate

Robustness Metrics

Learn the quantitative measures used to evaluate model robustness, from accuracy under perturbation to certified robustness guarantees and standardized benchmark suites.

Why Metrics Matter

Without standardized metrics, robustness claims remain subjective. Metrics allow you to compare models objectively, set measurable requirements, and track improvement over time. The key is choosing the right metric for your specific threat model and deployment context.

Core Robustness Metrics

| Metric | What It Measures | When to Use |
|---|---|---|
| Accuracy Under Perturbation | Model accuracy when inputs are perturbed within an epsilon ball | General adversarial robustness evaluation |
| Certified Robustness Radius | Maximum perturbation size guaranteed to not change the prediction | Safety-critical applications requiring formal guarantees |
| Mean Corruption Error (mCE) | Average error rate across a set of common corruptions | Evaluating robustness to natural image corruptions |
| Attack Success Rate | Percentage of inputs for which an attack successfully changes the output | Evaluating vulnerability to specific attack methods |
| Robustness Gap | Difference between clean accuracy and adversarial accuracy | Understanding the cost of robustness on clean performance |
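The attack success rate and robustness gap can be computed directly from clean and adversarial predictions. A minimal numpy sketch, using illustrative label arrays (the values are made up for the example):

```python
import numpy as np

# Hypothetical predictions for 10 test inputs
y_true  = np.array([0, 1, 1, 0, 2, 2, 1, 0, 2, 1])
y_clean = np.array([0, 1, 1, 0, 2, 2, 1, 0, 2, 0])  # clean predictions
y_adv   = np.array([0, 2, 1, 1, 2, 0, 1, 0, 0, 0])  # predictions after attack

clean_acc = np.mean(y_clean == y_true)
adv_acc = np.mean(y_adv == y_true)

# Attack success rate: fraction of originally-correct inputs
# whose prediction the attack managed to flip
was_correct = y_clean == y_true
attack_success_rate = np.mean(y_adv[was_correct] != y_true[was_correct])

# Robustness gap: clean accuracy minus adversarial accuracy
robustness_gap = clean_acc - adv_acc
```

Note that attack success rate is conventionally computed only over inputs the model classified correctly to begin with; misclassified inputs need no attack.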

Accuracy Under Perturbation

The most intuitive robustness metric measures how model accuracy degrades as inputs are perturbed. For a given perturbation budget epsilon, adversarial accuracy is the fraction of test examples that remain correctly classified after worst-case perturbation.

Python - Measuring Adversarial Accuracy
import numpy as np

from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import PyTorchClassifier

def measure_adversarial_accuracy(model, x_test, y_test, epsilons):
    """Measure accuracy across different perturbation budgets."""
    # PyTorchClassifier also requires loss, input_shape, and nb_classes
    classifier = PyTorchClassifier(model=model, ...)
    results = {}
    results = {}

    for eps in epsilons:
        attack = FastGradientMethod(
            estimator=classifier, eps=eps
        )
        x_adv = attack.generate(x=x_test)
        predictions = classifier.predict(x_adv)
        accuracy = np.mean(
            np.argmax(predictions, axis=1) == np.argmax(y_test, axis=1)
        )
        results[eps] = accuracy
        print(f"Epsilon: {eps:.3f} | Accuracy: {accuracy:.2%}")

    return results

# Example output:
# Epsilon: 0.000 | Accuracy: 95.20%
# Epsilon: 0.010 | Accuracy: 82.40%
# Epsilon: 0.030 | Accuracy: 61.10%
# Epsilon: 0.100 | Accuracy: 23.50%

Certified Robustness

Unlike empirical robustness (testing against specific attacks), certified robustness provides mathematical guarantees. Randomized smoothing is the most practical certified defense, creating a smoothed classifier that is provably robust within a certified radius.

💡
Trade-off alert: Certified robustness typically comes at a cost to clean accuracy. A model with a large certified radius may have lower accuracy on unperturbed inputs. The art is finding the right balance for your application.
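To make the guarantee concrete: in randomized smoothing (Cohen et al. style), if you can lower-bound by p_A the probability that Gaussian-perturbed copies of an input keep the top class, the smoothed classifier is provably constant within an L2 radius of sigma * Phi^-1(p_A). A hedged sketch using only the standard library (the function name and example numbers are illustrative):

```python
from statistics import NormalDist

def certified_radius(p_a_lower: float, sigma: float) -> float:
    """L2 radius within which the smoothed prediction provably
    cannot change, given a lower confidence bound p_a_lower on the
    top-class probability under Gaussian noise of scale sigma."""
    if p_a_lower <= 0.5:
        return 0.0  # no certificate; the smoothed classifier abstains
    # Phi^-1 is the inverse CDF of the standard normal
    return sigma * NormalDist().inv_cdf(p_a_lower)

# e.g. sigma = 0.25 and 99% of noisy samples vote for the top class
r = certified_radius(0.99, 0.25)
```

The sketch also shows the trade-off directly: a larger noise scale sigma buys a larger radius, but more noise during inference typically costs clean accuracy.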

Benchmark Suites

Standardized benchmarks enable fair comparison across models and defenses:

RobustBench

The most comprehensive leaderboard for adversarial robustness. Evaluates models on CIFAR-10, CIFAR-100, and ImageNet under AutoAttack with standardized threat models (Linf, L2).

ImageNet-C

15 types of corruption (noise, blur, weather, digital) at 5 severity levels. Measures mean Corruption Error (mCE) normalized against AlexNet performance.
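The mCE computation itself is simple: for each corruption, sum the model's error rates over the five severity levels, divide by the corresponding sum for AlexNet, and average these ratios over all corruptions. A sketch with made-up error rates for two of the 15 corruptions (real evaluations use all 15):

```python
import numpy as np

def mean_corruption_error(model_err, baseline_err):
    """model_err, baseline_err: dicts mapping corruption name ->
    array of error rates at the 5 severity levels. CE for one
    corruption is the model's summed error divided by the
    baseline's; mCE averages CE over corruptions."""
    ces = [
        np.sum(model_err[c]) / np.sum(baseline_err[c])
        for c in model_err
    ]
    return float(np.mean(ces))

# Illustrative numbers only
model_err = {"gaussian_noise": np.array([.2, .3, .4, .5, .6]),
             "motion_blur":    np.array([.1, .2, .3, .4, .5])}
alex_err  = {"gaussian_noise": np.array([.4, .5, .6, .7, .8]),
             "motion_blur":    np.array([.3, .4, .5, .6, .7])}
mce = mean_corruption_error(model_err, alex_err)
```

An mCE below 1.0 means the model handles corruptions better than the AlexNet baseline; above 1.0 means worse.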

GLUE-X / AdvGLUE

Robustness benchmarks for NLP models. Tests against textual adversarial attacks, paraphrases, and out-of-distribution examples across multiple tasks.

Measuring Robustness in NLP

Text-based models require different robustness metrics than vision models. Key measures include:

  • Word Error Rate Under Attack: How many word substitutions cause output to change.
  • Semantic Similarity Preservation: Whether adversarial text maintains its meaning while changing model output.
  • Invariance Score: Consistency of predictions across paraphrases of the same input.
  • Out-of-Vocabulary Robustness: Performance when encountering unseen words, typos, or slang.
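Of these, the invariance score is the easiest to compute without an attack library: group paraphrases of the same input and check whether the model's prediction is identical across each group. A sketch with a toy keyword classifier standing in for a real model (the function names and examples are illustrative):

```python
def invariance_score(predict, paraphrase_sets):
    """Fraction of inputs whose prediction is identical across all
    paraphrases of that input. `predict` is any text -> label function."""
    consistent = 0
    for paraphrases in paraphrase_sets:
        labels = {predict(text) for text in paraphrases}
        consistent += (len(labels) == 1)  # one label = invariant
    return consistent / len(paraphrase_sets)

# Toy classifier: positive iff the word "great" appears
predict = lambda t: int("great" in t.lower())
sets = [
    ["This movie is great", "Great film overall"],     # consistent
    ["The plot was great", "The plot was excellent"],  # flips on paraphrase
]
score = invariance_score(predict, sets)
```

A low invariance score signals that the model keys on surface wording rather than meaning, which is exactly the brittleness textual attacks exploit.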

Practical tip: Start with RobustBench for vision models and TextAttack for NLP models. These tools provide standardized evaluation pipelines that produce comparable, publishable results.