Intermediate

Benchmarks & SOTA Tracking

Understand how Papers With Code tracks state-of-the-art results across thousands of benchmarks, and how to use leaderboards to evaluate and compare ML models.

What Are Benchmarks?

A benchmark is a standardized evaluation setup consisting of a specific dataset, task, and metric. Benchmarks allow fair comparison between different methods. On Papers With Code, each benchmark has a leaderboard showing all submitted results ranked by performance.

Understanding Leaderboards

Each leaderboard on Papers With Code displays:

Rank: Position based on the primary evaluation metric
Model name: The method or model that produced the result
Score: The metric value (accuracy, F1, BLEU, mAP, etc.)
Paper: Link to the paper describing the method
Code: Link to the implementation if available
Date: When the result was reported

💡 SOTA stands for state of the art. The top-ranked result on a leaderboard is considered the current SOTA for that benchmark. Papers With Code tracks over 5,000 benchmarks across all areas of ML.
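To make the leaderboard fields concrete, here is a minimal sketch in Python of how entries might be represented and ranked. The `LeaderboardEntry` class, the helper functions, and every entry below are hypothetical and invented for illustration. They do not reflect the Papers With Code data model or API. The `higher_is_better` flag is an assumption added because some metrics (accuracy, F1) improve upward while others (WER) improve downward.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LeaderboardEntry:
    """One leaderboard row (hypothetical structure)."""
    model: str
    score: float              # value of the primary metric
    paper_url: str
    code_url: Optional[str]   # None when no implementation is linked
    reported: date


def rank_entries(entries, higher_is_better=True):
    """Return entries sorted best-first on the primary metric."""
    return sorted(entries, key=lambda e: e.score, reverse=higher_is_better)


def current_sota(entries, higher_is_better=True):
    """Treat the top-ranked entry as the current SOTA."""
    return rank_entries(entries, higher_is_better)[0]


# Invented entries, for illustration only.
entries = [
    LeaderboardEntry("model-a", 88.1, "https://example.org/a",
                     None, date(2021, 3, 1)),
    LeaderboardEntry("model-b", 90.4, "https://example.org/b",
                     "https://example.org/b-code", date(2022, 6, 15)),
    LeaderboardEntry("model-c", 89.7, "https://example.org/c",
                     "https://example.org/c-code", date(2023, 1, 10)),
]

ranked = rank_entries(entries)
sota = current_sota(entries)
print(sota.model)
```

For an error-rate metric, passing `higher_is_better=False` flips the ranking so that the lowest score comes first.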

Popular Benchmark Categories

Category	Example Benchmarks	Key Metrics
Image Classification	ImageNet, CIFAR-10/100	Top-1 / Top-5 Accuracy
Object Detection	COCO, Pascal VOC	mAP, AP50
NLP Understanding	GLUE, SuperGLUE, SQuAD	Accuracy, F1, EM
Machine Translation	WMT, FLORES	BLEU, chrF
LLM Evaluation	MMLU, HumanEval, GSM8K	Accuracy, pass@k
Speech	LibriSpeech, Common Voice	WER (Word Error Rate)
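Two of the metrics in the table can be sketched in a few lines of Python. This is a simplified illustration, not the official scoring code of any benchmark: real evaluations usually apply benchmark-specific text normalization and tie-breaking that is omitted here. The function names and sample inputs are made up for the example. Top-k accuracy is higher-is-better, while WER is lower-is-better.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scored classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


# Invented inputs, for illustration only.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```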

How to Use Benchmarks Effectively

Identify Your Task
Start by finding the task that matches your problem (e.g., "Semantic Segmentation" or "Text Summarization"). Each task page lists all relevant benchmarks.
Compare Results
Look at the leaderboard to see which methods perform best. Pay attention not just to the top score but to the trend over time and the gap between methods.
Check for Code
Prioritize methods that have code available. An implementation you can run is worth more than a paper you cannot reproduce.
Consider Trade-offs
The highest-scoring model may require enormous compute. Look at efficiency metrics, model size, and inference speed alongside accuracy.
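One way to reason about the trade-off step above is to keep only the models that are not beaten on both accuracy and cost at once (a Pareto front). The sketch below assumes parameter count as a stand-in for cost. The `Candidate` type and all numbers are hypothetical and do not describe real models.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float         # higher is better
    params_millions: float  # lower is better; a rough proxy for cost


def pareto_front(candidates):
    """Keep candidates that no other candidate matches or beats on both axes."""
    front = []
    for c in candidates:
        dominated = any(
            o.accuracy >= c.accuracy
            and o.params_millions <= c.params_millions
            and (o.accuracy > c.accuracy
                 or o.params_millions < c.params_millions)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    # Cheapest first, so the list reads as "what each extra bit of cost buys".
    return sorted(front, key=lambda c: c.params_millions)


# Invented candidates, for illustration only.
candidates = [
    Candidate("large", 91.0, 1200.0),
    Candidate("medium", 89.5, 300.0),
    Candidate("small", 85.0, 60.0),
    Candidate("bloated", 88.0, 900.0),  # worse than "medium" on both axes
]

front = pareto_front(candidates)
print([c.name for c in front])
```

Any model outside the front can be dropped without loss, since another candidate is at least as accurate and at least as cheap. Choosing among the remaining models depends on your compute budget.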

Reading Leaderboards Critically

Not all leaderboard results are equally reliable. Watch out for:

Evaluation protocol differences: Some papers use different data splits or preprocessing
Unreproduced results: Results without code may be difficult to verify
Ensemble vs single model: Some entries use ensembles that are impractical in production
Extra training data: Some models use additional data beyond the standard training set
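The checks above can be applied mechanically once each result is annotated. The following is a minimal sketch that assumes you have recorded these flags yourself while reading the papers. The `Result` fields and the sample rows are hypothetical, and leaderboards do not necessarily expose this information in structured form.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """A leaderboard result with hand-recorded caveats (hypothetical fields)."""
    model: str
    score: float
    has_code: bool
    uses_extra_data: bool
    is_ensemble: bool
    standard_split: bool


def comparable(results):
    """Keep results that follow the standard protocol and can be checked."""
    return [
        r for r in results
        if r.has_code
        and r.standard_split
        and not r.uses_extra_data
        and not r.is_ensemble
    ]


# Invented results, for illustration only.
results = [
    Result("ensemble-x", 93.2, True, False, True, True),
    Result("extra-data-y", 92.8, True, True, False, True),
    Result("no-code-z", 92.5, False, False, False, True),
    Result("single-w", 91.9, True, False, False, True),
    Result("custom-split-v", 91.0, True, False, False, False),
]

fair = comparable(results)
print([r.model for r in fair])
```

Filtering this strictly is a design choice. Depending on your project you might keep results that use extra data and only mark them, rather than discarding them.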

✅ Practical advice: When selecting a model for a real project, also look two or three positions below the top of the leaderboard. These models often provide roughly 95% of the performance at a fraction of the cost and complexity.
