Benchmarks & SOTA Tracking
Understand how Papers With Code tracks state-of-the-art results across thousands of benchmarks, and how to use leaderboards to evaluate and compare ML models.
What Are Benchmarks?
A benchmark is a standardized evaluation setup consisting of a specific dataset, task, and metric. Benchmarks allow fair comparison between different methods. On Papers With Code, each benchmark has a leaderboard showing all submitted results ranked by performance.
Understanding Leaderboards
Each leaderboard on Papers With Code displays:
- Rank: Position based on the primary evaluation metric (higher is better for most metrics; lower is better for error rates such as WER)
- Model name: The method or model that produced the result
- Score: The metric value (accuracy, F1, BLEU, mAP, etc.)
- Paper: Link to the paper describing the method
- Code: Link to the implementation if available
- Date: When the result was reported
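The fields above can be sketched as a small record type plus a ranking function. This is only an illustrative model of a leaderboard row, not the site's actual schema: the `Entry` class, the `rank_entries` helper, the `example.org` URLs, and the scores are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical field names)."""
    model: str
    score: float
    paper_url: str
    code_url: Optional[str]  # None when no implementation is linked
    reported: date


def rank_entries(entries, higher_is_better=True):
    """Return (rank, entry) pairs ordered by score.

    Set higher_is_better=False for error metrics such as WER.
    """
    ordered = sorted(entries, key=lambda e: e.score, reverse=higher_is_better)
    return list(enumerate(ordered, start=1))


# Made-up entries for illustration only; these are not real results.
entries = [
    Entry("model-a", 81.2, "https://example.org/a", None, date(2021, 3, 1)),
    Entry("model-b", 84.5, "https://example.org/b",
          "https://example.org/b-code", date(2022, 6, 15)),
    Entry("model-c", 79.9, "https://example.org/c",
          "https://example.org/c-code", date(2020, 11, 9)),
]

ranked = rank_entries(entries)
for rank, e in ranked:
    print(rank, e.model, e.score, "code" if e.code_url else "no code")
```

Passing `higher_is_better=False` reverses the order, which is what an error-rate leaderboard would need.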
Popular Benchmark Categories
| Category | Example Benchmarks | Key Metrics |
|---|---|---|
| Image Classification | ImageNet, CIFAR-10/100 | Top-1 / Top-5 Accuracy |
| Object Detection | COCO, Pascal VOC | mAP, AP50 |
| NLP Understanding | GLUE, SuperGLUE, SQuAD | Accuracy, F1, EM |
| Machine Translation | WMT, FLORES | BLEU, chrF |
| LLM Evaluation | MMLU, HumanEval, GSM8K | Accuracy, pass@k |
| Speech | LibriSpeech, Common Voice | WER (Word Error Rate) |
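Two of the metrics in the table are easy to state in code. The sketch below assumes WER is computed as word-level edit distance divided by the number of reference words, and that pass@k is estimated from n samples with c passing as 1 - C(n-c, k) / C(n, k), a commonly used estimator. Individual benchmarks may apply their own text normalization or sampling rules, so treat the function names and details here as illustrative.

```python
from math import comb


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample
    # is C(n - c, k) / C(n, k); math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
print(pass_at_k(n=10, c=5, k=1))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, which is one reason it is reported as an error rate rather than an accuracy.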
How to Use Benchmarks Effectively
Identify Your Task
Start by finding the task that matches your problem (e.g., "Semantic Segmentation" or "Text Summarization"). Each task page lists all relevant benchmarks.
Compare Results
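Look at the leaderboard to see which methods perform best. Pay attention not just to the top score but also to the trend over time and the gap between methods: a long run of small gains may indicate that a benchmark is saturating, and a difference of a fraction of a point between two methods may not be meaningful in practice.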
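One way to look at the trend and the gap is to reduce a list of dated results to the sequence of record-setting entries and then compare the top two scores. The sketch below assumes a higher-is-better metric; the `results` data and the `sota_progression` helper are hypothetical and do not come from any real leaderboard.

```python
from datetime import date

# Made-up (date, model, score) tuples for illustration only.
results = [
    (date(2019, 5, 1), "model-a", 76.0),
    (date(2020, 2, 1), "model-b", 75.1),
    (date(2020, 9, 1), "model-c", 80.3),
    (date(2021, 7, 1), "model-d", 80.9),
]


def sota_progression(results):
    """Return the entries that set a new best score, in date order."""
    best = None
    progression = []
    for when, model, score in sorted(results, key=lambda r: r[0]):
        if best is None or score > best:
            best = score
            progression.append((when, model, score))
    return progression


progression = sota_progression(results)
for when, model, score in progression:
    print(when.isoformat(), model, score)

# Gap between the best and second-best scores overall.
top_two = sorted((score for _, _, score in results), reverse=True)[:2]
gap = top_two[0] - top_two[1]
print(f"gap between top two: {gap:.1f}")
```

Here `model-b` never appears in the progression because it did not beat the score that was already on the board when it was reported.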
Check for Code
Prioritize methods that have code available. An implementation you can run is worth more than a paper you cannot reproduce.
Consider Trade-offs
The highest-scoring model may require enormous compute. Look at efficiency metrics, model size, and inference speed alongside accuracy.
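A simple way to reason about these trade-offs is to keep only the models that are not dominated, meaning no other model is at least as accurate and at least as cheap while being strictly better on one of the two. The sketch below assumes a single cost number per model (it could stand for parameters, latency, or FLOPs); the model names and values are hypothetical.

```python
def pareto_front(models):
    """Return models for which no other model is both at least as
    accurate and at least as cheap, with one strict improvement."""
    front = []
    for m in models:
        dominated = any(
            o["accuracy"] >= m["accuracy"]
            and o["cost"] <= m["cost"]
            and (o["accuracy"] > m["accuracy"] or o["cost"] < m["cost"])
            for o in models
        )
        if not dominated:
            front.append(m)
    return front


# Made-up models for illustration; "cost" is an arbitrary unit.
models = [
    {"name": "small", "accuracy": 78.0, "cost": 25},
    {"name": "medium", "accuracy": 82.0, "cost": 90},
    {"name": "bloated", "accuracy": 81.0, "cost": 300},
    {"name": "large", "accuracy": 84.0, "cost": 600},
]

front = pareto_front(models)
for m in front:
    print(m["name"], m["accuracy"], m["cost"])
```

In this made-up data, `bloated` is dropped because `medium` is both more accurate and cheaper; the remaining three models each represent a different point on the accuracy-versus-cost curve.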
Reading Leaderboards Critically
Not all leaderboard results are equally reliable. Watch out for:
- Evaluation protocol differences: Some papers use different data splits or preprocessing, so their scores are not directly comparable
- Unreproduced results: Results without code may be difficult to verify
- Ensemble vs single model: Some entries use ensembles that are impractical in production
- Extra training data: Some models use additional data beyond the standard training set
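The checks above can be applied as a filter before comparing scores. The sketch below assumes you have annotated each entry yourself with boolean flags after reading the paper; the `AnnotatedEntry` fields and the `comparable` helper are hypothetical, and a leaderboard will not necessarily expose this information directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedEntry:
    """A leaderboard entry with hand-annotated caveats (hypothetical)."""
    model: str
    score: float
    has_code: bool
    is_ensemble: bool
    uses_extra_data: bool
    standard_split: bool


def comparable(entries, require_code=True):
    """Keep single-model entries trained and evaluated under the
    standard protocol, ranked by score (higher is better)."""
    kept = []
    for e in entries:
        if e.is_ensemble or e.uses_extra_data or not e.standard_split:
            continue
        if require_code and not e.has_code:
            continue
        kept.append(e)
    return sorted(kept, key=lambda e: e.score, reverse=True)


# Made-up annotations for illustration only.
annotated = [
    AnnotatedEntry("model-a", 88.1, True, True, False, True),    # ensemble
    AnnotatedEntry("model-b", 87.4, False, False, False, True),  # no code
    AnnotatedEntry("model-c", 86.9, True, False, True, True),    # extra data
    AnnotatedEntry("model-d", 85.2, True, False, False, True),
    AnnotatedEntry("model-e", 84.0, True, False, False, False),  # custom split
]

for e in comparable(annotated):
    print(e.model, e.score)
```

With `require_code=False` the filter also keeps `model-b`, which is useful when you want a like-for-like comparison but are willing to trust results you cannot rerun.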