Intermediate

Benchmarks & SOTA Tracking

Understand how Papers With Code tracks state-of-the-art results across thousands of benchmarks, and how to use leaderboards to evaluate and compare ML models.

What Are Benchmarks?

A benchmark is a standardized evaluation setup consisting of a specific dataset, task, and metric. Benchmarks allow fair comparison between different methods. On Papers With Code, each benchmark has a leaderboard showing all submitted results ranked by performance.

Understanding Leaderboards

Each leaderboard on Papers With Code displays:

  • Rank: Position based on the primary evaluation metric
  • Model name: The method or model that produced the result
  • Score: The metric value (accuracy, F1, BLEU, mAP, etc.)
  • Paper: Link to the paper describing the method
  • Code: Link to the implementation if available
  • Date: When the result was reported

SOTA means state-of-the-art: the top-ranked result on a leaderboard is considered the current SOTA for that benchmark. Papers With Code tracks over 5,000 benchmarks across all areas of ML.
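
As a rough illustration of the leaderboard fields and the idea of SOTA, the sketch below models a leaderboard row and derives both the current ranking and the history of record-setting results. The `Entry` schema, the function names, and all entries are made up for this example; none of it is the Papers With Code data model or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical schema mirroring the fields listed above)."""
    model: str
    score: float              # value of the primary evaluation metric
    paper_url: str
    code_url: Optional[str]   # None when no implementation is linked
    reported: date


def rank(entries: List[Entry], higher_is_better: bool = True) -> List[Entry]:
    """Return entries sorted best-first on the primary metric."""
    return sorted(entries, key=lambda e: e.score, reverse=higher_is_better)


def sota_history(entries: List[Entry], higher_is_better: bool = True) -> List[Entry]:
    """Return the entries that set a new best score, in chronological order."""
    history: List[Entry] = []
    best: Optional[float] = None
    for e in sorted(entries, key=lambda e: e.reported):
        if best is None:
            improved = True
        elif higher_is_better:
            improved = e.score > best
        else:
            improved = e.score < best
        if improved:
            history.append(e)
            best = e.score
    return history


# Made-up entries for illustration only.
entries = [
    Entry("model-a", 76.1, "https://example.org/a", "https://example.org/a-code", date(2019, 3, 1)),
    Entry("model-b", 81.4, "https://example.org/b", None, date(2020, 6, 15)),
    Entry("model-c", 79.9, "https://example.org/c", "https://example.org/c-code", date(2021, 1, 10)),
    Entry("model-d", 84.0, "https://example.org/d", "https://example.org/d-code", date(2022, 9, 5)),
]

leaderboard = rank(entries)
current_sota = leaderboard[0]
print(current_sota.model)                        # model-d
print([e.model for e in sota_history(entries)])  # ['model-a', 'model-b', 'model-d']
```

For metrics where lower is better, such as word error rate, pass `higher_is_better=False` so that the smallest score ranks first.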

Popular Benchmark Categories

Category             | Example Benchmarks        | Key Metrics
Image Classification | ImageNet, CIFAR-10/100    | Top-1 / Top-5 Accuracy
Object Detection     | COCO, Pascal VOC          | mAP, AP50
NLP Understanding    | GLUE, SuperGLUE, SQuAD    | Accuracy, F1, EM
Machine Translation  | WMT, FLORES               | BLEU, chrF
LLM Evaluation       | MMLU, HumanEval, GSM8K    | Accuracy, pass@k
Speech               | LibriSpeech, Common Voice | WER (Word Error Rate)
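
Two of the less self-explanatory metrics in the table can be sketched in a few lines. The code below is a minimal illustration, not a reference implementation: `word_error_rate` uses plain whitespace tokenization (real benchmarks usually normalize text first), and `pass_at_k` follows the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k) for n generated samples of which c are correct. Both function names are chosen for this example.

```python
from math import comb


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given n generated samples of which c passed the tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, which is one reason it is reported as an error rate and not as an accuracy.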

How to Use Benchmarks Effectively

  1. Identify Your Task

    Start by finding the task that matches your problem (e.g., "Semantic Segmentation" or "Text Summarization"). Each task page lists all relevant benchmarks.

  2. Compare Results

    Look at the leaderboard to see which methods perform best. Pay attention not just to the top score but to the trend over time and the gap between methods.

  3. Check for Code

    Prioritize methods that have code available. An implementation you can run is worth more than a paper you cannot reproduce.

  4. Consider Trade-offs

    The highest-scoring model may require enormous compute. Look at efficiency metrics, model size, and inference speed alongside accuracy.
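
One way to make the trade-off in step 4 concrete is to keep only the models that are not beaten on every axis by some other model (a Pareto front). The sketch below assumes you have gathered accuracy, parameter count, and latency yourself; the `Candidate` fields and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float    # higher is better
    params_m: float    # model size in millions of parameters, lower is better
    latency_ms: float  # inference latency, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    at_least_as_good = (
        a.accuracy >= b.accuracy
        and a.params_m <= b.params_m
        and a.latency_ms <= b.latency_ms
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.params_m < b.params_m
        or a.latency_ms < b.latency_ms
    )
    return at_least_as_good and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Made-up numbers for illustration only.
candidates = [
    Candidate("huge", 90.0, 2000.0, 400.0),
    Candidate("large", 89.0, 300.0, 80.0),
    Candidate("medium", 86.5, 90.0, 25.0),
    Candidate("bloated", 86.0, 350.0, 90.0),
    Candidate("small", 80.0, 10.0, 5.0),
]

front = pareto_front(candidates)
print([c.name for c in front])  # ['huge', 'large', 'medium', 'small']
```

Here "bloated" drops out because "large" is more accurate, smaller, and faster. Which of the remaining models to choose still depends on your own compute and latency budget.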

Reading Leaderboards Critically

Not all leaderboard results are equally reliable. Watch out for:

  • Evaluation protocol differences: Some papers use different data splits or preprocessing
  • Unreproduced results: Results without code may be difficult to verify
  • Ensemble vs single model: Some entries use ensembles that are impractical in production
  • Extra training data: Some models use additional data beyond the standard training set

Practical advice: When selecting a model for a real project, also look two or three positions below the top of the leaderboard. These models often deliver close to the top score (around 95% is a rule of thumb, not a guarantee) at a fraction of the cost and complexity.
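
The checks above can be combined into a simple selection rule: discard entries that are hard to compare or verify, then take the cheapest remaining model that stays within 95% of the best remaining score. This is only a sketch under stated assumptions: the `Result` fields, the `relative_cost` estimate, and all entries are hypothetical, and the rule assumes a metric where higher is better.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    model: str
    score: float          # primary metric, higher is better
    has_code: bool        # an implementation is linked
    is_ensemble: bool     # result comes from an ensemble
    extra_data: bool      # trained on data beyond the standard training set
    relative_cost: float  # hypothetical cost estimate, 1.0 = cheapest


def comparable(results: List[Result]) -> List[Result]:
    """Keep single-model results with code and standard training data only."""
    return [
        r for r in results
        if r.has_code and not r.is_ensemble and not r.extra_data
    ]


def pick_practical(results: List[Result], min_fraction: float = 0.95) -> Result:
    """Cheapest comparable model scoring at least min_fraction of the best one."""
    pool = comparable(results)
    if not pool:
        raise ValueError("no comparable results left after filtering")
    best_score = max(r.score for r in pool)
    good_enough = [r for r in pool if r.score >= min_fraction * best_score]
    # Prefer lower cost; break ties in favor of the higher score.
    return min(good_enough, key=lambda r: (r.relative_cost, -r.score))


# Made-up entries for illustration only.
results = [
    Result("ensemble-x", 91.2, True, True, False, 40.0),
    Result("big-pretrained", 90.5, True, False, True, 25.0),
    Result("no-code", 90.0, False, False, False, 10.0),
    Result("single-l", 89.0, True, False, False, 8.0),
    Result("single-m", 87.0, True, False, False, 2.0),
    Result("single-s", 80.0, True, False, False, 1.0),
]

choice = pick_practical(results)
print(choice.model)  # single-m
```

The strict filter is a design choice for this example. In practice you might keep entries that use extra data if your project can use that data too, or lower `min_fraction` when cost matters more than the last few points of accuracy.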