Intermediate

Datasets on Papers With Code

Explore the comprehensive dataset repository on Papers With Code. Learn to find, evaluate, and select the right datasets for your machine learning projects.

The Dataset Repository

Papers With Code hosts metadata for over 7,000 datasets spanning most areas of machine learning. Each dataset entry includes:

  • Description: What the dataset contains and its intended use
  • Statistics: Size, number of samples, splits, and dimensions
  • Papers: Research papers that introduced or use the dataset
  • Benchmarks: Leaderboards that evaluate models on this dataset
  • License: Usage rights and restrictions
  • Download links: Where to obtain the actual data
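
If you track candidate datasets in your own notes or scripts, the fields above map naturally onto a small record type. The following is a minimal sketch, not the site's schema: the class name DatasetEntry, its field names, and the example values are all hypothetical and chosen only to mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical record mirroring the fields of a dataset entry."""
    name: str
    description: str
    num_samples: int                                        # Statistics: total sample count
    splits: dict[str, int] = field(default_factory=dict)    # split name -> sample count
    papers: list[str] = field(default_factory=list)         # papers that introduced or use it
    benchmarks: list[str] = field(default_factory=list)     # leaderboards built on it
    license: str = "unknown"
    download_url: str = ""

    def missing_fields(self) -> list[str]:
        """Return the names of optional fields that are still empty or unknown."""
        missing = []
        if not self.splits:
            missing.append("splits")
        if not self.papers:
            missing.append("papers")
        if not self.benchmarks:
            missing.append("benchmarks")
        if self.license == "unknown":
            missing.append("license")
        if not self.download_url:
            missing.append("download_url")
        return missing


# Placeholder values for illustration only; "ExampleSet" is not a real dataset.
entry = DatasetEntry(
    name="ExampleSet",
    description="Toy sentiment dataset used only for illustration.",
    num_samples=1000,
    splits={"train": 800, "validation": 100, "test": 100},
    papers=["Hypothetical paper introducing ExampleSet"],
)
print(entry.missing_fields())
```

Listing the missing fields is a quick way to spot entries whose license or download location you still need to confirm before committing to a dataset.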

Finding the Right Dataset

Use these strategies to find datasets that match your needs:

Browse by Task

Navigate to a specific task (e.g., Sentiment Analysis, Object Detection) to see the datasets commonly used for that problem. Choosing from this list makes it more likely that you are working with a standard benchmark, so your results can be compared with other work.

Search by Modality

Filter datasets by data type: text, images, audio, video, tabular, graphs, or multimodal combinations.

Sort by Popularity

Sort results by the number of papers that use each dataset. Datasets used by more papers are generally better documented, have more published baselines, and are more widely accepted by the research community.
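
The three strategies above can be combined. The sketch below assumes you have already collected dataset metadata into a list of dictionaries; the records, the key names (task, modality, num_papers), and the function find_datasets are hypothetical placeholders, not part of any Papers With Code interface.

```python
# Hypothetical records; names and paper counts are illustrative placeholders.
datasets = [
    {"name": "toy-reviews", "task": "Sentiment Analysis", "modality": "text", "num_papers": 120},
    {"name": "toy-tweets", "task": "Sentiment Analysis", "modality": "text", "num_papers": 45},
    {"name": "toy-scenes", "task": "Object Detection", "modality": "images", "num_papers": 300},
    {"name": "toy-clips", "task": "Sentiment Analysis", "modality": "video", "num_papers": 10},
]


def find_datasets(records, task=None, modality=None):
    """Filter by task and/or modality, then sort by paper count, most used first."""
    matches = [
        r for r in records
        if (task is None or r["task"] == task)
        and (modality is None or r["modality"] == modality)
    ]
    return sorted(matches, key=lambda r: r["num_papers"], reverse=True)


for record in find_datasets(datasets, task="Sentiment Analysis", modality="text"):
    print(record["name"], record["num_papers"])
```

Leaving task or modality as None skips that filter, so the same function also covers browsing a single task or ranking everything by popularity.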

Dataset Cards

A dataset card is the detailed page for each dataset. When evaluating a dataset, pay attention to:

Factor      What to Look For
Size        Is it large enough for your model? Too large for your compute budget?
Quality     Is it well-curated with consistent labels? Are there known issues?
Recency     When was it created? Is it still relevant to current problems?
Diversity   Does it represent the real-world distribution you will encounter?
Splits      Are standard train/validation/test splits defined?
Baselines   Are there published baselines you can compare against?

Licensing Considerations

Always check the license before using a dataset. Common license types include:

  • CC BY: Free to use with attribution
  • CC BY-NC: Free for non-commercial use only, with attribution
  • CC BY-SA: Free to use with attribution, but derivatives must be shared under the same license
  • Research-only: Restricted to academic research purposes
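  • Custom licenses: Read the terms carefully, especially for commercial use

Tip for commercial projects: If you plan to use a dataset for commercial purposes, verify the license explicitly. Many popular research datasets restrict commercial use. Openly distributed collections such as Common Crawl (for text) or LAION (for images) are often suggested as alternatives, but check their terms as well: both are built from web content, and the license on the collection or its metadata does not necessarily cover the underlying pages or images.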
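
A first-pass license screen can be automated before you read the full terms. This is a minimal sketch under strong simplifying assumptions: the mapping covers only the license families listed above, the labels are matched as exact strings, and the names COMMERCIAL_USE and commercial_use_allowed are hypothetical. It does not replace reading the actual license.

```python
# Simplified mapping from the license families listed above to whether the
# license permits commercial use. None means "read the terms yourself".
COMMERCIAL_USE = {
    "CC BY": True,
    "CC BY-SA": True,        # allowed, but derivatives must keep the same license
    "CC BY-NC": False,
    "Research-only": False,
    "Custom": None,
}


def commercial_use_allowed(license_name):
    """Return True, False, or None (manual review needed) for a license label."""
    # Labels not in the mapping also fall through to None rather than a guess.
    return COMMERCIAL_USE.get(license_name.strip())


for label in ["CC BY", "CC BY-NC", "Custom", "Apache-2.0"]:
    print(label, commercial_use_allowed(label))
```

Returning None for custom or unrecognized labels keeps the screen conservative: anything it cannot classify is flagged for a manual read instead of being silently approved.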

Popular Datasets by Domain

  • Computer Vision: ImageNet, COCO, ADE20K, Cityscapes, CelebA
  • NLP: SQuAD, GLUE, WMT, CNN/DailyMail, WikiText
  • Audio: LibriSpeech, Common Voice, AudioSet, VoxCeleb
  • Multimodal: LAION-5B, Conceptual Captions, VQA, Flickr30k
  • Tabular: UCI ML Repository datasets, Kaggle competition datasets