Intermediate

Datasets on Papers With Code

Explore the comprehensive dataset repository on Papers With Code. Learn to find, evaluate, and select the right datasets for your machine learning projects.

The Dataset Repository

Papers With Code hosts metadata for over 7,000 datasets spanning most areas of machine learning. Each dataset entry includes:

Description: What the dataset contains and its intended use
Statistics: Size, number of samples, splits, and dimensions
Papers: Research papers that introduced or use the dataset
Benchmarks: Leaderboards that evaluate models on this dataset
License: Usage rights and restrictions
Download links: Where to obtain the actual data
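
As a rough illustration, the fields above could be modeled as a small record type. This is a minimal sketch: the class name `DatasetEntry`, its field names, and the placeholder values are assumptions made for this example, not the site's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical record mirroring the fields listed above."""
    name: str
    description: str                                  # what the dataset contains
    num_samples: int                                  # total number of samples
    splits: dict = field(default_factory=dict)        # split name -> sample count
    papers: list = field(default_factory=list)        # titles of related papers
    benchmarks: list = field(default_factory=list)    # leaderboard names
    license: str = "unknown"                          # usage rights
    download_url: str = ""                            # where to obtain the data


# Placeholder values only; this is not a real dataset.
entry = DatasetEntry(
    name="toy-dataset",
    description="Illustrative entry for a text classification dataset",
    num_samples=1000,
    splits={"train": 800, "validation": 100, "test": 100},
)
print(entry.name, sum(entry.splits.values()))
```

Defaulting `license` to "unknown" makes a missing license visible rather than silently treating the dataset as free to use.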

Finding the Right Dataset

Use these strategies to find datasets that match your needs:

Browse by Task

Navigate to a specific task (e.g., Sentiment Analysis, Object Detection) to see the datasets commonly used for that problem. Choosing from these helps ensure you evaluate on standard benchmarks, so your results can be compared with other work.

Search by Modality

Filter datasets by data type: text, images, audio, video, tabular, graphs, or multimodal combinations.

Sort by Popularity

Datasets used by more papers are generally better documented, have more baselines available, and are more widely accepted by the research community.
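
The three strategies above can be combined. The sketch below assumes you already have dataset metadata in a local list of dictionaries; the function `find_datasets`, the keys `tasks`, `modalities`, and `paper_count`, and the catalog entries are all hypothetical placeholders rather than a real API or real figures.

```python
def find_datasets(catalog, task=None, modality=None):
    """Filter by task and modality, then sort by paper count (most used first)."""
    matches = [
        d for d in catalog
        if (task is None or task in d["tasks"])
        and (modality is None or modality in d["modalities"])
    ]
    return sorted(matches, key=lambda d: d["paper_count"], reverse=True)


# Placeholder catalog; names and counts are invented for illustration.
catalog = [
    {"name": "dataset-a", "tasks": ["Sentiment Analysis"],
     "modalities": ["text"], "paper_count": 120},
    {"name": "dataset-b", "tasks": ["Object Detection"],
     "modalities": ["images"], "paper_count": 300},
    {"name": "dataset-c", "tasks": ["Sentiment Analysis"],
     "modalities": ["text", "audio"], "paper_count": 450},
]

results = find_datasets(catalog, task="Sentiment Analysis", modality="text")
print([d["name"] for d in results])
```

Leaving `task` or `modality` as `None` skips that filter, so the same function covers browsing by task alone or by modality alone.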

Dataset Cards

A dataset card is the detailed page for each dataset. When evaluating a dataset, pay attention to:

Factor	What to Look For
Size	Is it large enough for your model? Too large for your compute budget?
Quality	Is it well-curated with consistent labels? Are there known issues?
Recency	When was it created? Is it still relevant to current problems?
Diversity	Does it represent the real-world distribution you will encounter?
Splits	Are standard train/validation/test splits defined?
Baselines	Are there published baselines you can compare against?
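
Some of these checks are easy to automate once the card's details are written down. The following is a minimal sketch that covers only the Size, Splits, and Baselines rows; the function `review_card`, the dictionary keys, and the sample budget are assumptions for illustration. Quality, Recency, and Diversity still need human judgment.

```python
STANDARD_SPLITS = {"train", "validation", "test"}


def review_card(card, max_samples):
    """Return a list of concerns found in a (hypothetical) dataset card dict."""
    concerns = []
    if card["num_samples"] > max_samples:
        concerns.append("too large for compute budget")
    missing = STANDARD_SPLITS - set(card["splits"])
    if missing:
        concerns.append("missing splits: " + ", ".join(sorted(missing)))
    if not card["baselines"]:
        concerns.append("no published baselines")
    return concerns


# Placeholder card; values are invented for illustration.
card = {
    "num_samples": 50_000,
    "splits": ["train", "test"],
    "baselines": [],
}
print(review_card(card, max_samples=100_000))
```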

Licensing Considerations

Always check the license before using a dataset. Common license types include:

CC BY: Free to use, including commercially, with attribution
CC BY-NC: Free for non-commercial use only, with attribution
CC BY-SA: Free to use with attribution, but derivatives must be shared under the same license
Research-only: Restricted to academic research purposes
Custom licenses: Read the terms carefully, especially for commercial use

💡

Commercial projects: If you plan to use a dataset commercially, verify the license explicitly. Many popular research datasets restrict commercial use. Large web-scraped collections such as Common Crawl (text) or LAION (image-text pairs) are often suggested as alternatives, but the underlying pages and images remain subject to their original copyright, so review their terms of use as well.
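
A first-pass license screen can be written as a lookup table. This is a simplified sketch, not a substitute for reading the license text: the table `LICENSE_TERMS` and the function `allows_commercial_use` are names invented for this example, and the flags encode only the summary given in the list above.

```python
# Simplified flags for the license types listed above.
LICENSE_TERMS = {
    "CC BY": {"commercial_use": True, "share_alike": False},
    "CC BY-NC": {"commercial_use": False, "share_alike": False},
    "CC BY-SA": {"commercial_use": True, "share_alike": True},
    "Research-only": {"commercial_use": False, "share_alike": False},
}


def allows_commercial_use(license_name):
    """Return True or False for known licenses, or None when the terms
    are custom or unknown and must be read manually."""
    terms = LICENSE_TERMS.get(license_name)
    return None if terms is None else terms["commercial_use"]


for name in ["CC BY", "CC BY-NC", "Custom"]:
    print(name, allows_commercial_use(name))
```

Returning `None` for unrecognized licenses, rather than `False`, keeps "not allowed" distinct from "not yet checked".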

Popular Datasets by Domain

Computer Vision: ImageNet, COCO, ADE20K, Cityscapes, CelebA
NLP: SQuAD, GLUE, WMT, CNN/DailyMail, WikiText
Audio: LibriSpeech, Common Voice, AudioSet, VoxCeleb
Multimodal: LAION-5B, Conceptual Captions, VQA, Flickr30k
Tabular: UCI ML Repository datasets, Kaggle competition datasets
