Datasets on Papers With Code
Explore the comprehensive dataset repository on Papers With Code. Learn to find, evaluate, and select the right datasets for your machine learning projects.
The Dataset Repository
Papers With Code hosts metadata for over 7,000 datasets spanning most areas of machine learning. Each dataset entry includes:
- Description: What the dataset contains and its intended use
- Statistics: Size, number of samples, splits, and dimensions
- Papers: Research papers that introduced or use the dataset
- Benchmarks: Leaderboards that evaluate models on this dataset
- License: Usage rights and restrictions
- Download links: Where to obtain the actual data
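If you keep your own notes on candidate datasets, it can help to mirror these fields in a small record type. The following is a minimal sketch, not an official Papers With Code schema: the class name `DatasetEntry`, its field names, and the `missing_fields` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetEntry:
    """Hypothetical record mirroring the entry fields listed above."""
    name: str
    description: str
    num_samples: Optional[int] = None                    # Statistics
    splits: List[str] = field(default_factory=list)      # Statistics
    paper_count: int = 0                                 # Papers
    benchmarks: List[str] = field(default_factory=list)  # Benchmarks
    license: Optional[str] = None                        # License
    download_urls: List[str] = field(default_factory=list)  # Download links

    def missing_fields(self) -> List[str]:
        """Return the names of fields that still need to be filled in."""
        checks = {
            "num_samples": self.num_samples is None,
            "splits": not self.splits,
            "license": self.license is None,
            "download_urls": not self.download_urls,
        }
        return [name for name, is_missing in checks.items() if is_missing]


# Placeholder entry; the values are invented for illustration.
entry = DatasetEntry(
    name="example-dataset",
    description="Placeholder entry for illustration.",
    num_samples=1000,
    splits=["train", "test"],
)
print(entry.missing_fields())  # ['license', 'download_urls']
```

A record like this makes gaps visible early, for example a dataset whose license or download location you have not yet confirmed.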
Finding the Right Dataset
Use these strategies to find datasets that match your needs:
Browse by Task
Navigate to a specific task (e.g., Sentiment Analysis, Object Detection) to see the datasets commonly used for that problem. Starting from the task page makes it more likely that you pick a standard benchmark, so your results can be compared with other work.
Search by Modality
Filter datasets by data type: text, images, audio, video, tabular, graphs, or multimodal combinations.
Sort by Popularity
Sort results by the number of papers that use each dataset. Datasets used by more papers are generally better documented, have more published baselines, and are more widely accepted by the research community.
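The three strategies above combine naturally: narrow by task, narrow by modality, then rank by paper count. The sketch below applies them to a local list of metadata records. It is illustrative only: `CATALOG`, `find_datasets`, the record keys, and every name and count in the sample data are hypothetical, and nothing here calls a Papers With Code API.

```python
from typing import Dict, List, Optional

# Placeholder records; names and paper counts are invented for illustration.
CATALOG: List[Dict] = [
    {"name": "dataset-a", "task": "sentiment-analysis", "modalities": {"text"}, "paper_count": 120},
    {"name": "dataset-b", "task": "object-detection", "modalities": {"images"}, "paper_count": 340},
    {"name": "dataset-c", "task": "sentiment-analysis", "modalities": {"text", "audio"}, "paper_count": 15},
    {"name": "dataset-d", "task": "sentiment-analysis", "modalities": {"text"}, "paper_count": 480},
]


def find_datasets(
    catalog: List[Dict],
    task: Optional[str] = None,
    modality: Optional[str] = None,
) -> List[Dict]:
    """Filter by task and modality, then sort by paper count (most used first)."""
    matches = [
        d for d in catalog
        if (task is None or d["task"] == task)
        and (modality is None or modality in d["modalities"])
    ]
    return sorted(matches, key=lambda d: d["paper_count"], reverse=True)


ranked = find_datasets(CATALOG, task="sentiment-analysis", modality="text")
print([d["name"] for d in ranked])  # ['dataset-d', 'dataset-a', 'dataset-c']
```

Leaving `task` or `modality` as `None` skips that filter, so the same function covers browsing by task alone or by modality alone.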
Dataset Cards
A dataset card is the detailed page for each dataset. When evaluating a dataset, pay attention to:
| Factor | What to Look For |
|---|---|
| Size | Is it large enough for your model? Too large for your compute budget? |
| Quality | Is it well-curated with consistent labels? Are there known issues? |
| Recency | When was it created? Is it still relevant to current problems? |
| Diversity | Does it represent the real-world distribution you will encounter? |
| Splits | Are standard train/validation/test splits defined? |
| Baselines | Are there published baselines you can compare against? |
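Some of these factors (size, splits, baselines) can be checked mechanically once you have copied the relevant numbers from a dataset card; quality, recency, and diversity still need human judgment. The following sketch shows one possible first-pass check. The function `review_card`, the dictionary keys, and the thresholds are assumptions for illustration, and should be adjusted to your own model and compute budget.

```python
from typing import Dict, List

# Assumed convention for split names; real datasets may name them differently.
REQUIRED_SPLITS = ("train", "validation", "test")


def review_card(card: Dict, min_samples: int, max_samples: int) -> List[str]:
    """Return a list of concerns for the size, splits, and baselines factors."""
    concerns = []

    n = card.get("num_samples")
    if n is None:
        concerns.append("size: sample count not reported")
    elif n < min_samples:
        concerns.append("size: fewer samples than the minimum")
    elif n > max_samples:
        concerns.append("size: more samples than the compute budget allows")

    missing = [s for s in REQUIRED_SPLITS if s not in card.get("splits", [])]
    if missing:
        concerns.append("splits: missing " + ", ".join(missing))

    if not card.get("baselines"):
        concerns.append("baselines: none listed")

    return concerns


# Placeholder card; the values are invented for illustration.
card = {"num_samples": 5000, "splits": ["train", "test"], "baselines": []}
for concern in review_card(card, min_samples=10_000, max_samples=1_000_000):
    print(concern)
# size: fewer samples than the minimum
# splits: missing validation
# baselines: none listed
```

An empty result means only that these three checks passed, not that the dataset is suitable overall.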
Licensing Considerations
Always check the license before using a dataset. Common license types include:
- CC BY: Free to use, including commercially, with attribution
- CC BY-NC: Free for non-commercial use only, with attribution
- CC BY-SA: Free to use with attribution, but derivatives must be shared under the same license
- Research-only: Restricted to academic research purposes
- Custom licenses: Read the terms carefully, especially for commercial use
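When screening many candidate datasets, a coarse lookup can flag which ones need a closer look before commercial use. The sketch below encodes only the license types listed above. The function name `allows_commercial_use` and the label strings are assumptions for illustration, and the license text itself remains the authoritative source.

```python
from typing import Optional

# Coarse summary of the license types listed above; keys are normalized to upper case.
COMMERCIAL_USE = {
    "CC BY": True,
    "CC BY-NC": False,
    "CC BY-SA": True,   # allowed, but derivatives must carry the same license
    "RESEARCH-ONLY": False,
}


def allows_commercial_use(license_name: Optional[str]) -> Optional[bool]:
    """Return True or False for known license types, or None when the terms must be read manually."""
    if license_name is None:
        return None
    return COMMERCIAL_USE.get(license_name.strip().upper())


print(allows_commercial_use("CC BY"))           # True
print(allows_commercial_use("cc by-nc"))        # False
print(allows_commercial_use("Custom license"))  # None
```

Returning `None` for custom or unrecognized licenses keeps the "read the terms carefully" case distinct from a clear yes or no.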
Popular Datasets by Domain
- Computer Vision: ImageNet, COCO, ADE20K, Cityscapes, CelebA
- NLP: SQuAD, GLUE, WMT, CNN/DailyMail, WikiText
- Audio: LibriSpeech, Common Voice, AudioSet, VoxCeleb
- Multimodal: LAION-5B, Conceptual Captions, VQA, Flickr30k
- Tabular: UCI ML Repository datasets, Kaggle competition datasets