Where to Find Datasets (Intermediate)
A guide to the major platforms and sources for finding machine learning datasets, from the largest hubs to specialized repositories and live data APIs.
Hugging Face Datasets Hub
One of the largest collections of ML datasets, with 200,000+ hosted on the Hub. It integrates with the datasets library for easy loading in Python and supports filtering by task, size, language, and license.
    from datasets import load_dataset

    # Load any dataset by name
    dataset = load_dataset("imdb")

    # Load a specific split
    train = load_dataset("squad", split="train")

    # Stream a large dataset without downloading it in full
    # (C4 is published on the Hub under the allenai namespace)
    dataset = load_dataset("allenai/c4", "en", streaming=True)
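With streaming=True, load_dataset returns an iterable instead of an in-memory table, so a common first step is to inspect a handful of records. The following is a minimal sketch, assuming the stream yields one dict per example. Both `peek` and `fake_stream` are hypothetical helpers written for this illustration, not part of the datasets library.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def peek(stream: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records, pulling no more than n items from the stream."""
    return list(islice(stream, n))


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Stand-in for a streamed dataset so the sketch runs offline."""
    i = 0
    while True:  # endless, like a corpus too large to materialize
        yield {"id": i, "text": f"example {i}"}
        i += 1


first = peek(fake_stream(), n=2)
print(first)
```

With the datasets library installed, the same helper should work on a real stream, for example `peek(load_dataset("imdb", split="train", streaming=True))`.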
Kaggle Datasets
250,000+ datasets uploaded by the community. Features notebooks, competitions, and discussions. Many datasets come with starter code and exploratory analysis.
    # Install the Kaggle CLI
    $ pip install kaggle

    # Download a dataset
    $ kaggle datasets download -d zillow/zecon

    # Search for datasets
    $ kaggle datasets list -s "sentiment analysis"
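The CLI can also be driven from a Python script. The sketch below wraps the download command shown above with subprocess and checks for the API token file the CLI reads by default (~/.kaggle/kaggle.json). The function names are hypothetical, and the token check is only advisory, since credentials can also be supplied through environment variables.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def kaggle_download_command(dataset: str, dest: str = "data") -> List[str]:
    """Build the CLI call that downloads a dataset into dest and unzips it."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", dest, "--unzip"]


def has_kaggle_token(home: Optional[Path] = None) -> bool:
    """Report whether the default API token file exists under the home directory."""
    home = home or Path.home()
    return (home / ".kaggle" / "kaggle.json").is_file()


def download(dataset: str, dest: str = "data") -> None:
    """Run the download; raises CalledProcessError if the CLI exits with an error."""
    if not has_kaggle_token():
        print("No ~/.kaggle/kaggle.json found; the CLI may fail to authenticate.")
    subprocess.run(kaggle_download_command(dataset, dest), check=True)


# Only builds the command here; call download(...) to actually fetch the data.
print(kaggle_download_command("zillow/zecon"))
```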
UCI Machine Learning Repository
600+ datasets spanning classification, regression, clustering, and more. One of the longest-running ML dataset repositories, maintained since 1987.
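Many classic UCI datasets are distributed as comma-separated .data files with no header row, with the column names documented separately. The sketch below parses such text using only the standard library. The column names are an assumption chosen for illustration, and the sample rows follow the format of the Iris file and are inlined so the example runs offline.

```python
import csv
import io
from typing import Dict, List

# Column names are assumed for illustration; UCI documents them
# separately from the headerless data file.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# A few rows in the Iris file's format, including a stray blank line.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor

6.3,3.3,6.0,2.5,Iris-virginica
"""


def parse_headerless(text: str, columns: List[str]) -> List[Dict[str, object]]:
    """Parse headerless CSV text: last field is a string label, the rest are floats."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        if not raw:  # skip blank lines
            continue
        if len(raw) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(raw)}")
        *features, label = raw
        values = [float(v) for v in features] + [label]
        rows.append(dict(zip(columns, values)))
    return rows


records = parse_headerless(SAMPLE, COLUMNS)
print(records[0])
```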
Google Dataset Search
A search engine specifically for datasets. Indexes datasets from across the web, including government portals, academic repositories, and data publishers. Available at datasetsearch.research.google.com.
AWS Open Data
Amazon's registry of open datasets hosted on AWS. Includes large-scale datasets such as satellite imagery, genomics data, and weather data. The data is free to access and works with standard AWS tools, though data transfer (egress) charges can apply.
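Many, though not all, buckets in the registry allow anonymous reads. The sketch below shows one way to list objects without credentials using boto3 with unsigned requests. It assumes boto3 is installed and that the target bucket permits anonymous access; the bucket name in the example is hypothetical.

```python
from typing import List, Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix' into (bucket, prefix)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, prefix


def list_public_keys(uri: str, limit: int = 10) -> List[str]:
    """List up to `limit` object keys under a public prefix, without credentials."""
    # Imported lazily so split_s3_uri works even when boto3 is not installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, prefix = split_s3_uri(uri)
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=limit)
    return [obj["Key"] for obj in resp.get("Contents", [])]


# Hypothetical bucket name, used only to show the parsing step.
print(split_s3_uri("s3://example-open-data/imagery/2024/"))
```

To list a real dataset, pass the S3 URI given on its registry page to `list_public_keys`.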
Microsoft Open Datasets
Curated datasets hosted on Azure, optimized for use with Azure ML. Includes weather, demographics, holidays, and public safety data.
Papers With Code
Links datasets to the papers and models that use them. Shows benchmark leaderboards, making it easy to find the standard dataset for any ML task along with state-of-the-art results.
Web Scraping (Ethical)
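One common courtesy check before scraping a site is to consult its robots.txt file. The sketch below uses the standard library's urllib.robotparser for that check; it is an illustration of a single practice, not a complete policy. The robots.txt content and the user-agent name are made up for the example and inlined so it runs offline.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical), inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "dataset-collector"  # hypothetical user-agent name


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)
allowed = parser.can_fetch(AGENT, "https://example.com/articles/1")
blocked = parser.can_fetch(AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(AGENT)  # seconds to wait between requests, if declared
print(allowed, blocked, delay)
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the real file. A site's terms of service and the license of its content may impose further limits that robots.txt does not express.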
APIs for Live Data
| API | Data Type | Python Library |
|---|---|---|
| Twitter/X API | Social media posts | tweepy |
| Reddit API | Forum discussions | praw |
| Yahoo Finance | Stock prices | yfinance |
| OpenWeatherMap | Weather data | requests |
| News API | News articles | newsapi-python |
| Spotify API | Music metadata | spotipy |
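Most of these APIs follow the same pattern: obtain an API key, send a request with query parameters, and parse the JSON response. The sketch below illustrates that pattern for OpenWeatherMap's current-weather endpoint with requests, as listed in the table. The helper names are hypothetical, and YOUR_API_KEY is a placeholder for a key from your own account.

```python
from typing import Any, Dict
from urllib.parse import urlencode

# Current-weather endpoint of the OpenWeatherMap API.
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def weather_params(city: str, api_key: str, units: str = "metric") -> Dict[str, str]:
    """Build the query parameters for a current-weather lookup by city name."""
    return {"q": city, "appid": api_key, "units": units}


def fetch_weather(city: str, api_key: str) -> Dict[str, Any]:
    """Call the API and return the decoded JSON; raises on HTTP errors."""
    import requests  # imported lazily so the URL helper runs without it

    resp = requests.get(BASE_URL, params=weather_params(city, api_key), timeout=10)
    resp.raise_for_status()
    return resp.json()


# Show the request URL without touching the network.
print(f"{BASE_URL}?{urlencode(weather_params('London', 'YOUR_API_KEY'))}")
```

Rate limits and terms of use differ between providers, so check each API's documentation before collecting data at scale.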
Comparison of Dataset Sources
| Source | Datasets | Best For | API/CLI | Cost |
|---|---|---|---|---|
| Hugging Face | 200K+ | NLP, ML research | Yes | Free |
| Kaggle | 250K+ | Competitions, tabular | Yes | Free |
| UCI | 600+ | Classic ML benchmarks | No | Free |
| Google Dataset Search | 25M+ | Discovery | No | Free |
| AWS Open Data | 400+ | Large-scale data | AWS CLI | Free (egress costs) |
| Papers With Code | 5K+ | Research benchmarks | No | Free |
Next Up
Learn how to create your own datasets from scratch with data collection, annotation, and quality control.
Lilly Tech Systems