Where to Find Datasets (Intermediate)

A guide to the major platforms and sources for finding machine learning datasets, from the largest hubs to specialized repositories and live data APIs.

Hugging Face Datasets Hub

One of the largest collections of ML datasets, with 200,000+ datasets listed. It integrates with the datasets Python library for easy loading and supports filtering by task, size, language, and license.

Python
from datasets import load_dataset

# Load any dataset by name
dataset = load_dataset("imdb")

# Load a specific split
train = load_dataset("squad", split="train")

# Stream a large dataset instead of downloading it in full
# (C4 is hosted on the Hub under the allenai namespace)
dataset = load_dataset("allenai/c4", "en", streaming=True)
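
A streamed dataset is consumed as an iterable rather than indexed like a list, so a quick way to inspect it is to pull only the first few records. The sketch below illustrates that idea under stated assumptions: preview and open_c4_stream are hypothetical helper names, and the stand-in generator only mimics the shape of streamed records so the example runs without a download.

```python
from itertools import islice


def preview(stream, n=3):
    """Return the first n records of any iterable without consuming the rest."""
    return list(islice(stream, n))


def open_c4_stream():
    """Open C4 in streaming mode (needs the datasets library and network access)."""
    from datasets import load_dataset  # imported lazily so the demo below runs offline

    return load_dataset("allenai/c4", "en", split="train", streaming=True)


# Stand-in for a streamed dataset: a lazy generator of made-up records.
fake_stream = ({"text": f"document {i}"} for i in range(1_000_000))
first_two = preview(fake_stream, 2)
print(first_two)
```

With the real library installed, preview(open_c4_stream(), 3) should fetch only the records it needs instead of the whole corpus.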

Kaggle Datasets

250,000+ datasets uploaded by the community. Features notebooks, competitions, and discussions. Many datasets come with starter code and exploratory analysis.

Terminal
# Install the Kaggle CLI (it needs an API token, e.g. in ~/.kaggle/kaggle.json)
$ pip install kaggle

# Download a dataset
$ kaggle datasets download -d zillow/zecon

# Search for datasets
$ kaggle datasets list -s "sentiment analysis"

UCI Machine Learning Repository

600+ datasets spanning classification, regression, clustering, and more. The longest-running ML dataset repository, with datasets from 1987 to present.

Google Dataset Search

A search engine specifically for datasets. Indexes datasets from across the web, including government portals, academic repositories, and data publishers. Available at datasetsearch.research.google.com.

AWS Open Data

Amazon's registry of open datasets hosted on AWS. Includes large-scale datasets like satellite imagery, genomics data, and weather data. Free to access and use with AWS tools.

Microsoft Open Datasets

Curated datasets hosted on Azure, optimized for use with Azure ML. Includes weather, demographics, holidays, and public safety data.

Papers With Code

Links datasets to the papers and models that use them. Shows benchmark leaderboards, making it easy to find the standard dataset for any ML task along with state-of-the-art results.

Web Scraping (Ethical)

Important: Always check a website's terms of service and robots.txt before scraping. Respect rate limits, don't collect personal data without consent, and comply with applicable laws (GDPR, CCPA, etc.).
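
The robots.txt check described above can be automated with Python's standard library. The following is a minimal sketch, not a complete scraping policy: the robots.txt content is a made-up sample inlined so the example runs offline, and my-dataset-bot, build_parser, may_fetch, and polite_delay are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "my-dataset-bot"  # hypothetical crawler name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(parser: RobotFileParser, user_agent: str = USER_AGENT,
                 default: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(may_fetch(parser, "https://example.com/public/data.csv"))
print(may_fetch(parser, "https://example.com/private/data.csv"))
print(polite_delay(parser))
```

For a live site, RobotFileParser can load the file itself via set_url(...) followed by read(). Passing this check does not replace reading the terms of service or the legal requirements mentioned above.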

APIs for Live Data

API | Data Type | Python Library
Twitter/X API | Social media posts | tweepy
Reddit API | Forum discussions | praw
Yahoo Finance | Stock prices | yfinance
OpenWeatherMap | Weather data | requests
News API | News articles | newsapi-python
Spotify API | Music metadata | spotipy
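
Turning a live API into a dataset usually means building a request and flattening each JSON response into a row. The sketch below uses OpenWeatherMap as the example. Treat the endpoint path, query parameters, and response field names as assumptions recalled from its current-weather API and check them against the official documentation. The sample payload is made up, and build_weather_url and flatten_weather are hypothetical helper names.

```python
from urllib.parse import urlencode

# Assumed endpoint for current weather; verify against the provider's docs.
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city: str, api_key: str, units: str = "metric") -> str:
    """Build the request URL for one city."""
    query = urlencode({"q": city, "appid": api_key, "units": units})
    return f"{BASE_URL}?{query}"


def flatten_weather(payload: dict) -> dict:
    """Flatten one nested JSON response into a flat row for a table or CSV."""
    main = payload.get("main", {})
    weather = payload.get("weather") or [{}]
    return {
        "city": payload.get("name"),
        "timestamp": payload.get("dt"),
        "temp": main.get("temp"),
        "humidity": main.get("humidity"),
        "description": weather[0].get("description"),
    }


# Made-up payload with the assumed response shape, so the example runs offline.
SAMPLE_RESPONSE = {
    "name": "London",
    "dt": 1700000000,
    "main": {"temp": 11.5, "humidity": 82},
    "weather": [{"description": "light rain"}],
}

# With the requests library, a live call would look like:
#   payload = requests.get(build_weather_url("London", API_KEY), timeout=10).json()
row = flatten_weather(SAMPLE_RESPONSE)
print(row)
```

Collecting one such row per request, with a pause between calls to respect the provider's rate limits, builds up a tabular dataset over time.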

Comparison of Dataset Sources

Source | Datasets | Best For | API/CLI | Cost
Hugging Face | 200K+ | NLP, ML research | Yes | Free
Kaggle | 250K+ | Competitions, tabular | Yes | Free
UCI | 600+ | Classic ML benchmarks | No | Free
Google Dataset Search | 25M+ | Discovery | No | Free
AWS Open Data | 400+ | Large-scale data | AWS CLI | Free (egress costs)
Papers With Code | 5K+ | Research benchmarks | No | Free

Next Up

Learn how to create your own datasets from scratch with data collection, annotation, and quality control.