Dataset Best Practices

Follow these best practices to select high-quality datasets, handle common data challenges, version your data, and ensure ethical and reproducible ML workflows.

Dataset Selection Checklist

Is the dataset relevant to your specific task and domain?
Is it large enough for your model type?
Is the label quality sufficient?
Is the license compatible with your use case?
Is it well-documented (datasheet, data card)?
Has it been used in published research?
Are there known biases or limitations?

Data Quality Checklist

Python

import pandas as pd

df = pd.read_csv("data.csv")

# Check for missing values
print(df.isnull().sum())

# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# Check class distribution
print(df["label"].value_counts(normalize=True))

# Check data types
print(df.dtypes)

# Basic statistics
print(df.describe())

Handling Imbalanced Datasets

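When one class is much more common than the others (e.g., fraud detection with 99.8% legitimate and 0.2% fraudulent transactions), a model can reach high accuracy while missing almost every minority case. Common remedies: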

Technique	How It Works	When to Use
Oversampling (SMOTE)	Generate synthetic minority samples	Small datasets, need more minority samples
Undersampling	Remove majority class samples	Large datasets, can afford to lose data
Class weights	Give higher weight to minority class in loss function	Any size, simplest approach
Threshold tuning	Adjust classification threshold	Binary classification, after training

Python

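# SMOTE oversampling (requires the imbalanced-learn package)
from imblearn.over_sampling import SMOTE

# X_train, y_train: features and labels of your training split.
# Resample the training split only, so validation and test data keep the real class balance.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Class weights in scikit-learn: "balanced" weights each class inversely to its frequency
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(class_weight="balanced")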
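Threshold tuning, the last row of the table, has no example above. The following is a minimal sketch, assuming a binary classifier that outputs probabilities and a held-out validation split used only for picking the threshold. The synthetic data, the logistic regression model, and the choice of F1 as the target metric are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); a stand-in for your own dataset.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Keep a validation split aside for choosing the threshold.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predicted probability of the positive (minority) class.
val_scores = clf.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, val_scores)
# precision and recall have one more entry than thresholds; drop the final point.
f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(precision[:-1] + recall[:-1], 1e-12)
best_threshold = float(thresholds[np.argmax(f1)])

# Apply the tuned threshold instead of the default 0.5.
y_pred = (val_scores >= best_threshold).astype(int)
print(f"Best threshold: {best_threshold:.3f}, validation F1: {f1.max():.3f}")
```

Pick the threshold on validation data and report final metrics on the untouched test set; tuning on the test set would overstate performance.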

Train/Val/Test Split Strategies

Strategy	When to Use
Random split (80/10/10)	Default for most tasks, large datasets
Stratified split	Imbalanced classes — preserves class distribution
Temporal split	Time series data — train on past, test on future
Group split	Multiple samples per entity (e.g., patient, user)
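The strategies in the table can be sketched with scikit-learn and NumPy as follows. This is an illustrative example on synthetic data: the array sizes, the `groups` array standing in for patient or user IDs, and the assumption that rows are already sorted by time for the temporal split are all made up for the demonstration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 5))
y = (rng.random(n) < 0.1).astype(int)      # about 10% positives
groups = rng.integers(0, 100, size=n)      # hypothetical patient/user IDs

# Stratified 80/10/10: split off 20%, then halve it into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

# Group split: all samples from one entity land on the same side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

# Temporal split: with rows sorted by time, cut at a fixed position.
cutoff = int(n * 0.8)
past_idx, future_idx = np.arange(cutoff), np.arange(cutoff, n)

print(len(X_train), len(X_val), len(X_test))
```

Dropping `stratify=...` from the two `train_test_split` calls gives the plain random 80/10/10 split.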

Cross-Validation

Python

from sklearn.model_selection import cross_val_score, StratifiedKFold

# model: any scikit-learn classifier; X, y: your features and labels
# 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(f"F1: {scores.mean():.3f} (+/- {scores.std():.3f})")
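Stratified k-fold assumes samples are independent and unordered. For time series or grouped data, scikit-learn provides `TimeSeriesSplit` and `GroupKFold`, which mirror the temporal and group split strategies. A minimal sketch on synthetic data follows; the model, the data, and the `groups` array of 20 hypothetical entities are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
groups = np.repeat(np.arange(20), 10)   # hypothetical: 20 entities, 10 rows each

model = LogisticRegression(max_iter=1000)

# Time-ordered folds: each fold trains on earlier rows and validates on later ones.
ts_cv = TimeSeriesSplit(n_splits=5)
ts_scores = cross_val_score(model, X, y, cv=ts_cv, scoring="f1_macro")

# Grouped folds: no entity appears in both the training and validation part of a fold.
group_cv = GroupKFold(n_splits=5)
group_scores = cross_val_score(
    model, X, y, cv=group_cv, groups=groups, scoring="f1_macro"
)

print(f"Time series F1: {ts_scores.mean():.3f}")
print(f"Grouped F1:     {group_scores.mean():.3f}")
```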

Data Versioning with DVC

DVC (Data Version Control) tracks dataset versions alongside your code: Git stores small .dvc pointer files, while the data itself lives in remote storage.

Terminal

# Initialize DVC in your git repo
$ dvc init

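# Track a dataset (creates data/training_data.csv.dvc and updates data/.gitignore)
$ dvc add data/training_data.csv
$ git add data/training_data.csv.dvc data/.gitignore
$ git commit -m "Track training data with DVC"

# Configure a default remote (S3, GCS, etc.) and push the data to it
$ dvc remote add -d storage s3://my-bucket/dvc
$ dvc push

# Switch between dataset versions (v1.0 is an existing Git tag or commit)
$ git checkout v1.0
$ dvc checkout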

Ethical Dataset Usage

Privacy: Remove or anonymize personally identifiable information (PII)
Consent: Ensure data was collected with appropriate consent
Bias audit: Check for demographic and representation biases
Attribution: Credit dataset creators and cite papers
License compliance: Follow the dataset's license terms
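As a small illustration of the privacy item, the sketch below masks email addresses in free text and replaces a direct identifier with a salted hash. Everything here is an assumption for demonstration: the column names, the helper functions `redact_emails` and `pseudonymize`, and the deliberately simple regular expression, which will not catch every email format. Salted hashing is pseudonymization, not full anonymization, so it may not satisfy stricter privacy requirements on its own.

```python
import hashlib
import re

import pandas as pd

# Simplified pattern; real-world PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def pseudonymize(value: str, salt: str) -> str:
    """Map an identifier to a stable, shortened salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


df = pd.DataFrame({
    "user_id": ["alice", "bob"],
    "comment": ["Contact me at alice@example.com", "No contact details here"],
})

df["comment"] = df["comment"].map(redact_emails)
# The salt is a placeholder; keep the real one secret and out of version control.
df["user_id"] = df["user_id"].map(lambda v: pseudonymize(v, salt="replace-with-secret"))
print(df)
```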

Documentation Standards

Every dataset should be documented with a datasheet (Gebru et al., 2021) or data card covering:

Motivation and purpose
Collection methodology
Composition (size, format, splits)
Preprocessing and labeling
Uses and limitations
Distribution and licensing
Maintenance plan

Frequently Asked Questions

How much data do I need?

It depends on the task. For fine-tuning pretrained models, 100-1,000 samples can be sufficient. For training from scratch, you typically need 10,000+ samples. More complex tasks and larger models need more data. Start with what you have and measure whether more data improves performance.
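One way to measure whether more data helps is a learning curve: train on growing subsets and watch the validation score. The sketch below uses scikit-learn's `learning_curve` on synthetic data; the model, dataset, and subset sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for your own dataset.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=42
)

# Train on 10% to 100% of the available training folds, with 5-fold CV at each size.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:5d} samples -> validation accuracy {score:.3f}")
```

If the validation score is still climbing at the largest size, collecting more data is likely to help; if it has flattened, better labels or a different model are usually the better investment.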

Should I always use the largest dataset available?

Not necessarily. A smaller, high-quality dataset can outperform a larger noisy one. Focus on data quality (accurate labels, representative samples) over raw quantity. That said, for pretraining large models, scale is critical.

How do I handle missing values?

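Options include imputation (mean, median, mode, or model-based), deletion (reasonable when only a small percentage of values is missing and the missingness is unrelated to the data), or models that handle missing values natively (XGBoost, LightGBM). Split the data first, then fit the imputer on the training set only and apply it to the validation and test sets; imputing before the split leaks test-set statistics into training.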
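To keep test data out of the imputation statistics, fit the imputer after splitting. A minimal sketch with scikit-learn's `SimpleImputer` follows; the synthetic data and the median strategy are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(loc=10.0, size=(200, 3))
X[rng.random(X.shape) < 0.1] = np.nan      # knock out about 10% of the values
y = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence the imputation statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)   # medians computed from training rows only
X_test_imp = imputer.transform(X_test)         # reuse the training medians

print("Training medians:", imputer.statistics_)
```

Wrapping the imputer and the model in a scikit-learn `Pipeline` applies the same rule automatically inside cross-validation.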

Can I combine multiple datasets?

Yes, but watch for domain shift, inconsistent label definitions, and duplicate samples. Normalize features across datasets, and validate that a model trained on the combined data outperforms models trained on each dataset alone.
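The label consistency and duplicate checks can be sketched with pandas. The two tiny datasets, the `label_map`, and the lowercase-and-strip normalization used to detect duplicates are hypothetical choices; real data usually needs a more careful matching rule.

```python
import pandas as pd

ds_a = pd.DataFrame({
    "text": ["good movie", "bad plot", "great cast"],
    "label": ["pos", "neg", "pos"],
})
ds_b = pd.DataFrame({
    "text": ["Good movie ", "dull pacing"],
    "label": ["positive", "negative"],
})

# Map each dataset's label vocabulary onto one shared scheme.
label_map = {"pos": 1, "positive": 1, "neg": 0, "negative": 0}

combined = pd.concat(
    [ds_a.assign(source="a"), ds_b.assign(source="b")], ignore_index=True
)
combined["label"] = combined["label"].map(label_map)

# Normalized key so near-identical texts count as duplicates.
combined["text_key"] = combined["text"].str.strip().str.lower()

# Texts that carry different labels in different sources need manual review.
label_counts = combined.groupby("text_key")["label"].nunique()
conflicts = label_counts[label_counts > 1]

# Keep the first occurrence of each text and record where it came from.
deduped = combined.drop_duplicates(subset="text_key", keep="first").drop(columns="text_key")
print(deduped)
print(f"Conflicting labels: {len(conflicts)}")
```

Keeping the `source` column makes it possible to evaluate per source later and spot domain shift.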

How do I detect bias in my dataset?

Analyze the distribution of sensitive attributes (gender, race, age). Compare model performance across demographic groups (fairness metrics). Tools like Aequitas, Fairlearn, and AI Fairness 360 can help automate bias detection.
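Before reaching for a dedicated library, a per-group comparison can be done with pandas and scikit-learn alone. The sketch below computes the selection rate and recall for each group on a tiny hand-made table; the `group`, `y_true`, and `y_pred` columns are hypothetical, and which metrics matter depends on the application.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical evaluation results with a sensitive attribute per row.
results = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 1, 0, 0, 1, 0, 1, 0],
})

rows = []
for name, part in results.groupby("group"):
    rows.append({
        "group": name,
        # Share of samples predicted positive in this group.
        "selection_rate": part["y_pred"].mean(),
        # Share of true positives the model actually found in this group.
        "recall": recall_score(part["y_true"], part["y_pred"]),
    })

per_group = pd.DataFrame(rows).set_index("group")
recall_gap = per_group["recall"].max() - per_group["recall"].min()

print(per_group)
print(f"Recall gap between groups: {recall_gap:.2f}")
```

Here both groups have the same selection rate, yet recall differs, which shows why more than one metric should be compared across groups.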

Course Complete!

You now have a comprehensive understanding of ML datasets — from classic benchmarks to creating and managing your own. Apply these practices to build better, more ethical machine learning systems.
