Dataset Best Practices
Follow these best practices to select high-quality datasets, handle common data challenges, version your data, and ensure ethical and reproducible ML workflows.
Dataset Selection Checklist
- Is the dataset relevant to your specific task and domain?
- Is it large enough for your model type?
- Is the label quality sufficient?
- Is the license compatible with your use case?
- Is it well-documented (datasheet, data card)?
- Has it been used in published research?
- Are there known biases or limitations?
Data Quality Checklist
```python
import pandas as pd

df = pd.read_csv("data.csv")

# Check for missing values
print(df.isnull().sum())

# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# Check class distribution
print(df["label"].value_counts(normalize=True))

# Check data types
print(df.dtypes)

# Basic statistics
print(df.describe())
```
Handling Imbalanced Datasets
When one class is much more common than the others (e.g., fraud detection: 99.8% legitimate, 0.2% fraudulent), a model can score high accuracy while missing most minority cases. These techniques help:
| Technique | How It Works | When to Use |
|---|---|---|
| Oversampling (SMOTE) | Generate synthetic minority samples | Small datasets, need more minority samples |
| Undersampling | Remove majority class samples | Large datasets, can afford to lose data |
| Class weights | Give higher weight to minority class in loss function | Any size, simplest approach |
| Threshold tuning | Adjust classification threshold | Binary classification, after training |
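```python
# SMOTE oversampling
# X_train, y_train are assumed to be your training features and labels.
# Resample the training split only; leave validation and test data untouched.
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Class weights in scikit-learn
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(class_weight="balanced")
```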
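The table also lists threshold tuning, which the snippet above does not cover. Here is a minimal, dependency-free sketch, assuming you already have predicted probabilities for a validation set. The helper names `f1_at_threshold` and `best_f1_threshold` and the toy scores are illustrative assumptions, not part of any library.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 for the positive class when predicting 1 if prob >= threshold."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(y_true, y_prob):
    """Try each distinct predicted probability as a cut-off; return (threshold, f1)."""
    candidates = sorted(set(y_prob))
    scored = [(f1_at_threshold(y_true, y_prob, t), t) for t in candidates]
    best_f1, best_t = max(scored)
    return best_t, best_f1


# Toy validation data (illustrative only): 2 positives among 10 samples.
y_val = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p_val = [0.05, 0.10, 0.10, 0.15, 0.20, 0.25, 0.30, 0.45, 0.35, 0.60]

threshold, f1 = best_f1_threshold(y_val, p_val)
default_f1 = f1_at_threshold(y_val, p_val, 0.5)
print(f"default 0.5 -> F1={default_f1:.3f}; tuned {threshold:.2f} -> F1={f1:.3f}")
```

Pick the threshold on validation data, not on the test set, so the final test score stays an honest estimate.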
Train/Val/Test Split Strategies
| Strategy | When to Use |
|---|---|
| Random split (80/10/10) | Default for most tasks, large datasets |
| Stratified split | Imbalanced classes — preserves class distribution |
| Temporal split | Time series data — train on past, test on future |
| Group split | Multiple samples per entity (e.g., patient, user) |
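The temporal and group strategies in the table can be sketched without any libraries. This is a rough illustration under stated assumptions: rows are plain dicts, and `temporal_split`, `group_split`, and the field names `patient` and `day` are hypothetical. scikit-learn provides `TimeSeriesSplit`, `GroupKFold`, and `GroupShuffleSplit` for production use.

```python
import hashlib


def temporal_split(rows, time_key, test_fraction=0.2):
    """Sort by time and hold out the most recent rows for testing."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    cut = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


def group_split(rows, group_key, test_fraction=0.2):
    """Assign whole groups to train or test via a stable hash of the group id."""
    train, test = [], []
    for row in rows:
        digest = hashlib.sha256(str(row[group_key]).encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
        (test if bucket < test_fraction else train).append(row)
    return train, test


# Toy records (illustrative): four visits per patient, one visit per day.
rows = [{"patient": f"p{i % 25}", "day": i, "y": i % 2} for i in range(100)]

past, future = temporal_split(rows, "day")
train, test = group_split(rows, "patient")

print(len(past), len(future))
print(len(train), len(test))
```

Hashing the group id keeps every sample from one entity on the same side of the split and gives the same assignment on every run. The test share only approximates `test_fraction`, because whole groups move together.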
Cross-Validation
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# model is assumed to be any scikit-learn classifier; X, y are features and labels.

# 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(f"F1: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
Data Versioning with DVC
DVC (Data Version Control) tracks dataset versions alongside your code in Git:
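```shell
# Initialize DVC in your git repo
$ dvc init

# Track a dataset
$ dvc add data/training_data.csv

# Commit the small .dvc pointer file so Git records this data version
$ git add data/training_data.csv.dvc data/.gitignore
$ git commit -m "Track training data with DVC"

# Push data to remote storage (S3, GCS, etc.)
$ dvc remote add -d storage s3://my-bucket/dvc
$ dvc push

# Switch between dataset versions (v1.0 is an example Git tag)
$ git checkout v1.0
$ dvc checkout
```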
Ethical Dataset Usage
- Privacy: Remove or anonymize personally identifiable information (PII)
- Consent: Ensure data was collected with appropriate consent
- Bias audit: Check for demographic and representation biases
- Attribution: Credit dataset creators and cite papers
- License compliance: Follow the dataset's license terms
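As one illustration of the privacy item above, the sketch below pseudonymizes an identifier with a keyed hash and redacts email addresses from free text. It assumes a deliberately simple email pattern, and the names `pseudonymize`, `redact_emails`, and `SECRET` are hypothetical. Pseudonymization is weaker than full anonymization, so treat this as a starting point rather than a compliance solution.

```python
import hashlib
import hmac
import re

# Simplified pattern: catches common addresses, not every valid email.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256), truncated to 16 hex chars."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


# Hypothetical key: store it outside the dataset and outside the repository.
SECRET = b"replace-with-a-real-secret"

record = {
    "user_id": "alice@example.com",
    "note": "Contact alice@example.com about the refund.",
}
clean = {
    "user_id": pseudonymize(record["user_id"], SECRET),
    "note": redact_emails(record["note"]),
}
print(clean)
```

The keyed hash maps the same user to the same token, so records can still be joined or grouped per user without storing the raw identifier.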
Documentation Standards
Every dataset should be documented with a datasheet (Gebru et al., 2021) or data card covering:
- Motivation and purpose
- Collection methodology
- Composition (size, format, splits)
- Preprocessing and labeling
- Uses and limitations
- Distribution and licensing
- Maintenance plan
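One lightweight way to enforce the checklist above is to keep the data card as structured data and check it for gaps. The sketch below assumes the card is a plain dict. The key spellings, the `missing_sections` helper, and every value in the example card are placeholders invented for illustration.

```python
import json

# Section names follow the checklist above; the key spellings are an assumption.
REQUIRED_SECTIONS = [
    "motivation",
    "collection",
    "composition",
    "preprocessing",
    "uses_and_limitations",
    "distribution",
    "maintenance",
]


def missing_sections(card):
    """Return the required sections that are absent or blank in a data card."""
    return [name for name in REQUIRED_SECTIONS if not str(card.get(name, "")).strip()]


# Hypothetical card for an imaginary dataset; every value is a placeholder.
card = {
    "motivation": "Benchmark for support-ticket routing.",
    "collection": "Exported from an internal helpdesk.",
    "composition": "CSV with train/val/test splits.",
    "preprocessing": "Emails redacted; labels assigned by two annotators.",
    "uses_and_limitations": "English only.",
    "distribution": "Internal use; license to be decided.",
    "maintenance": "",
}

gaps = missing_sections(card)
print(json.dumps({"complete": not gaps, "missing": gaps}, indent=2))
```

A check like this could run in CI next to the dataset, so a card with blank sections is caught before release.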
Frequently Asked Questions
How much data do I need?
It depends on the task. For fine-tuning pretrained models, 100-1,000 samples can be sufficient. For training from scratch, you typically need 10,000+ samples. More complex tasks and larger models need more data. Start with what you have and measure whether more data improves performance.
Should I always use the largest dataset available?
Not necessarily. A smaller, high-quality dataset can outperform a larger noisy one. Focus on data quality (accurate labels, representative samples) over raw quantity. That said, for pretraining large models, scale is critical.
How do I handle missing values?
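Options include imputation (mean, median, mode, or model-based), deletion (reasonable when values are missing completely at random and only a small percentage of rows is affected), or using models that handle missing values natively (XGBoost, LightGBM). To avoid data leakage, split first, fit the imputer on the training set only, and then apply those fitted values to the validation and test sets.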
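The leakage point is easiest to see with numbers. This minimal sketch uses toy values and represents missing entries as `None`; the helpers `fit_median` and `impute` are hypothetical. It learns the fill value from the training split alone and compares it with the value you would get by imputing before the split.

```python
import statistics


def fit_median(train_values):
    """Learn the fill value from observed (non-None) training values only."""
    return statistics.median(v for v in train_values if v is not None)


def impute(values, fill):
    """Replace missing entries with a previously fitted fill value."""
    return [fill if v is None else v for v in values]


# Toy feature column, already split (illustrative values).
train_age = [22, None, 35, 41, None, 29]
test_age = [None, 90, 88]

fill = fit_median(train_age)          # uses training data only
train_filled = impute(train_age, fill)
test_filled = impute(test_age, fill)  # test set reuses the training statistic

# For contrast: the statistic you would get by imputing before the split.
leaky_fill = fit_median(train_age + test_age)

print(f"train-only median: {fill}, median over all rows: {leaky_fill}")
```

In scikit-learn, placing `SimpleImputer` inside a `Pipeline` gives the same fit-on-train behavior, including within cross-validation folds.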
Can I combine multiple datasets?
Yes, but watch for domain shift, inconsistent label definitions, and duplicate samples across sources. Normalize features across datasets, and confirm that a model trained on the combined data performs better on your target task than models trained on each dataset alone.
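A rough sketch of the label-consistency and duplicate checks is shown below. It assumes text samples stored as `(text, label)` pairs. The `fingerprint` and `merge_datasets` helpers, the dataset names, and the label maps are all hypothetical, and the hash catches only exact duplicates after whitespace and case normalization.

```python
import hashlib


def fingerprint(text):
    """Hash of whitespace- and case-normalized text, for exact-duplicate checks."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_datasets(datasets, label_maps):
    """Concatenate datasets, mapping labels to a shared scheme and dropping repeats."""
    seen, merged = set(), []
    for name, rows in datasets.items():
        for text, label in rows:
            key = fingerprint(text)
            if key in seen:
                continue  # keep the first occurrence only
            seen.add(key)
            merged.append(
                {"text": text, "label": label_maps[name][label], "source": name}
            )
    return merged


# Two toy sentiment datasets with different label conventions (illustrative).
datasets = {
    "a": [("Great movie!", "pos"), ("Terrible plot.", "neg")],
    "b": [("great   movie!", 1), ("Loved it", 1)],
}
label_maps = {"a": {"pos": 1, "neg": 0}, "b": {1: 1, 0: 0}}

merged = merge_datasets(datasets, label_maps)
print(len(merged), [row["source"] for row in merged])
```

Keeping a `source` field lets you report metrics per original dataset later, which is one way to spot domain shift.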
How do I detect bias in my dataset?
Analyze the distribution of sensitive attributes (gender, race, age). Compare model performance across demographic groups (fairness metrics). Tools like Aequitas, Fairlearn, and AI Fairness 360 can help automate bias detection.
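Comparing performance across groups can start very simply. The sketch below computes recall (true-positive rate) per group on toy predictions. The `recall_by_group` helper and the group labels `A` and `B` are illustrative assumptions, and the tools named above offer far more complete metrics.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups):
    """True-positive rate per group: share of actual positives that were predicted."""
    hits, positives = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            hits[group] += int(pred == 1)
    return {g: hits[g] / positives[g] for g in positives}


# Toy predictions (illustrative): five samples from each of two groups.
groups = ["A"] * 5 + ["B"] * 5
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

rates = recall_by_group(y_true, y_pred, groups)
gap = max(rates.values()) - min(rates.values())
print(rates, f"gap={gap:.2f}")
```

A large gap between groups is a signal to investigate representation and label quality for the weaker group, not a verdict by itself.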
Course Complete!
You now have a comprehensive understanding of ML datasets — from classic benchmarks to creating and managing your own. Apply these practices to build better, more ethical machine learning systems.