AI Data Lifecycle Advanced

AI data has a lifecycle from creation through active use to archival and deletion. Without lifecycle management, storage costs grow unbounded as teams accumulate datasets, checkpoints, and model artifacts. This lesson covers policies and automation for managing each type of AI data through its complete lifecycle.

AI Data Types and Retention

Data Type          | Typical Size    | Active Period       | Retention Policy
-------------------|-----------------|---------------------|----------------------------------------------------------
Training datasets  | 100 GB - 10 TB  | Weeks to months     | Keep versioned; archive after 6 months unused
Checkpoints        | 10-100 GB each  | During training     | Keep best 3; delete rest after training completes
Model artifacts    | 1-50 GB         | While in production | Keep all production versions; archive after decommission
Experiment logs    | 1-100 MB        | During analysis     | Keep 1 year for reproducibility
TensorBoard logs   | 100 MB - 10 GB  | During experiment   | Delete after 30 days unless bookmarked

Checkpoint Management

Checkpoints are often the single largest source of storage waste in AI infrastructure, since a long training run can leave behind dozens of multi-gigabyte snapshots:

  • Checkpoint frequency — Save every N steps, not every epoch; tune based on training stability
  • Rolling checkpoints — Keep only the last K checkpoints; automatically delete older ones
  • Best-model tracking — Save checkpoints that achieve best validation metric; delete others
  • Async checkpoint saving — Write checkpoints in background threads to avoid stalling training
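The rolling and best-model policies above can be combined in a small pruning helper. This is a minimal sketch, not any framework's API: the `ckpt_<step>.pt` naming convention and the `prune_checkpoints` function are assumptions made for illustration.

```python
import os
import re

def prune_checkpoints(ckpt_dir, keep_last=3, keep_best=None):
    """Delete all but the newest `keep_last` step-numbered checkpoints,
    always preserving any paths listed in `keep_best` (e.g. the checkpoint
    with the best validation metric). Returns the deleted paths."""
    keep_best = set(keep_best or [])
    # Assumed naming convention: "ckpt_000123.pt", where the digits are the step.
    pattern = re.compile(r"ckpt_(\d+)\.pt$")
    ckpts = sorted(
        (int(m.group(1)), os.path.join(ckpt_dir, f))
        for f in os.listdir(ckpt_dir)
        if (m := pattern.match(f))
    )
    survivors = {path for _, path in ckpts[-keep_last:]} | keep_best
    deleted = []
    for _, path in ckpts:
        if path not in survivors:
            os.remove(path)
            deleted.append(path)
    return deleted
```

In practice this would run as a callback at each checkpoint save, so stale checkpoints never accumulate past the retention window.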

Dataset Versioning

Version your datasets alongside your model code for reproducibility:

  • DVC (Data Version Control) — Git-like versioning for large files; stores data in remote storage with pointers in Git
  • Delta Lake / Iceberg — Table format versioning for structured datasets with time-travel queries
  • Object storage versioning — S3 versioning for immutable dataset snapshots
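Tools like DVC identify dataset versions by content hash rather than by name, so the same bytes always map to the same version. A minimal sketch of that idea is below; `dataset_version_hash` is a hypothetical helper for illustration, not DVC's actual API:

```python
import hashlib
import os

def dataset_version_hash(root):
    """Compute a deterministic SHA-256 over every file under `root`.
    Files are visited in sorted path order so the result is stable
    across runs and machines."""
    h = hashlib.sha256()
    entries = sorted(
        (dirpath, filenames) for dirpath, _, filenames in os.walk(root)
    )
    for dirpath, filenames in entries:
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Hash the relative path too, so a rename changes the version.
            h.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
    return h.hexdigest()
```

Storing this hash alongside the training run's config is what makes "which exact data trained this model?" answerable months later.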

Automated Cleanup Policies

  • TTL-based deletion — Automatically delete temporary data (dev datasets, failed job outputs) after a defined period
  • Quota enforcement — Set per-team storage quotas; alert and enforce when exceeded
  • Orphan detection — Find and clean up data not referenced by any active experiment, model, or pipeline
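A TTL sweep can be as simple as comparing file modification times against a cutoff. The `ttl_cleanup` helper below is an illustrative sketch under that assumption; production systems typically track creation time and ownership in a metadata store rather than trusting filesystem mtimes:

```python
import os
import time

def ttl_cleanup(root, max_age_days, now=None):
    """Delete files under `root` whose modification time is older than
    `max_age_days` days. Returns the list of deleted paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    deleted = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                deleted.append(path)
    return deleted
```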

Important: Always ensure you can reproduce any production model before deleting its training data. At minimum, keep the dataset version hash, training configuration, and model artifact. Many regulations also require retaining training data for audit purposes.
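The minimum reproducibility record described above can be captured as a small manifest written alongside each production release. The field names here are illustrative assumptions, not a standard schema:

```python
import time

def reproducibility_manifest(model_name, dataset_hash, config, artifact_uri):
    """Bundle the minimum needed to reproduce a production model:
    which data, which config, and where the artifact lives."""
    return {
        "model": model_name,
        "dataset_version_hash": dataset_hash,
        "training_config": config,
        "artifact_uri": artifact_uri,
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
```

Because this record is tiny, it can be retained indefinitely even after the raw training data has been archived to cold storage.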

Ready for Best Practices?

The final lesson covers production storage operations, capacity planning, and disaster recovery.

Next: Best Practices →