AI Data Lifecycle Advanced
AI data has a lifecycle from creation through active use to archival and deletion. Without lifecycle management, storage costs grow unbounded as teams accumulate datasets, checkpoints, and model artifacts. This lesson covers policies and automation for managing each type of AI data through its complete lifecycle.
AI Data Types and Retention
| Data Type | Typical Size | Active Period | Retention Policy |
|---|---|---|---|
| Training datasets | 100GB - 10TB | Weeks to months | Keep versioned; archive after 6 months unused |
| Checkpoints | 10-100GB each | During training | Keep best 3; delete rest after training completes |
| Model artifacts | 1-50GB | While in production | Keep all production versions; archive after decommission |
| Experiment logs | 1-100MB | During analysis | Keep 1 year for reproducibility |
| TensorBoard logs | 100MB-10GB | During experiment | Delete after 30 days unless bookmarked |
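A retention table like the one above can be encoded as a small declarative policy that cleanup jobs consult. This is a minimal sketch; the type names and a single expiry period per type are assumptions, not a standard schema (real policies often distinguish "archive" from "delete", as the table does).

```python
from datetime import timedelta

# Illustrative encoding of the retention table above; keys and
# periods are assumptions for this sketch, not a standard schema.
RETENTION = {
    "training_dataset": timedelta(days=180),  # archive after 6 months unused
    "experiment_log":   timedelta(days=365),  # keep 1 year for reproducibility
    "tensorboard_log":  timedelta(days=30),   # delete after 30 days unless bookmarked
}

def is_expired(data_type: str, age: timedelta) -> bool:
    """Return True when data of this type has outlived its retention period.

    Unknown types are never expired, so new data types are safe by default.
    """
    limit = RETENTION.get(data_type)
    return limit is not None and age > limit
```

Keeping the policy in one declarative table means retention changes are a one-line diff rather than edits scattered across cleanup scripts.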
Checkpoint Management
Checkpoints are often the largest source of storage waste in AI infrastructure:
- Checkpoint frequency — Save every N steps, not every epoch; tune based on training stability
- Rolling checkpoints — Keep only the last K checkpoints; automatically delete older ones
- Best-model tracking — Save checkpoints that achieve best validation metric; delete others
- Async checkpoint saving — Write checkpoints in background threads to avoid stalling training
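The rolling-window and best-model policies above can be combined in one small manager. This is a sketch under stated assumptions: a higher validation metric is better, checkpoints are identified by a path string, and deletion is recorded rather than performed (a real implementation would call `os.remove` where indicated).

```python
import heapq

class CheckpointManager:
    """Keep the last `keep_last` checkpoints plus the `keep_best` with the
    highest validation metric; prune everything else.

    Sketch only: `deleted` records pruned paths instead of removing files.
    """

    def __init__(self, keep_last: int = 3, keep_best: int = 3):
        self.keep_last = keep_last
        self.keep_best = keep_best
        self.recent = []   # (step, path), oldest first
        self.best = []     # min-heap of (metric, path)
        self.deleted = []  # paths pruned so far

    def save(self, step: int, path: str, metric: float) -> None:
        """Register a newly written checkpoint, then prune."""
        self.recent.append((step, path))
        heapq.heappush(self.best, (metric, path))
        self._prune()

    def _prune(self) -> None:
        # Trim best-K: evict the lowest metric once we exceed keep_best.
        while len(self.best) > self.keep_best:
            _, path = heapq.heappop(self.best)
            self._maybe_delete(path)
        # Trim the rolling window: drop the oldest beyond keep_last.
        while len(self.recent) > self.keep_last:
            _, path = self.recent.pop(0)
            self._maybe_delete(path)

    def _maybe_delete(self, path: str) -> None:
        # Only delete a checkpoint no longer protected by either policy.
        protected = {p for _, p in self.recent} | {p for _, p in self.best}
        if path not in protected:
            self.deleted.append(path)  # real code: os.remove(path)
```

Because a checkpoint is only deleted once neither policy protects it, a checkpoint that falls out of the rolling window but holds a top validation metric survives, which is exactly the behavior best-model tracking requires.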
Dataset Versioning
Version your datasets alongside your model code for reproducibility:
- DVC (Data Version Control) — Git-like versioning for large files; stores data in remote storage with pointers in Git
- Delta Lake / Iceberg — Table format versioning for structured datasets with time-travel queries
- Object storage versioning — S3 versioning for immutable dataset snapshots
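All three approaches rest on the same idea: identify a dataset version by a content hash rather than a mutable path. A minimal sketch of that fingerprinting step (similar in spirit to the pointers DVC stores in Git, though DVC's actual hashing scheme differs):

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(root: str) -> str:
    """Content hash over every file under `root`, walked in sorted path
    order so identical bytes always yield the same version id."""
    digest = hashlib.sha256()
    base = Path(root)
    for path in sorted(base.rglob("*")):
        if path.is_file():
            # Mix in the relative path so renames change the fingerprint too.
            digest.update(str(path.relative_to(base)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()
```

Recording this hash in the experiment log is what lets you later prove which exact dataset snapshot trained a given model.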
Automated Cleanup Policies
- TTL-based deletion — Automatically delete temporary data (dev datasets, failed job outputs) after a defined period
- Quota enforcement — Set per-team storage quotas; alert and enforce when exceeded
- Orphan detection — Find and clean up data not referenced by any active experiment, model, or pipeline
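TTL-based deletion and orphan detection compose naturally: expire by age, but never touch anything still referenced. A minimal dry-run sketch, assuming the set of referenced paths comes from your experiment tracker or pipeline metadata:

```python
import time
from pathlib import Path

def find_expired(root: str, ttl_seconds: float, referenced: set) -> list:
    """Return files under `root` older than the TTL and not referenced by
    any active experiment, model, or pipeline.

    Dry-run by design: callers review the list before deleting.
    """
    now = time.time()
    expired = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if str(path) in referenced:
            continue  # orphan check: skip anything still in use
        if now - path.stat().st_mtime > ttl_seconds:
            expired.append(path)
    return expired
```

Returning candidates instead of deleting in place keeps the policy auditable; the destructive step can then be gated behind quota alerts or a review queue.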
Important: Always ensure you can reproduce any production model before deleting its training data. At minimum, keep the dataset version hash, training configuration, and model artifact. Many regulations also require retaining training data for audit purposes.
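The minimum reproducibility record described above can be captured as a small manifest written before any training data is deleted. The field names here are illustrative, not a standard:

```python
import json
from pathlib import Path

def write_repro_manifest(path: str, dataset_hash: str,
                         train_config: dict, model_artifact: str) -> dict:
    """Persist the minimum needed to reproduce a production model:
    dataset version hash, training configuration, and artifact location.

    Field names are illustrative for this sketch, not a standard schema.
    """
    manifest = {
        "dataset_version_hash": dataset_hash,
        "training_config": train_config,
        "model_artifact": model_artifact,
    }
    Path(path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Storing the manifest next to the model artifact (and under the same retention policy) ensures the audit trail outlives the raw training data.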
Ready for Best Practices?
The final lesson covers production storage operations, capacity planning, and disaster recovery.
Next: Best Practices →
Lilly Tech Systems