S3 Data Lake for ML (Intermediate)
S3 is the backbone of data storage for ML workloads on AWS. This lesson covers how to architect S3 buckets for training data, choose optimal data formats, integrate with Glue for cataloging, and maximize throughput for data-intensive training jobs.
S3 Bucket Architecture for ML
Bucket Structure
s3://ml-data-lake-{account-id}/
  raw/            # Raw ingested data
  processed/      # Cleaned, transformed data
  features/       # Computed features for training
  training-sets/  # Versioned train/val/test splits

s3://ml-artifacts-{account-id}/
  models/         # Trained model artifacts
  checkpoints/    # Training checkpoints
  experiments/    # Experiment logs and metrics

s3://ml-serving-{account-id}/
  production-models/  # Models deployed to production
  inference-logs/     # Prediction logs for monitoring
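A layout like this is easiest to keep consistent if key construction goes through one helper rather than ad-hoc string formatting. The sketch below is illustrative: the `data_lake_uri` function, its zone names, and the example account ID are assumptions mirroring the layout above, not an AWS convention.

```python
# Hypothetical helper that builds keys following the data-lake layout above.
# Bucket name pattern and zone names are illustrative assumptions.

def data_lake_uri(account_id: str, zone: str, *parts: str) -> str:
    """Build an s3:// URI inside the ml-data-lake bucket.

    zone must be one of the top-level prefixes in the layout, so typos
    cannot silently create new top-level directories.
    """
    zones = {"raw", "processed", "features", "training-sets"}
    if zone not in zones:
        raise ValueError(f"unknown zone {zone!r}; expected one of {sorted(zones)}")
    return f"s3://ml-data-lake-{account_id}/{zone}/" + "/".join(parts)

# Example: a versioned training split for a hypothetical "churn" dataset
uri = data_lake_uri("123456789012", "training-sets", "churn", "v3", "train.parquet")
# -> s3://ml-data-lake-123456789012/training-sets/churn/v3/train.parquet
```

Centralizing key construction also makes it trivial to later add lifecycle tags or enforce a versioning scheme in one place.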
Data Format Selection
| Format | Best For | Compression | Throughput |
|---|---|---|---|
| Parquet | Tabular/structured data | Excellent (columnar) | High |
| TFRecord | TensorFlow pipelines | Good | Very High |
| WebDataset | Image/multimodal training | Good (tar-based) | Very High |
| Arrow/Feather | In-memory analytics | Good | Excellent |
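WebDataset's "Very High" throughput comes partly from its simplicity: a shard is just a plain tar archive in which files belonging to one sample share a basename. A minimal writer needs only the standard library; the sample keys and filenames below are illustrative.

```python
import io
import tarfile

def write_webdataset_shard(path: str, samples: list) -> None:
    """Write samples as a WebDataset-style tar shard.

    Each sample is a dict mapping an extension (e.g. "jpg", "cls") to bytes.
    All members of one sample share the same basename, which is how
    WebDataset readers group them back into samples.
    """
    with tarfile.open(path, "w") as tar:
        for i, sample in enumerate(samples):
            base = f"sample{i:06d}"
            for ext, payload in sample.items():
                info = tarfile.TarInfo(name=f"{base}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# Illustrative usage: two samples, each an image payload plus a class label
write_webdataset_shard("shard-000000.tar", [
    {"jpg": b"<jpeg bytes>", "cls": b"0"},
    {"jpg": b"<jpeg bytes>", "cls": b"1"},
])
```

Because a shard is one large sequential object, training jobs can stream it from S3 with a single GET instead of issuing one request per tiny file.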
High-Performance S3 Access
- S3 Express One Zone — Single-digit millisecond latency; AWS cites up to 10x faster data access than S3 Standard
- S3 VPC Endpoint — Avoid internet traversal for training data access
- Multi-part parallel reads — Read large files with multiple threads for higher throughput
- Mountpoint for Amazon S3 — Mount a bucket as a local filesystem for legacy training scripts that expect file paths (optimized for sequential reads; not fully POSIX-compliant)
- FSx for Lustre — High-performance filesystem backed by S3 for training data caching
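The multi-part parallel read pattern is worth seeing concretely. The sketch below is self-contained, so a local `fetch_range` stub stands in for a real S3 ranged GET (in boto3 that would be `get_object` with a `Range` header, as noted in the comment); chunk size and worker count are illustrative defaults.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8 * 1024 * 1024  # 8 MiB parts; tune per object size and network

def fetch_range(blob: bytes, start: int, end: int) -> bytes:
    # Stub standing in for an S3 ranged GET, e.g.
    # s3.get_object(Bucket=..., Key=..., Range=f"bytes={start}-{end}")["Body"].read()
    return blob[start:end + 1]

def parallel_read(blob: bytes, chunk: int = CHUNK, workers: int = 8) -> bytes:
    """Fetch an object as byte ranges in parallel, reassembled in order."""
    size = len(blob)
    # Inclusive byte ranges, matching HTTP Range semantics
    ranges = [(off, min(off + chunk, size) - 1) for off in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: fetch_range(blob, r[0], r[1]), ranges)
    return b"".join(parts)
```

Since each range is an independent request, throughput scales with worker count until you saturate the instance's network bandwidth or S3's per-prefix request limits.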
Performance Tip: Use S3 prefix-based partitioning (e.g., by date or shard ID) to maximize S3 request throughput. S3 scales to at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix, and distributes objects across partitions based on prefix, so spreading keys over diverse prefixes raises aggregate throughput.
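One common way to diversify prefixes is to prepend a short, deterministic hash of the key, so writes fan out evenly instead of piling onto one hot prefix. The `shard=` naming and shard count below are illustrative assumptions, not an S3 requirement.

```python
import hashlib

def sharded_key(key: str, shards: int = 16) -> str:
    """Prepend a deterministic hash-derived prefix to an object key.

    The same logical key always maps to the same shard, so readers can
    recompute the full key without any lookup table.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    shard = int(digest[:4], 16) % shards
    return f"shard={shard:02d}/{key}"

# Example: spread feature files across 16 prefixes
full_key = sharded_key("features/2024-06-01/part-0001.parquet")
```

Date-based partitioning alone can be an anti-pattern for hot workloads: all of today's traffic hits one prefix. Hash prefixes trade human-browsable paths for even request distribution.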
Ready to Configure VPC?
The next lesson covers VPC design for ML workloads on AWS.