S3 Data Lake for ML (Intermediate)
S3 is the backbone of data storage for ML workloads on AWS. This lesson covers how to architect S3 buckets for training data, choose optimal data formats, integrate with Glue for cataloging, and maximize throughput for data-intensive training jobs.
S3 Bucket Architecture for ML
Bucket Structure
s3://ml-data-lake-{account-id}/
  raw/            # Raw ingested data
  processed/      # Cleaned, transformed data
  features/       # Computed features for training
  training-sets/  # Versioned train/val/test splits

s3://ml-artifacts-{account-id}/
  models/         # Trained model artifacts
  checkpoints/    # Training checkpoints
  experiments/    # Experiment logs and metrics

s3://ml-serving-{account-id}/
  production-models/  # Models deployed to production
  inference-logs/     # Prediction logs for monitoring
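A layout like this is easiest to keep consistent if key construction goes through one helper rather than ad-hoc string formatting. The sketch below is illustrative: the `data_lake_uri` function, its zone names, and the example account ID are assumptions mirroring the layout above, not an AWS convention.

```python
# Hypothetical helper that builds keys following the data-lake layout above.
# Bucket name pattern and zone names are illustrative assumptions.

def data_lake_uri(account_id: str, zone: str, *parts: str) -> str:
    """Build an s3:// URI inside the ml-data-lake bucket.

    zone must be one of the top-level prefixes in the layout, so typos
    cannot silently create new top-level directories.
    """
    zones = {"raw", "processed", "features", "training-sets"}
    if zone not in zones:
        raise ValueError(f"unknown zone {zone!r}; expected one of {sorted(zones)}")
    return f"s3://ml-data-lake-{account_id}/{zone}/" + "/".join(parts)

# Example: a versioned training split for a hypothetical "churn" dataset
uri = data_lake_uri("123456789012", "training-sets", "churn", "v3", "train.parquet")
# -> s3://ml-data-lake-123456789012/training-sets/churn/v3/train.parquet
```

Centralizing key construction also makes it trivial to later add lifecycle tags or enforce a versioning scheme in one place.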
Data Format Selection
| Format | Best For | Compression | Throughput |
|---|---|---|---|
| Parquet | Tabular/structured data | Excellent (columnar) | High |
| TFRecord | TensorFlow pipelines | Good | Very High |
| WebDataset | Image/multimodal training | Good (tar-based) | Very High |
| Arrow/Feather | In-memory analytics | Good | Excellent |
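WebDataset's "Very High" throughput comes partly from its simplicity: a shard is just a plain tar archive in which files belonging to one sample share a basename. A minimal writer needs only the standard library; the sample keys and filenames below are illustrative.

```python
import io
import tarfile

def write_webdataset_shard(path: str, samples: list) -> None:
    """Write samples as a WebDataset-style tar shard.

    Each sample is a dict mapping an extension (e.g. "jpg", "cls") to bytes.
    All members of one sample share the same basename, which is how
    WebDataset readers group them back into samples.
    """
    with tarfile.open(path, "w") as tar:
        for i, sample in enumerate(samples):
            base = f"sample{i:06d}"
            for ext, payload in sample.items():
                info = tarfile.TarInfo(name=f"{base}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# Illustrative usage: two samples, each an image payload plus a class label
write_webdataset_shard("shard-000000.tar", [
    {"jpg": b"<jpeg bytes>", "cls": b"0"},
    {"jpg": b"<jpeg bytes>", "cls": b"1"},
])
```

Because a shard is one large sequential object, training jobs can stream it from S3 with a single GET instead of issuing one request per tiny file.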
High-Performance S3 Access
- S3 Express One Zone — Single-digit millisecond latency; AWS cites up to 10x faster data access than S3 Standard
- S3 VPC Endpoint — Avoid internet traversal for training data access
- Multi-part parallel reads — Read large files with multiple threads for higher throughput
- Mountpoint for Amazon S3 — Mount a bucket as a local filesystem for legacy training scripts that expect file paths (optimized for sequential reads; not fully POSIX-compliant)
- FSx for Lustre — High-performance filesystem backed by S3 for training data caching
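The multi-part parallel read pattern is worth seeing concretely. The sketch below is self-contained, so a local `fetch_range` stub stands in for a real S3 ranged GET (in boto3 that would be `get_object` with a `Range` header, as noted in the comment); chunk size and worker count are illustrative defaults.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8 * 1024 * 1024  # 8 MiB parts; tune per object size and network

def fetch_range(blob: bytes, start: int, end: int) -> bytes:
    # Stub standing in for an S3 ranged GET, e.g.
    # s3.get_object(Bucket=..., Key=..., Range=f"bytes={start}-{end}")["Body"].read()
    return blob[start:end + 1]

def parallel_read(blob: bytes, chunk: int = CHUNK, workers: int = 8) -> bytes:
    """Fetch an object as byte ranges in parallel, reassembled in order."""
    size = len(blob)
    # Inclusive byte ranges, matching HTTP Range semantics
    ranges = [(off, min(off + chunk, size) - 1) for off in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: fetch_range(blob, r[0], r[1]), ranges)
    return b"".join(parts)
```

Since each range is an independent request, throughput scales with worker count until you saturate the instance's network bandwidth or S3's per-prefix request limits.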
Performance Tip: Use S3 prefix-based partitioning (e.g., by date or shard ID) to maximize S3 request throughput. S3 scales to at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix, and distributes objects across partitions based on prefix, so spreading keys over diverse prefixes raises aggregate throughput.
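One common way to diversify prefixes is to prepend a short, deterministic hash of the key, so writes fan out evenly instead of piling onto one hot prefix. The `shard=` naming and shard count below are illustrative assumptions, not an S3 requirement.

```python
import hashlib

def sharded_key(key: str, shards: int = 16) -> str:
    """Prepend a deterministic hash-derived prefix to an object key.

    The same logical key always maps to the same shard, so readers can
    recompute the full key without any lookup table.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    shard = int(digest[:4], 16) % shards
    return f"shard={shard:02d}/{key}"

# Example: spread feature files across 16 prefixes
full_key = sharded_key("features/2024-06-01/part-0001.parquet")
```

Date-based partitioning alone can be an anti-pattern for hot workloads: all of today's traffic hits one prefix. Hash prefixes trade human-browsable paths for even request distribution.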
Ready to Configure VPC?
The next lesson covers VPC design for ML workloads on AWS.