S3 Data Lake for ML

S3 is the backbone of data storage for ML workloads on AWS. This lesson covers how to architect S3 buckets for training data, choose optimal data formats, integrate with Glue for cataloging, and maximize throughput for data-intensive training jobs.

S3 Bucket Architecture for ML

Bucket Structure
s3://ml-data-lake-{account-id}/
  raw/                    # Raw ingested data
  processed/              # Cleaned, transformed data
  features/               # Computed features for training
  training-sets/          # Versioned train/val/test splits

s3://ml-artifacts-{account-id}/
  models/                 # Trained model artifacts
  checkpoints/            # Training checkpoints
  experiments/            # Experiment logs and metrics

s3://ml-serving-{account-id}/
  production-models/      # Models deployed to production
  inference-logs/         # Prediction logs for monitoring
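The layout above is easier to keep consistent if key construction is centralized. A minimal sketch, assuming the bucket names shown above; the helper names (`data_lake_key`, `training_set_prefix`) are illustrative, not an AWS API:

```python
# Sketch: helpers that enforce the data-lake bucket layout above.
# Bucket names follow the ml-data-lake-{account-id} convention; the
# helper functions themselves are hypothetical.

def data_lake_key(account_id: str, zone: str, *parts: str) -> str:
    """Build an S3 URI in the ml-data-lake bucket for a given zone."""
    zones = {"raw", "processed", "features", "training-sets"}
    if zone not in zones:
        raise ValueError(f"unknown zone: {zone}")
    return f"s3://ml-data-lake-{account_id}/{zone}/" + "/".join(parts)

def training_set_prefix(account_id: str, dataset: str,
                        version: str, split: str) -> str:
    """Versioned train/val/test split prefix under training-sets/."""
    assert split in {"train", "val", "test"}
    return data_lake_key(account_id, "training-sets", dataset, version, split)

print(training_set_prefix("123456789012", "churn", "v3", "train"))
# s3://ml-data-lake-123456789012/training-sets/churn/v3/train
```

Centralizing key construction this way means a layout change is a one-line edit rather than a hunt through every pipeline script.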

Data Format Selection

| Format        | Best For                   | Compression          | Throughput |
| ------------- | -------------------------- | -------------------- | ---------- |
| Parquet       | Tabular/structured data    | Excellent (columnar) | High       |
| TFRecord      | TensorFlow pipelines       | Good                 | Very High  |
| WebDataset    | Image/multimodal training  | Good (tar-based)     | Very High  |
| Arrow/Feather | In-memory analytics        | Good                 | Excellent  |
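The table's rule of thumb can be encoded directly in pipeline code. A minimal sketch; the workload labels and `pick_format` helper are illustrative simplifications, since real choices also depend on the surrounding tooling:

```python
# Sketch: the format-selection table as a lookup (hypothetical helper).
FORMAT_BY_WORKLOAD = {
    "tabular": "Parquet",         # columnar compression, predicate pushdown
    "tensorflow": "TFRecord",     # native tf.data integration
    "multimodal": "WebDataset",   # tar shards stream well from S3
    "in-memory": "Arrow/Feather", # zero-copy reads for analytics
}

def pick_format(workload: str) -> str:
    try:
        return FORMAT_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"no recommendation for workload: {workload}")

print(pick_format("tabular"))  # Parquet
```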

High-Performance S3 Access

  • S3 Express One Zone — Single-digit millisecond latency, up to 10x faster data access than S3 Standard
  • S3 VPC Endpoint — Avoid internet traversal for training data access
  • Multi-part parallel reads — Read large files with multiple threads for higher throughput
  • Mountpoint for Amazon S3 — Mount a bucket as a local filesystem (a subset of POSIX semantics) for legacy training scripts that expect file paths
  • FSx for Lustre — High-performance filesystem backed by S3 for training data caching
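The multi-part parallel read pattern above can be sketched with ranged GETs. A minimal sketch, assuming boto3 is installed and credentials are configured; the bucket/key arguments and the 8 MiB part size are placeholder choices:

```python
# Sketch: read one large S3 object with several parallel ranged GETs.
# Assumes boto3 + AWS credentials; part size and worker count are
# illustrative defaults, not tuned recommendations.
from concurrent.futures import ThreadPoolExecutor

def byte_ranges(size: int, part_size: int):
    """Split [0, size) into inclusive (start, end) HTTP Range pairs."""
    return [(start, min(start + part_size, size) - 1)
            for start in range(0, size, part_size)]

def parallel_read(bucket: str, key: str,
                  part_size: int = 8 * 1024 * 1024,
                  workers: int = 8) -> bytes:
    import boto3  # imported lazily so byte_ranges is usable without AWS
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    def fetch(rng):
        start, end = rng
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range=f"bytes={start}-{end}")
        return resp["Body"].read()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(fetch, byte_ranges(size, part_size)))
```

Because each ranged GET is an independent S3 request, throughput scales roughly with the worker count until you hit instance network or S3 request limits.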

Performance Tip: Use S3 prefix-based partitioning (e.g., by date or shard ID) to maximize S3 request throughput. S3 distributes objects across partitions based on prefix, so diverse prefixes enable higher aggregate throughput.
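One common way to get diverse prefixes is to derive a shard from a stable hash of the object name. A minimal sketch; the `sharded_key` helper and the shard count of 64 are illustrative choices, not AWS guidance:

```python
# Sketch: spread objects across hash-derived shard prefixes so S3 can
# partition request load. Helper name and shard count are hypothetical.
import hashlib

def sharded_key(base_prefix: str, object_name: str, shards: int = 64) -> str:
    """Prepend a stable hash-derived shard segment to the key."""
    digest = hashlib.md5(object_name.encode()).hexdigest()
    shard = int(digest, 16) % shards
    return f"{base_prefix}/shard-{shard:02d}/{object_name}"

print(sharded_key("training-sets/churn/v3", "part-00017.parquet"))
```

Hashing keeps the mapping deterministic, so readers can recompute an object's full key from its name alone without a lookup table.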

Ready to Configure VPC?

The next lesson covers VPC design for ML workloads on AWS.
