EC2 for Machine Learning (Intermediate)

EC2 provides the compute foundation for self-managed ML workloads on AWS. This lesson covers how to select, configure, and optimize EC2 GPU instances for training and inference, including EFA networking for distributed training and custom AMI preparation.

GPU Instance Selection

| Instance      | GPUs           | Use case             | EFA             | On-Demand $/hr |
|---------------|----------------|----------------------|-----------------|----------------|
| p5.48xlarge   | 8x H100 80GB   | LLM training         | Yes (3200 Gbps) | ~$98           |
| p4d.24xlarge  | 8x A100 40GB   | General training     | Yes (400 Gbps)  | ~$32           |
| g5.xlarge     | 1x A10G 24GB   | Inference, fine-tuning | No            | ~$1.00         |
| trn1.32xlarge | 16x Trainium   | Training (Neuron)    | Yes (800 Gbps)  | ~$22           |
| inf2.xlarge   | 1x Inferentia2 | Inference (Neuron)   | No              | ~$0.76         |
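To make the cost column concrete, a quick shell sketch comparing two clusters for a hypothetical run (node count, duration, and rates are illustrative assumptions; check current pricing for your region):

```shell
# Rough on-demand cost comparison for a hypothetical 4-node, 72-hour run.
# Rates are the approximate $/hr figures from the table above (assumptions).
nodes=4
hours=72
p4d_rate=32   # p4d.24xlarge, ~$32/hr
p5_rate=98    # p5.48xlarge, ~$98/hr

p4d_cost=$((nodes * hours * p4d_rate))
p5_cost=$((nodes * hours * p5_rate))
echo "p4d cluster: \$${p4d_cost}, p5 cluster: \$${p5_cost}"
```

The p5 cluster costs roughly 3x more per hour, but H100s often finish the same job in less wall-clock time, so compare cost per training step rather than cost per hour.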

EFA Configuration for Distributed Training

Elastic Fabric Adapter (EFA) provides high-bandwidth, low-latency networking required for multi-node training:

AWS CLI
# EFA requires a cluster placement group for low-latency locality
aws ec2 create-placement-group \
  --group-name ml-training-cluster \
  --strategy cluster

# Launch a p5 instance with EFA enabled (sg-xxx and subnet-xxx are placeholders).
# Note: full 3200 Gbps on p5.48xlarge requires attaching all 32 EFA interfaces;
# a single interface is shown here for brevity.
aws ec2 run-instances \
  --instance-type p5.48xlarge \
  --image-id ami-0123456789abcdef0 \
  --network-interfaces "DeviceIndex=0,Groups=sg-xxx,SubnetId=subnet-xxx,InterfaceType=efa" \
  --placement "GroupName=ml-training-cluster" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=training-node-1}]'
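Once the node is up, confirm the EFA device is visible and point NCCL at it. A minimal sketch, assuming the Deep Learning AMI (which ships libfabric and the aws-ofi-nccl plugin); the exact environment values depend on your software stack:

```shell
# Confirm the EFA provider is registered with libfabric (run on the instance).
fi_info -p efa -t FI_EP_RDM || echo "EFA provider not found"

# Environment commonly set for NCCL over EFA (typical values, not mandatory):
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1   # enable GPUDirect RDMA on p4d/p5
export NCCL_DEBUG=INFO            # log which transport NCCL selects
```

With `NCCL_DEBUG=INFO`, the first collective in your training job will log whether NCCL selected the EFA transport or silently fell back to TCP.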

ML AMI Configuration

  • AWS Deep Learning AMI — Pre-configured with CUDA, cuDNN, PyTorch, TensorFlow, and NCCL
  • AWS Deep Learning Containers — Docker images optimized for SageMaker and EKS
  • Custom AMI — Build your own with Packer for reproducible, version-controlled environments
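The custom-AMI route can be as simple as snapshotting a configured builder instance. A sketch with the AWS CLI (the instance ID and naming scheme are placeholders; a real pipeline would drive this from Packer for reproducibility):

```shell
# Bake a dated AMI from a pre-configured builder instance (placeholder ID).
instance_id="i-0abc1234example"
ami_name="ml-base-$(date +%Y%m%d)"

# Requires AWS credentials; guarded so the sketch is safe to run anywhere.
if command -v aws >/dev/null 2>&1; then
  aws ec2 create-image \
    --instance-id "$instance_id" \
    --name "$ami_name" \
    --description "CUDA + PyTorch base image" \
    || echo "create-image failed (check credentials/instance ID)"
fi
echo "AMI name: $ami_name"
```

Dating the AMI name gives you an audit trail: a training job can pin an exact image rather than "latest", which matters when reproducing results months later.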

Capacity Tip: GPU instances are frequently capacity-constrained. Use Capacity Reservations for predictable workloads and request quota increases well in advance of large training runs.
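Capacity Reservations can be created from the CLI as well. A hedged sketch (the zone, count, and instance type are illustrative assumptions):

```shell
# Reserve capacity ahead of a large run (values are illustrative).
itype="p4d.24xlarge"
if command -v aws >/dev/null 2>&1; then
  aws ec2 create-capacity-reservation \
    --instance-type "$itype" \
    --instance-platform Linux/UNIX \
    --availability-zone us-east-1a \
    --instance-count 2 \
    --end-date-type unlimited \
    || echo "reservation request failed (quota or capacity)"
fi
```

You pay for reserved capacity whether or not instances are running, so scope reservations to the window of the training run.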

Ready to Build an S3 Data Lake?

The next lesson covers S3 architecture for ML training data management.
