EC2 for Machine Learning (Intermediate)

EC2 provides the compute foundation for self-managed ML workloads on AWS. This lesson covers how to select, configure, and optimize EC2 GPU instances for training and inference, including EFA networking for distributed training and custom AMI preparation.

GPU Instance Selection

| Instance      | GPUs           | Use case             | EFA             | On-Demand $/hr |
|---------------|----------------|----------------------|-----------------|----------------|
| p5.48xlarge   | 8x H100 80GB   | LLM training         | Yes (3200 Gbps) | ~$98           |
| p4d.24xlarge  | 8x A100 40GB   | General training     | Yes (400 Gbps)  | ~$32           |
| g5.xlarge     | 1x A10G 24GB   | Inference, fine-tuning | No            | ~$1.00         |
| trn1.32xlarge | 16x Trainium   | Training (Neuron)    | Yes (800 Gbps)  | ~$22           |
| inf2.xlarge   | 1x Inferentia2 | Inference (Neuron)   | No              | ~$0.76         |
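To make the cost column concrete, a quick shell sketch comparing two clusters for a hypothetical run (node count, duration, and rates are illustrative assumptions; check current pricing for your region):

```shell
# Rough on-demand cost comparison for a hypothetical 4-node, 72-hour run.
# Rates are the approximate $/hr figures from the table above (assumptions).
nodes=4
hours=72
p4d_rate=32   # p4d.24xlarge, ~$32/hr
p5_rate=98    # p5.48xlarge, ~$98/hr

p4d_cost=$((nodes * hours * p4d_rate))
p5_cost=$((nodes * hours * p5_rate))
echo "p4d cluster: \$${p4d_cost}, p5 cluster: \$${p5_cost}"
```

The p5 cluster costs roughly 3x more per hour, but H100s often finish the same job in less wall-clock time, so compare cost per training step rather than cost per hour.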

EFA Configuration for Distributed Training

Elastic Fabric Adapter (EFA) provides high-bandwidth, low-latency networking required for multi-node training:

AWS CLI
# EFA requires a cluster placement group for low-latency locality
aws ec2 create-placement-group \
  --group-name ml-training-cluster \
  --strategy cluster

# Launch a p5 instance with EFA enabled (sg-xxx and subnet-xxx are placeholders).
# Note: full 3200 Gbps on p5.48xlarge requires attaching all 32 EFA interfaces;
# a single interface is shown here for brevity.
aws ec2 run-instances \
  --instance-type p5.48xlarge \
  --image-id ami-0123456789abcdef0 \
  --network-interfaces "DeviceIndex=0,Groups=sg-xxx,SubnetId=subnet-xxx,InterfaceType=efa" \
  --placement "GroupName=ml-training-cluster" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=training-node-1}]'
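Once the node is up, confirm the EFA device is visible and point NCCL at it. A minimal sketch, assuming the Deep Learning AMI (which ships libfabric and the aws-ofi-nccl plugin); the exact environment values depend on your software stack:

```shell
# Confirm the EFA provider is registered with libfabric (run on the instance).
fi_info -p efa -t FI_EP_RDM || echo "EFA provider not found"

# Environment commonly set for NCCL over EFA (typical values, not mandatory):
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1   # enable GPUDirect RDMA on p4d/p5
export NCCL_DEBUG=INFO            # log which transport NCCL selects
```

With `NCCL_DEBUG=INFO`, the first collective in your training job will log whether NCCL selected the EFA transport or silently fell back to TCP.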

ML AMI Configuration

  • AWS Deep Learning AMI — Pre-configured with CUDA, cuDNN, PyTorch, TensorFlow, and NCCL
  • AWS Deep Learning Containers — Docker images optimized for SageMaker and EKS
  • Custom AMI — Build your own with Packer for reproducible, version-controlled environments
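The custom-AMI route can be as simple as snapshotting a configured builder instance. A sketch with the AWS CLI (the instance ID and naming scheme are placeholders; a real pipeline would drive this from Packer for reproducibility):

```shell
# Bake a dated AMI from a pre-configured builder instance (placeholder ID).
instance_id="i-0abc1234example"
ami_name="ml-base-$(date +%Y%m%d)"

# Requires AWS credentials; guarded so the sketch is safe to run anywhere.
if command -v aws >/dev/null 2>&1; then
  aws ec2 create-image \
    --instance-id "$instance_id" \
    --name "$ami_name" \
    --description "CUDA + PyTorch base image" \
    || echo "create-image failed (check credentials/instance ID)"
fi
echo "AMI name: $ami_name"
```

Dating the AMI name gives you an audit trail: a training job can pin an exact image rather than "latest", which matters when reproducing results months later.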

Capacity Tip: GPU instances are frequently capacity-constrained. Use Capacity Reservations for predictable workloads and request quota increases well in advance of large training runs.
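Capacity Reservations can be created from the CLI as well. A hedged sketch (the zone, count, and instance type are illustrative assumptions):

```shell
# Reserve capacity ahead of a large run (values are illustrative).
itype="p4d.24xlarge"
if command -v aws >/dev/null 2>&1; then
  aws ec2 create-capacity-reservation \
    --instance-type "$itype" \
    --instance-platform Linux/UNIX \
    --availability-zone us-east-1a \
    --instance-count 2 \
    --end-date-type unlimited \
    || echo "reservation request failed (quota or capacity)"
fi
```

You pay for reserved capacity whether or not instances are running, so scope reservations to the window of the training run.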

Ready to Build an S3 Data Lake?

The next lesson covers S3 architecture for ML training data management.
