EC2 for Machine Learning (Intermediate)
EC2 provides the compute foundation for self-managed ML workloads on AWS. This lesson covers how to select, configure, and optimize EC2 GPU instances for training and inference, including EFA networking for distributed training and custom AMI preparation.
GPU Instance Selection
| Instance Type | Accelerators | Use Case | EFA | On-Demand $/hr |
|---|---|---|---|---|
| p5.48xlarge | 8x H100 80GB | LLM training | Yes (3200 Gbps) | ~$98 |
| p4d.24xlarge | 8x A100 40GB | General training | Yes (400 Gbps) | ~$32 |
| g5.xlarge | 1x A10G 24GB | Inference, fine-tuning | No | ~$1.00 |
| trn1.32xlarge | 16x Trainium | Training (Neuron) | Yes (800 Gbps) | ~$22 |
| inf2.xlarge | 1x Inferentia2 | Inference (Neuron) | No | ~$0.76 |
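When comparing families, dividing the on-demand rate by the accelerator count gives a rough per-device hourly cost. A small sketch using the approximate rates from the table above (not live pricing):

```shell
# Rough per-accelerator hourly cost, using the table's approximate on-demand rates.
per_gpu() {
  # $1 = instance $/hr, $2 = accelerator count
  awk -v price="$1" -v n="$2" 'BEGIN { printf "%.2f\n", price / n }'
}

per_gpu 98 8    # p5.48xlarge   -> 12.25 per H100
per_gpu 32 8    # p4d.24xlarge  -> 4.00 per A100
per_gpu 22 16   # trn1.32xlarge -> ~1.38 per Trainium
```

Per-device cost alone doesn't decide the choice: interconnect bandwidth, memory per device, and software stack (CUDA vs. Neuron) usually dominate for large jobs.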
EFA Configuration for Distributed Training
Elastic Fabric Adapter (EFA) provides high-bandwidth, low-latency networking required for multi-node training:
AWS CLI

```shell
# Launch a p5 instance with EFA enabled
aws ec2 run-instances \
  --instance-type p5.48xlarge \
  --image-id ami-0123456789abcdef0 \
  --network-interfaces "DeviceIndex=0,Groups=[sg-xxx],SubnetId=subnet-xxx,InterfaceType=efa" \
  --placement "GroupName=ml-training-cluster" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=training-node-1}]'
```
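Once a node is up, it's worth verifying that the EFA device is visible and that NCCL is pointed at the libfabric EFA provider. A minimal check, assuming the EFA software was installed via the aws-efa-installer package:

```shell
# Confirm the EFA provider is registered with libfabric
fi_info -p efa -t FI_EP_RDM

# Point NCCL's libfabric plugin at EFA for multi-node collectives
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1   # GPUDirect RDMA on p4d/p5
export NCCL_DEBUG=INFO            # log which transport NCCL actually selects
```

With `NCCL_DEBUG=INFO`, the training job's logs should show NCCL selecting the OFI/EFA transport rather than falling back to TCP sockets.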
ML AMI Configuration
- AWS Deep Learning AMI — Pre-configured with CUDA, cuDNN, PyTorch, TensorFlow, and NCCL
- AWS Deep Learning Containers — Docker images optimized for SageMaker and EKS
- Custom AMI — Build your own with Packer for reproducible, version-controlled environments
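Whether built with Packer or configured by hand, a tuned instance can be snapshotted into a reusable AMI with the CLI. A minimal sketch (the instance ID is a placeholder):

```shell
# Bake a reusable AMI from a configured instance (i-0123... is a placeholder)
aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "ml-training-base-$(date +%Y%m%d)" \
  --description "DLAMI base with pinned CUDA/NCCL/PyTorch versions" \
  --tag-specifications 'ResourceType=image,Tags=[{Key=Project,Value=ml-training}]'
```

Baking versions into the AMI keeps every node in a distributed job on identical driver and library versions, which avoids hard-to-debug NCCL mismatches.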
Capacity Tip: GPU instances are frequently capacity-constrained. Use Capacity Reservations for predictable workloads and request quota increases well in advance of large training runs.
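A Capacity Reservation can be created ahead of time with the CLI; a sketch with illustrative zone and count values:

```shell
# Reserve GPU capacity ahead of a training run (zone and count are illustrative)
aws ec2 create-capacity-reservation \
  --instance-type p4d.24xlarge \
  --instance-platform Linux/UNIX \
  --availability-zone us-east-1a \
  --instance-count 2 \
  --end-date-type unlimited
```

Billing starts when the reservation is active whether or not instances are running in it, so scope reservations to the actual training window.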
Ready to Build an S3 Data Lake?
The next lesson covers S3 architecture for ML training data management.
Next: S3 Data Lake →
Lilly Tech Systems