VPC Setup for ML on AWS (Intermediate)
Proper VPC configuration is essential for secure, high-performance ML workloads. This lesson covers VPC design patterns, subnet layouts, security group rules, and VPC endpoint configuration specifically for AI/ML infrastructure on AWS.
ML VPC Architecture
VPC Layout
VPC: 10.0.0.0/16 (ml-workloads-vpc)

Private Subnets (Training):
- 10.0.1.0/24 - us-east-1a (GPU training nodes)
- 10.0.2.0/24 - us-east-1b (GPU training nodes)

Private Subnets (Inference):
- 10.0.10.0/24 - us-east-1a (Model serving)
- 10.0.11.0/24 - us-east-1b (Model serving)

Private Subnets (Data):
- 10.0.20.0/24 - us-east-1a (Data processing)
- 10.0.21.0/24 - us-east-1b (Data processing)

VPC Endpoints: S3 Gateway, ECR, CloudWatch Logs, STS, SageMaker API
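The layout above can be sanity-checked before any resources are created. The following is a minimal sketch using only Python's standard `ipaddress` module. The subnet names (`training-1a` and so on) and the helper functions are hypothetical labels chosen for illustration. The count of five reserved addresses per subnet reflects AWS's documented behavior of reserving the first four addresses and the last address in every subnet.

```python
import ipaddress

# CIDR blocks taken from the layout above; the dictionary keys are
# hypothetical names used only in this sketch.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "training-1a": "10.0.1.0/24",
    "training-1b": "10.0.2.0/24",
    "inference-1a": "10.0.10.0/24",
    "inference-1b": "10.0.11.0/24",
    "data-1a": "10.0.20.0/24",
    "data-1b": "10.0.21.0/24",
}

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def check_layout(vpc, subnets):
    """Return a list of problems found; an empty list means the layout is consistent."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    # Every subnet must sit inside the VPC range.
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    # No two subnets may overlap.
    names = sorted(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


def usable_ips(cidr):
    """Number of addresses left for instances after the AWS-reserved ones."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    print(check_layout(VPC_CIDR, SUBNETS))
    print(usable_ips("10.0.1.0/24"))
```

Each /24 leaves 251 assignable addresses. That is worth checking against the expected number of training nodes and interface endpoints per subnet, since each interface endpoint also consumes an address in every subnet it is attached to.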
Security Group Rules for ML
| Security Group | Inbound | Outbound |
|---|---|---|
| Training Nodes | All traffic from same SG (NCCL) | All traffic to same SG (NCCL), S3 endpoint, ECR endpoint |
| Inference Endpoints | HTTPS from ALB SG | S3 endpoint, CloudWatch |
| Notebooks | HTTPS from corporate CIDR | S3 and ECR endpoints; PyPI through a private mirror or proxy (AWS has no VPC endpoint for public PyPI) |
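As a rough illustration, the inbound column of the table can be written as rule dictionaries in the `IpPermissions` shape that the EC2 API accepts. This is a sketch under stated assumptions: the security group IDs and the corporate CIDR (`203.0.113.0/24`, a documentation-only range) are placeholders, and the helper names are hypothetical. The code only builds and validates data and does not call AWS.

```python
import ipaddress

# Placeholder identifiers, assumed for illustration only.
TRAINING_SG = "sg-0training0example"
ALB_SG = "sg-0alb0000000example"
CORPORATE_CIDR = "203.0.113.0/24"


def allow_all_from_group(group_id):
    """All protocols and ports from members of a security group (used for NCCL)."""
    return {"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": group_id}]}


def allow_https_from_group(group_id):
    """TCP 443 from members of another security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": group_id}],
    }


def allow_https_from_cidr(cidr):
    """TCP 443 from an IPv4 range; raises ValueError if the CIDR is malformed."""
    ipaddress.ip_network(cidr)
    return {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": cidr}],
    }


# Inbound rules per security group, mirroring the table above.
INBOUND_RULES = {
    "training-nodes": [allow_all_from_group(TRAINING_SG)],
    "inference-endpoints": [allow_https_from_group(ALB_SG)],
    "notebooks": [allow_https_from_cidr(CORPORATE_CIDR)],
}
```

With boto3, a list such as `INBOUND_RULES["notebooks"]` could be passed as the `IpPermissions` argument of `authorize_security_group_ingress`, alongside the real `GroupId` of the target security group. The training rule references its own group, so nodes can reach each other on any port without opening the subnet to other workloads.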
Essential VPC Endpoints
- S3 Gateway Endpoint — Free, enables private S3 access without NAT Gateway costs
- ECR Endpoints — Private access to container images (ecr.api and ecr.dkr)
- CloudWatch Logs — Send training logs without internet access
- STS — IAM role assumption for service-to-service authentication
- SageMaker API/Runtime — Private access to SageMaker services
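The endpoint list above maps onto AWS service names of the form `com.amazonaws.<region>.<service>`. The sketch below builds one request dictionary per endpoint, using the parameter names of the EC2 `CreateVpcEndpoint` API. The function name and all resource IDs in the usage example are hypothetical, and enabling private DNS on the interface endpoints is an assumption of this sketch rather than something the lesson specifies.

```python
# Interface endpoint service suffixes for the services listed above.
INTERFACE_SERVICES = [
    "ecr.api",
    "ecr.dkr",
    "logs",               # CloudWatch Logs
    "sts",
    "sagemaker.api",
    "sagemaker.runtime",
]


def endpoint_requests(region, vpc_id, route_table_ids, subnet_ids, security_group_ids):
    """Build CreateVpcEndpoint-style parameter dicts; does not call AWS."""
    # S3 uses a Gateway endpoint, which attaches to route tables.
    requests = [
        {
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.s3",
            "VpcEndpointType": "Gateway",
            "RouteTableIds": list(route_table_ids),
        }
    ]
    # The remaining services use Interface endpoints, which place network
    # interfaces in subnets and are protected by security groups.
    for service in INTERFACE_SERVICES:
        requests.append(
            {
                "VpcId": vpc_id,
                "ServiceName": f"com.amazonaws.{region}.{service}",
                "VpcEndpointType": "Interface",
                "SubnetIds": list(subnet_ids),
                "SecurityGroupIds": list(security_group_ids),
                "PrivateDnsEnabled": True,  # assumption of this sketch
            }
        )
    return requests


if __name__ == "__main__":
    # All IDs below are placeholders.
    for request in endpoint_requests(
        "us-east-1", "vpc-0example", ["rtb-0example"],
        ["subnet-0example-a", "subnet-0example-b"], ["sg-0endpoints-example"],
    ):
        print(request["VpcEndpointType"], request["ServiceName"])
```

Each dictionary could be unpacked into boto3's `create_vpc_endpoint` call. An interface endpoint accepts at most one subnet per Availability Zone, so the `subnet_ids` argument would typically hold one subnet from us-east-1a and one from us-east-1b.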
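Cost Saving: A NAT Gateway bills a per-GB data processing charge on everything that passes through it, while an S3 Gateway Endpoint has no hourly or per-GB charge. For training jobs that read tens of terabytes from S3 each month, routing that traffic through the gateway endpoint can save thousands of dollars per month.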
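A back-of-the-envelope estimate makes the scale of the saving concrete. The sketch below assumes a NAT Gateway data processing rate of $0.045 per GB, which is the commonly quoted us-east-1 price; check current AWS pricing for your region before relying on it. The function name and the use of decimal terabytes are choices made for this illustration.

```python
# Assumed NAT Gateway data processing rate in USD per GB (verify against
# current AWS pricing for your region).
ASSUMED_NAT_RATE_PER_GB = 0.045
GB_PER_TB = 1000  # decimal terabytes, for simplicity


def monthly_nat_processing_cost(tb_per_month, rate_per_gb=ASSUMED_NAT_RATE_PER_GB):
    """Estimated NAT data processing charge for S3 traffic, in USD per month.

    The same traffic through an S3 Gateway Endpoint incurs no such charge,
    so this figure is also the estimated monthly saving.
    """
    if tb_per_month < 0:
        raise ValueError("tb_per_month must be non-negative")
    return tb_per_month * GB_PER_TB * rate_per_gb


if __name__ == "__main__":
    for tb in (10, 50, 100):
        print(f"{tb} TB/month: about ${monthly_nat_processing_cost(tb):,.0f}")
```

Under that assumed rate, 100 TB of training data read per month works out to roughly $4,500 in NAT processing charges, before counting the NAT Gateway's hourly fee.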
Ready to Configure IAM?
The next lesson covers IAM roles, policies, and security for ML workloads.
Lilly Tech Systems