VPC Setup for ML on AWS (Intermediate)

Proper VPC configuration is essential for secure, high-performance ML workloads. This lesson covers VPC design patterns, subnet layouts, security group rules, and VPC endpoint configuration specifically for AI/ML infrastructure on AWS.

ML VPC Architecture

VPC Layout
VPC: 10.0.0.0/16 (ml-workloads-vpc)

Private Subnets (Training):
  10.0.1.0/24  - us-east-1a  (GPU training nodes)
  10.0.2.0/24  - us-east-1b  (GPU training nodes)

Private Subnets (Inference):
  10.0.10.0/24 - us-east-1a  (Model serving)
  10.0.11.0/24 - us-east-1b  (Model serving)

Private Subnets (Data):
  10.0.20.0/24 - us-east-1a  (Data processing)
  10.0.21.0/24 - us-east-1b  (Data processing)

VPC Endpoints:
  S3 Gateway, ECR (ecr.api and ecr.dkr), CloudWatch Logs, STS, SageMaker API/Runtime
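
A layout like the one above can be sanity-checked before any resources are created. The following is a minimal sketch using only Python's standard `ipaddress` module; the subnet names (`training-a`, `inference-b`, and so on) are hypothetical labels chosen for illustration. It checks that every subnet sits inside the VPC CIDR, that no two subnets overlap, and reports usable addresses per subnet, given that AWS reserves five addresses in every subnet.

```python
import ipaddress
from itertools import combinations

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Subnet CIDRs from the layout above; the keys are hypothetical labels.
SUBNETS = {
    "training-a": "10.0.1.0/24",
    "training-b": "10.0.2.0/24",
    "inference-a": "10.0.10.0/24",
    "inference-b": "10.0.11.0/24",
    "data-a": "10.0.20.0/24",
    "data-b": "10.0.21.0/24",
}

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def validate_layout(vpc, subnets):
    """Return usable address counts per subnet, or raise ValueError."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}

    # Every subnet must be contained in the VPC CIDR.
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            raise ValueError(f"{name} ({net}) is outside {vpc}")

    # No two subnets may share addresses.
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            raise ValueError(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")

    return {
        name: net.num_addresses - AWS_RESERVED_PER_SUBNET
        for name, net in nets.items()
    }


usable = validate_layout(VPC_CIDR, SUBNETS)
for name, count in usable.items():
    print(f"{name}: {count} usable addresses")
```

Each /24 subnet leaves 251 usable addresses, which is worth keeping in mind when GPU instances attach several network interfaces each.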

Security Group Rules for ML

Security Group       Inbound                          Outbound
Training Nodes       All traffic from same SG (NCCL)  All traffic to same SG (NCCL); S3 endpoint, ECR endpoint
Inference Endpoints  HTTPS from ALB SG                S3 endpoint, CloudWatch
Notebooks            HTTPS from corporate CIDR        S3 and ECR endpoints; PyPI via a private mirror or NAT (PyPI has no VPC endpoint)
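
The inbound rules in this table can be expressed as plain data. The sketch below builds dictionaries in the shape that boto3's `authorize_security_group_ingress` accepts for its `GroupId` and `IpPermissions` parameters, without calling AWS. All security group IDs and the corporate CIDR are hypothetical placeholders, not values from this lesson.

```python
# Hypothetical identifiers, for illustration only.
TRAINING_SG = "sg-0aaaaaaaaaaaaaaaa"
INFERENCE_SG = "sg-0bbbbbbbbbbbbbbbb"
NOTEBOOK_SG = "sg-0cccccccccccccccc"
ALB_SG = "sg-0dddddddddddddddd"
CORPORATE_CIDR = "203.0.113.0/24"  # documentation range, stands in for a real CIDR


def all_traffic_from_group(sg_id):
    """All protocols and ports from members of the given security group."""
    return {"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": sg_id}]}


def https_from_group(sg_id):
    """TCP 443 from members of the given security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }


def https_from_cidr(cidr):
    """TCP 443 from an IPv4 CIDR range."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": cidr}],
    }


# One request per row of the table (inbound column only).
ingress_requests = {
    # Self-referencing rule: training nodes accept all traffic from each other.
    "training": {
        "GroupId": TRAINING_SG,
        "IpPermissions": [all_traffic_from_group(TRAINING_SG)],
    },
    "inference": {
        "GroupId": INFERENCE_SG,
        "IpPermissions": [https_from_group(ALB_SG)],
    },
    "notebooks": {
        "GroupId": NOTEBOOK_SG,
        "IpPermissions": [https_from_cidr(CORPORATE_CIDR)],
    },
}

for name, request in ingress_requests.items():
    print(name, request)
```

With a configured boto3 EC2 client, each request could be applied as `ec2.authorize_security_group_ingress(**request)`. The training rule references its own group ID, which is what lets collective-communication traffic flow between nodes without opening ports to the rest of the VPC.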

Essential VPC Endpoints

  • S3 Gateway Endpoint — Free, enables private S3 access without NAT Gateway costs
  • ECR Endpoints — Private access to container images (ecr.api and ecr.dkr)
  • CloudWatch Logs — Send training logs without internet access
  • STS — IAM role assumption for service-to-service authentication
  • SageMaker API/Runtime — Private access to SageMaker services
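
As a rough illustration of how this list maps to configuration, the sketch below builds one request per endpoint in the shape of boto3's `create_vpc_endpoint` parameters (`VpcId`, `ServiceName`, `VpcEndpointType`, and so on). It only constructs dictionaries and does not call AWS. The VPC, route table, subnet, and security group IDs are hypothetical placeholders.

```python
REGION = "us-east-1"

# S3 uses a gateway endpoint, which is attached to route tables.
GATEWAY_SERVICES = ["s3"]

# The remaining services use interface endpoints (ENIs in your subnets).
INTERFACE_SERVICES = [
    "ecr.api",
    "ecr.dkr",
    "logs",  # CloudWatch Logs
    "sts",
    "sagemaker.api",
    "sagemaker.runtime",
]


def endpoint_requests(vpc_id, route_table_ids, subnet_ids, security_group_ids,
                      region=REGION):
    """Build one create_vpc_endpoint-style parameter dict per service."""
    requests = []
    for service in GATEWAY_SERVICES:
        requests.append({
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            "VpcEndpointType": "Gateway",
            "RouteTableIds": list(route_table_ids),
        })
    for service in INTERFACE_SERVICES:
        requests.append({
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            "VpcEndpointType": "Interface",
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
            # Lets the default service hostnames resolve to the endpoint.
            "PrivateDnsEnabled": True,
        })
    return requests


# Hypothetical IDs, for illustration only.
requests = endpoint_requests(
    vpc_id="vpc-0123456789abcdef0",
    route_table_ids=["rtb-0123456789abcdef0"],
    subnet_ids=["subnet-0aaaaaaaaaaaaaaaa", "subnet-0bbbbbbbbbbbbbbbb"],
    security_group_ids=["sg-0eeeeeeeeeeeeeeee"],
)
for request in requests:
    print(request["VpcEndpointType"], request["ServiceName"])
```

Note that ECR stores image layers in S3, so private image pulls need the S3 gateway endpoint in addition to the two ECR interface endpoints.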
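
Cost Saving: A NAT Gateway bills a per-GB data processing charge on all traffic that passes through it, including S3 reads, while an S3 Gateway Endpoint has no hourly or per-GB charge. Routing training data through the gateway endpoint instead can save thousands of dollars per month for workloads that read many terabytes from S3.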
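
To make the scale of the saving concrete, here is a back-of-the-envelope sketch. The per-GB rate is an assumption (a commonly quoted NAT Gateway data processing price for us-east-1) and the monthly volumes are hypothetical; check current AWS pricing before relying on the numbers. Only the data processing fee is modeled, not the NAT Gateway's hourly charge.

```python
# ASSUMPTION: NAT Gateway data processing rate in USD per GB.
# Verify against current AWS pricing for your region.
NAT_DATA_PROCESSING_USD_PER_GB = 0.045

GB_PER_TB = 1024


def monthly_nat_processing_cost(tb_per_month,
                                usd_per_gb=NAT_DATA_PROCESSING_USD_PER_GB):
    """Data processing fee for pulling tb_per_month through a NAT Gateway."""
    return tb_per_month * GB_PER_TB * usd_per_gb


# Hypothetical monthly S3 read volumes for training jobs.
for tb in (10, 50, 100):
    cost = monthly_nat_processing_cost(tb)
    print(f"{tb:>3} TB/month via NAT: ${cost:,.2f} (via S3 gateway endpoint: $0.00)")
```

Under these assumptions the fee grows linearly with volume, so the saving only reaches thousands of dollars once training jobs read tens of terabytes per month.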

Ready to Configure IAM?

The next lesson covers IAM roles, policies, and security for ML workloads.