# Deploying on Inferentia
Learn production deployment patterns for Inferentia-based inference, including SageMaker endpoints, containerized deployments on ECS/EKS, and direct EC2 setups.
## Deployment Options

### SageMaker Endpoints
Managed inference with auto-scaling, A/B testing, and model monitoring. The easiest path to production with minimal infrastructure management.

### ECS / Fargate
Containerized inference using Amazon ECS with Neuron-optimized containers. A good fit for microservices architectures.

### EKS (Kubernetes)
Kubernetes-native deployment with the Neuron device plugin. Ideal for teams already running K8s workloads.

### EC2 Direct
Full control over the instance with a custom serving framework such as TorchServe, Triton, or FastAPI.
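For the EC2 Direct option, serving can be as simple as a small HTTP loop in front of the compiled model. A minimal stdlib-only sketch, where the `predict` stub stands in for a call into a real Neuron-compiled model (e.g. a `torch_neuronx`-traced module):

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def predict(payload):
    # Stand-in for the real Neuron model call; echoes the input back
    return {"outputs": f"echo: {payload.get('inputs', '')}"}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON request body
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(predict(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    # Blocking serve loop; in production, run behind a process manager
    ThreadingHTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

In practice a production setup would add batching, health checks, and graceful shutdown, which is what frameworks like TorchServe or Triton provide out of the box.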
## SageMaker Deployment
```python
import sagemaker
from sagemaker.pytorch import PyTorchModel

# Create a SageMaker model from the pre-compiled Neuron artifact
pytorch_model = PyTorchModel(
    model_data="s3://bucket/model_neuron.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="2.1",
    py_version="py310",
    entry_point="inference.py",
)

# Deploy to an Inf2 endpoint
predictor = pytorch_model.deploy(
    instance_type="ml.inf2.xlarge",
    initial_instance_count=1,
    endpoint_name="my-inferentia-endpoint",
)

# Run inference
result = predictor.predict({"inputs": "Hello world"})
```
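The `inference.py` entry point implements the handler hooks that the SageMaker PyTorch serving stack looks for (`model_fn`, `input_fn`, `predict_fn`, `output_fn`). A framework-free sketch of their shape; in a real handler, `model_fn` would load the Neuron-compiled artifact, e.g. via `torch.jit.load`:

```python
import json
import os

# Hooks invoked by the SageMaker PyTorch inference toolkit. The hook names
# are the toolkit's contract; the bodies here are simplified stand-ins.


def model_fn(model_dir):
    # Real version: return torch.jit.load(os.path.join(model_dir, "model_neuron.pt"))
    path = os.path.join(model_dir, "model_neuron.pt")
    return lambda text: {"outputs": f"model({text}) from {os.path.basename(path)}"}


def input_fn(request_body, content_type="application/json"):
    # Deserialize the request payload
    assert content_type == "application/json"
    return json.loads(request_body)["inputs"]


def predict_fn(inputs, model):
    # Run the (stubbed) model
    return model(inputs)


def output_fn(prediction, accept="application/json"):
    # Serialize the response
    return json.dumps(prediction)
```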
## EKS Deployment with Neuron Plugin
Deploy on Kubernetes using the AWS Neuron device plugin:
```bash
# Install the Neuron device plugin DaemonSet
kubectl apply -f https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml
```
```yaml
# Pod spec requesting Neuron devices
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  containers:
    - name: inference
      image: your-neuron-container:latest
      resources:
        limits:
          aws.amazon.com/neuron: 1  # Request one Neuron device
        requests:
          cpu: "4"
          memory: "8Gi"
```
## Auto-Scaling Strategies
| Strategy | Metric | Best For |
|---|---|---|
| Target tracking | InvocationsPerInstance | Steady traffic patterns |
| Step scaling | NeuronCore utilization | Variable traffic with bursts |
| Scheduled scaling | Time-based | Predictable daily/weekly patterns |
| KEDA (EKS) | Queue depth, custom metrics | Event-driven inference pipelines |
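To make the target-tracking row concrete: the autoscaler sizes the fleet so that the average per-instance metric returns to its target. A sketch of the arithmetic (the function name is illustrative, not an AWS API):

```python
import math


def desired_instance_count(current_instances, invocations_per_instance, target):
    """Target tracking: scale so the per-instance metric returns to the target."""
    if current_instances == 0:
        return 1
    desired = current_instances * invocations_per_instance / target
    # Round up and never scale below one instance
    return max(1, math.ceil(desired))
```

For example, 4 instances each seeing 150 invocations against a target of 100 scale out to `ceil(4 * 150 / 100) = 6` instances.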
**Pro tip:** Always pre-compile your model and include the compiled artifact in your deployment package; runtime compilation adds minutes to cold start times. For SageMaker, compile the model during the CI/CD pipeline and store compiled models in S3.
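A sketch of the packaging step such a pipeline might run, assuming the compiled artifact is named `model_neuron.pt` and following the SageMaker PyTorch container convention of placing `inference.py` under `code/` inside the archive:

```python
import tarfile
from pathlib import Path


def package_model(model_dir, out_path="model_neuron.tar.gz"):
    """Bundle the pre-compiled Neuron model and inference script into the
    tar.gz layout SageMaker expects, ready to upload to S3."""
    model_dir = Path(model_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(model_dir / "model_neuron.pt", arcname="model_neuron.pt")
        tar.add(model_dir / "inference.py", arcname="code/inference.py")
    return out_path
```

The resulting archive is what `model_data` points at in the deployment example above; the upload itself would be a separate `aws s3 cp` or SDK call in the pipeline.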