# Deploying on Inferentia
Learn production deployment patterns for Inferentia-based inference, including SageMaker endpoints, containerized deployments on ECS/EKS, and direct EC2 setups.
## Deployment Options

### SageMaker Endpoints
Managed inference with auto-scaling, A/B testing, and model monitoring. The easiest path to production with minimal infrastructure management.

### ECS / Fargate
Containerized inference using Amazon ECS with Neuron-optimized containers. A good fit for microservices architectures.

### EKS (Kubernetes)
Kubernetes-native deployment with the Neuron device plugin. Ideal for teams already running K8s workloads.

### EC2 Direct
Full control over the instance with a custom serving framework such as TorchServe, Triton, or FastAPI.
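For the EC2 Direct option, serving can be as simple as a small HTTP loop in front of the compiled model. A minimal stdlib-only sketch, where the `predict` stub stands in for a call into a real Neuron-compiled model (e.g. a `torch_neuronx`-traced module):

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def predict(payload):
    # Stand-in for the real Neuron model call; echoes the input back
    return {"outputs": f"echo: {payload.get('inputs', '')}"}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON request body
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(predict(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    # Blocking serve loop; in production, run behind a process manager
    ThreadingHTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

In practice a production setup would add batching, health checks, and graceful shutdown, which is what frameworks like TorchServe or Triton provide out of the box.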
## SageMaker Deployment
```python
import sagemaker
from sagemaker.pytorch import PyTorchModel

# Create a SageMaker model from the pre-compiled Neuron artifact
pytorch_model = PyTorchModel(
    model_data="s3://bucket/model_neuron.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="2.1",
    py_version="py310",
    entry_point="inference.py",
)

# Deploy to an Inf2 endpoint
predictor = pytorch_model.deploy(
    instance_type="ml.inf2.xlarge",
    initial_instance_count=1,
    endpoint_name="my-inferentia-endpoint",
)

# Run inference
result = predictor.predict({"inputs": "Hello world"})
```
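The `inference.py` entry point implements the handler hooks that the SageMaker PyTorch serving stack looks for (`model_fn`, `input_fn`, `predict_fn`, `output_fn`). A framework-free sketch of their shape; in a real handler, `model_fn` would load the Neuron-compiled artifact, e.g. via `torch.jit.load`:

```python
import json
import os

# Hooks invoked by the SageMaker PyTorch inference toolkit. The hook names
# are the toolkit's contract; the bodies here are simplified stand-ins.


def model_fn(model_dir):
    # Real version: return torch.jit.load(os.path.join(model_dir, "model_neuron.pt"))
    path = os.path.join(model_dir, "model_neuron.pt")
    return lambda text: {"outputs": f"model({text}) from {os.path.basename(path)}"}


def input_fn(request_body, content_type="application/json"):
    # Deserialize the request payload
    assert content_type == "application/json"
    return json.loads(request_body)["inputs"]


def predict_fn(inputs, model):
    # Run the (stubbed) model
    return model(inputs)


def output_fn(prediction, accept="application/json"):
    # Serialize the response
    return json.dumps(prediction)
```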
## EKS Deployment with Neuron Plugin
Deploy on Kubernetes using the AWS Neuron device plugin:
```bash
# Install the Neuron device plugin DaemonSet
kubectl apply -f https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml
```
```yaml
# Pod spec requesting Neuron devices
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  containers:
    - name: inference
      image: your-neuron-container:latest
      resources:
        limits:
          aws.amazon.com/neuron: 1  # Request one Neuron device
        requests:
          cpu: "4"
          memory: "8Gi"
```
## Auto-Scaling Strategies
| Strategy | Metric | Best For |
|---|---|---|
| Target tracking | InvocationsPerInstance | Steady traffic patterns |
| Step scaling | NeuronCore utilization | Variable traffic with bursts |
| Scheduled scaling | Time-based | Predictable daily/weekly patterns |
| KEDA (EKS) | Queue depth, custom metrics | Event-driven inference pipelines |
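To make the target-tracking row concrete: the autoscaler sizes the fleet so that the average per-instance metric returns to its target. A sketch of the arithmetic (the function name is illustrative, not an AWS API):

```python
import math


def desired_instance_count(current_instances, invocations_per_instance, target):
    """Target tracking: scale so the per-instance metric returns to the target."""
    if current_instances == 0:
        return 1
    desired = current_instances * invocations_per_instance / target
    # Round up and never scale below one instance
    return max(1, math.ceil(desired))
```

For example, 4 instances each seeing 150 invocations against a target of 100 scale out to `ceil(4 * 150 / 100) = 6` instances.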
**Pro tip:** Always pre-compile your model and include the compiled artifact in your deployment package; runtime compilation adds minutes to cold start times. For SageMaker, compile the model during the CI/CD pipeline and store compiled models in S3.
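A sketch of the packaging step such a pipeline might run, assuming the compiled artifact is named `model_neuron.pt` and following the SageMaker PyTorch container convention of placing `inference.py` under `code/` inside the archive:

```python
import tarfile
from pathlib import Path


def package_model(model_dir, out_path="model_neuron.tar.gz"):
    """Bundle the pre-compiled Neuron model and inference script into the
    tar.gz layout SageMaker expects, ready to upload to S3."""
    model_dir = Path(model_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(model_dir / "model_neuron.pt", arcname="model_neuron.pt")
        tar.add(model_dir / "inference.py", arcname="code/inference.py")
    return out_path
```

The resulting archive is what `model_data` points at in the deployment example above; the upload itself would be a separate `aws s3 cp` or SDK call in the pipeline.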