# Model Serving on EKS
Deploy and serve ML models at scale on EKS using KServe, NVIDIA Triton, and TorchServe with auto-scaling, canary deployments, and monitoring.
## Serving Framework Comparison
| Feature | KServe | Triton | TorchServe |
|---|---|---|---|
| Multi-framework | ✓ | ✓ | PyTorch only |
| Auto-scaling | Built-in (KPA/HPA) | Via K8s HPA | Via K8s HPA |
| Canary/A-B | Native | Via Istio | Via Istio |
| GPU sharing | Via Triton backend | MPS, MIG | Limited |
| Dynamic batching | Via runtime | ✓ | ✓ |
| Model management | InferenceService CR | Model repository | Management API |
## KServe Deployment
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-serving
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 5                 # concurrent requests per pod
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/llm-v2"
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
        limits:
          nvidia.com/gpu: 1
          memory: "16Gi"
  transformer:
    containers:
      - name: tokenizer
        image: tokenizer:latest
```
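An `s3://` `storageUri` requires the predictor pod to have S3 read access. On EKS this is typically wired up through IRSA (IAM Roles for Service Accounts); a minimal sketch, where the ServiceAccount name and role ARN are placeholders to adapt to your cluster:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kserve-s3-sa
  annotations:
    # Placeholder IRSA role with s3:GetObject on the model bucket
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/kserve-s3-read
---
# Then reference it from the predictor so the storage initializer
# can pull the model:
# spec:
#   predictor:
#     serviceAccountName: kserve-s3-sa
```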
## Triton Inference Server on EKS
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-inference
  template:
    metadata:
      labels:
        app: triton-inference
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args:
            - tritonserver
            - --model-repository=s3://models/triton-repo
            - --strict-model-config=false
          ports:
            - containerPort: 8000   # HTTP
            - containerPort: 8001   # gRPC
            - containerPort: 8002   # Prometheus metrics
          resources:
            limits:
              nvidia.com/gpu: 1
```
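To reach the three server ports in-cluster and let Prometheus scrape the metrics endpoint, the Deployment needs a Service in front of it. A minimal sketch, assuming the Triton pods carry an `app: triton-inference` label (an assumption here, not shown in every manifest):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-inference
  labels:
    app: triton-inference
spec:
  selector:
    app: triton-inference
  ports:
    - name: http
      port: 8000
    - name: grpc
      port: 8001
    - name: metrics      # Triton exposes Prometheus metrics at :8002/metrics
      port: 8002
```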
## Canary Deployments with KServe
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-canary
spec:
  predictor:
    canaryTrafficPercent: 10         # 10% of traffic goes to the new revision
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/v2"   # new version
      resources:
        limits:
          nvidia.com/gpu: 1
```
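Once canary metrics look healthy, promotion happens by editing the same InferenceService: raise `canaryTrafficPercent` stepwise (e.g. 10 → 50 → 100), or drop the field so the latest revision serves all traffic. A sketch of the fully promoted spec:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-canary
spec:
  predictor:
    # canaryTrafficPercent removed: the latest revision now serves 100% of traffic
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/v2"
      resources:
        limits:
          nvidia.com/gpu: 1
```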
**Pro tip:** Use KServe for its Kubernetes-native model management and auto-scaling capabilities. For maximum inference performance, run Triton as the KServe runtime backend. This gives you the best of both worlds: KServe's deployment management with Triton's optimized inference engine.
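That combination can be sketched as an InferenceService that pins Triton as its serving runtime. The runtime name `kserve-tritonserver` ships with standard KServe installs, but treat it (and the ONNX model used here) as assumptions to verify against your cluster's available ServingRuntimes:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-on-triton
spec:
  predictor:
    model:
      runtime: kserve-tritonserver   # assumes the default Triton ServingRuntime is installed
      modelFormat:
        name: onnx                   # a Triton-supported format, used here as an example
      storageUri: "s3://models/onnx-model"
      resources:
        limits:
          nvidia.com/gpu: 1
```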
Lilly Tech Systems