Model Serving on EKS

Deploy and serve ML models at scale on EKS using KServe, NVIDIA Triton, and TorchServe with auto-scaling, canary deployments, and monitoring.

Serving Framework Comparison

| Feature          | KServe                   | Triton           | TorchServe     |
|------------------|--------------------------|------------------|----------------|
| Multi-framework  | Yes (pluggable runtimes) | Yes              | PyTorch only   |
| Auto-scaling     | Built-in (KPA/HPA)       | Via K8s HPA      | Via K8s HPA    |
| Canary/A-B       | Native                   | Via Istio        | Via Istio      |
| GPU sharing      | Via Triton backend       | MPS, MIG         | Limited        |
| Dynamic batching | Via runtime              | Built-in         | Built-in       |
| Model management | InferenceService CR      | Model repository | Management API |

KServe Deployment

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-serving
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/llm-v2"
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: "16Gi"
        requests:
          cpu: "4"
          memory: "16Gi"
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 5  # concurrent requests per pod
  transformer:
    containers:
      - name: tokenizer
        image: tokenizer:latest
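
Once the InferenceService is ready, KServe routes prediction traffic through its V1 REST protocol at `/v1/models/<name>:predict`. A minimal client sketch, assuming a hypothetical external hostname and a simple text payload (adjust both to your ingress and model signature):

```python
import json

# Hypothetical hostname: KServe publishes the real URL in the
# InferenceService status (kubectl get inferenceservice llm-serving).
KSERVE_URL = "http://llm-serving.default.example.com"
MODEL_NAME = "llm-serving"

def build_predict_request(instances):
    """Build the target URL and JSON body for a KServe V1 predict call."""
    url = f"{KSERVE_URL}/v1/models/{MODEL_NAME}:predict"
    body = json.dumps({"instances": instances})
    return url, body
```

The returned URL and body can be sent with any HTTP client (e.g. `requests.post(url, data=body)`); the transformer container above would tokenize the instances before they reach the predictor.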

Triton Inference Server on EKS

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-inference
  template:
    metadata:
      labels:
        app: triton-inference
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args:
            - tritonserver
            - --model-repository=s3://models/triton-repo
            - --strict-model-config=false
          ports:
            - containerPort: 8000  # HTTP
            - containerPort: 8001  # gRPC
            - containerPort: 8002  # Metrics
          resources:
            limits:
              nvidia.com/gpu: 1
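
The Deployment above is only reachable in-cluster once a Service exposes its ports. A minimal sketch, assuming the pods carry an `app: triton-inference` label:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-inference
spec:
  selector:
    app: triton-inference  # assumed pod label
  ports:
    - name: http
      port: 8000
    - name: grpc
      port: 8001
    - name: metrics
      port: 8002
```

Readiness can then be probed in-cluster at `http://triton-inference:8000/v2/health/ready`, and port 8002 can be scraped by Prometheus for per-model metrics.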

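Triton speaks the KServe V2 inference protocol over its HTTP port. A request-builder sketch against the deployment above; the model name `llm`, the input name `INPUT__0`, and the FP32 shape are assumptions that must match your model repository's `config.pbtxt`:

```python
import json

# Assumed in-cluster Service name for the Triton Deployment.
TRITON_URL = "http://triton-inference:8000"

def build_v2_infer_request(model, data):
    """Build the URL and JSON body for a Triton V2 /infer call."""
    url = f"{TRITON_URL}/v2/models/{model}/infer"
    body = json.dumps({
        "inputs": [{
            "name": "INPUT__0",        # must match config.pbtxt
            "shape": [1, len(data)],   # batch of one
            "datatype": "FP32",
            "data": data,
        }]
    })
    return url, body
```

With `--strict-model-config=false`, Triton infers much of this configuration automatically for ONNX and TensorFlow SavedModel formats, but PyTorch models still need an explicit `config.pbtxt`.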
Canary Deployments with KServe

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-canary
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/v2"  # New version
      resources:
        limits:
          nvidia.com/gpu: 1
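
With `canaryTrafficPercent: 10`, KServe keeps the previously promoted revision serving 90% of traffic while the new `storageUri` receives 10%. Promotion is a staged edit of the same InferenceService, sketched below:

```yaml
# Promotion sketch: patch the same InferenceService in stages.
spec:
  predictor:
    canaryTrafficPercent: 50  # step up once canary metrics look healthy
# Finally, remove the canaryTrafficPercent field (or set it to 100)
# to promote the canary and route all traffic to the new model.
```

If the canary misbehaves, setting `canaryTrafficPercent: 0` rolls all traffic back to the previous revision without redeploying.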

Pro tip: Use KServe for its Kubernetes-native model management and auto-scaling capabilities. For maximum inference performance, run Triton as the KServe runtime backend. This gives you the best of both worlds: KServe's deployment management with Triton's optimized inference engine.
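
That combination can be sketched as an InferenceService that selects Triton as its serving runtime. The resource name is illustrative; `kserve-tritonserver` is the ClusterServingRuntime shipped with standard KServe installs:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-triton
spec:
  predictor:
    model:
      modelFormat:
        name: triton                 # repository laid out for Triton
      runtime: kserve-tritonserver   # Triton as the serving runtime
      storageUri: "s3://models/triton-repo"
      resources:
        limits:
          nvidia.com/gpu: 1
```

KServe then manages the revisioning, scaling, and canary traffic while Triton handles dynamic batching and GPU execution inside the pod.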