
Auto-Scaling GPU Infrastructure

GPU instances cost $1-32/hour each. Running them idle wastes money; not having enough tanks latency. This lesson covers Kubernetes GPU scheduling, scale-from-zero patterns, custom scaling metrics, spot instances, and strategies to mitigate the 2-5 minute cold start that makes GPU autoscaling uniquely challenging.

Why GPU Autoscaling Is Hard

GPU autoscaling is fundamentally different from CPU autoscaling for three reasons:

  • Cold start time: A new GPU pod takes 2-5 minutes (pull container image, load model into GPU memory, warm up CUDA kernels). CPU pods start in seconds.
  • Cost: A single A100 GPU instance costs ~$3/hr on AWS. Over-provisioning 10 GPUs you do not need costs $30/hr = $720/day.
  • GPU is not divisible: You cannot give a pod "0.5 GPUs" in standard Kubernetes. Each pod gets whole GPUs, leading to discrete scaling steps with large cost jumps.
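The cost and divisibility points above can be made concrete with a quick back-of-the-envelope calculation (the per-replica throughput is an illustrative assumption, not a measured value):

```python
# Back-of-the-envelope cost of over- vs under-provisioning GPU replicas.
import math

GPU_COST_PER_HOUR = 3.00   # ~A100 on-demand, per the figure above
QPS_PER_REPLICA = 500      # assumed per-replica throughput

def replicas_needed(qps: float) -> int:
    """Whole-GPU scaling: round up, never fractional."""
    return max(1, math.ceil(qps / QPS_PER_REPLICA))

def idle_cost_per_day(provisioned: int, needed: int) -> float:
    """Dollars per day burned on replicas you did not need."""
    return max(0, provisioned - needed) * GPU_COST_PER_HOUR * 24

# 10 idle A100s: $720/day, matching the figure in the text
assert idle_cost_per_day(provisioned=12, needed=2) == 720.0

print(replicas_needed(1200))  # -> 3 (2.4 GPUs is not an option)
```

Note the discrete step: going from 1200 to 1300 QPS forces a whole extra GPU, which is why scaling decisions carry much larger cost jumps than with CPU pods.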

Kubernetes GPU Scheduling

Kubernetes supports GPU scheduling through the NVIDIA device plugin. GPUs are treated as extended resources that pods can request.

# Kubernetes deployment with GPU resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton-inference
  template:
    metadata:
      labels:
        app: triton-inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8002"
    spec:
      # Schedule on GPU nodes only
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"

      # Topology: prefer co-location with model cache
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: triton-inference
                topologyKey: "kubernetes.io/hostname"

      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.08-py3
          command: ["tritonserver"]
          args:
            - "--model-repository=s3://models/production"
            - "--model-control-mode=poll"
            - "--repository-poll-secs=60"
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              nvidia.com/gpu: 1      # Request 1 GPU
              memory: "16Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: 1      # Limit to 1 GPU
              memory: "32Gi"
              cpu: "8"
          # Model cache volume for faster startup
          volumeMounts:
            - name: model-cache
              mountPath: /model-cache
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          hostPath:
            path: /mnt/model-cache
            type: DirectoryOrCreate
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"

Custom Metrics for GPU Scaling

Standard CPU/memory-based HPA is nearly useless for GPU workloads: a pod can saturate its GPU while its CPU sits mostly idle, so CPU-based signals never fire. You need custom metrics that reflect actual inference demand.

# Horizontal Pod Autoscaler with custom GPU metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference
  minReplicas: 2            # Minimum for high availability
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 1 min before scaling up more
      policies:
        - type: Pods
          value: 4                       # Add up to 4 pods at a time
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1                       # Remove 1 pod at a time
          periodSeconds: 300
  metrics:
    # Primary: inference queue depth per pod
    - type: Pods
      pods:
        metric:
          name: triton_queue_depth
        target:
          type: AverageValue
          averageValue: "10"            # Scale when avg queue > 10

    # Secondary: GPU utilization
    - type: Pods
      pods:
        metric:
          name: gpu_utilization_percent
        target:
          type: AverageValue
          averageValue: "75"            # Scale when avg GPU > 75%

    # Tertiary: P99 inference latency
    - type: Pods
      pods:
        metric:
          name: inference_p99_latency_ms
        target:
          type: AverageValue
          averageValue: "50"            # Scale when P99 > 50ms
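With multiple metrics, the HPA computes one proposal per metric, roughly desired = ceil(current × currentValue / targetValue), and acts on the largest. A sketch of that arithmetic (it ignores the HPA's tolerance band and the stabilization windows configured above; the sample values are made up):

```python
import math

def desired_replicas(current: int, metrics: dict[str, tuple[float, float]]) -> int:
    """HPA-style calculation: for each metric (value, target),
    desired = ceil(current * value / target); the final answer is the
    max across metrics, before min/maxReplicas clamping."""
    proposals = [
        math.ceil(current * value / target)
        for value, target in metrics.values()
    ]
    return max(proposals)

# Example: 4 pods, queue depth is fine but GPU utilization is hot
print(desired_replicas(4, {
    "triton_queue_depth": (8, 10),         # 4 * 8/10  -> ceil(3.2) = 4
    "gpu_utilization_percent": (90, 75),   # 4 * 90/75 -> ceil(4.8) = 5
    "inference_p99_latency_ms": (40, 50),  # 4 * 40/50 -> ceil(3.2) = 4
}))  # -> 5
```

Because the max wins, a single hot metric is enough to trigger a scale-up, which is exactly what you want with three complementary demand signals.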

Exposing Custom Metrics

# Prometheus adapter configuration for custom metrics
# prometheus-adapter-config.yaml
rules:
  - seriesQuery: 'nv_inference_pending_request_count{namespace="ml-serving"}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "nv_inference_pending_request_count"
      as: "triton_queue_depth"
    metricsQuery: |
      sum(avg_over_time(nv_inference_pending_request_count{<<.LabelMatchers>>}[2m]))
      by (<<.GroupBy>>)

  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace="ml-serving"}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
      as: "gpu_utilization_percent"
    metricsQuery: 'avg_over_time(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}[2m])'

Scale-from-Zero with KEDA

For workloads with long idle periods (internal tools, dev environments, low-traffic models), scaling to zero saves significant costs. KEDA (Kubernetes Event-Driven Autoscaling) enables this.

# KEDA ScaledObject for scale-from-zero GPU inference
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-scaledobject
  namespace: ml-serving
spec:
  scaleTargetRef:
    name: triton-inference
  minReplicaCount: 0              # Scale to zero!
  maxReplicaCount: 10
  cooldownPeriod: 600             # 10 min before scale-to-zero
  idleReplicaCount: 0
  pollingInterval: 15
  triggers:
    # Scale based on Prometheus queue depth
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: inference_pending_requests
        query: |
          sum(nv_inference_pending_request_count{
            namespace="ml-serving"
          })
        threshold: "1"             # Scale up when any requests pending
        activationThreshold: "0"   # Activate from zero on first request

    # Also scale based on HTTP request rate
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: inference_request_rate
        query: |
          sum(rate(http_requests_total{
            namespace="ml-serving",
            service="triton-inference"
          }[2m])) * 120
        threshold: "100"           # Scale up when > 100 req/2min

Cold Start Mitigation

A 2-5 minute cold start is unacceptable for real-time inference. Here are production-proven strategies to mitigate it:

Strategy 1: Warm Pool (Pre-provisioned Standby)

# Keep warm pods with model loaded but not receiving traffic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-warm-pool
spec:
  replicas: 2    # Always keep 2 warm pods ready
  template:
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.08-py3
          resources:
            requests:
              nvidia.com/gpu: 1
          # Health check passes, but Service selector excludes these pods
          # They are NOT in the active Service until traffic spike triggers
          # promotion via label change

# Controller that promotes warm pods to active on scale event:
# 1. Detect scale-up needed (queue depth spike)
# 2. Add label "role: active" to warm pod (instant - no cold start)
# 3. Start provisioning a new warm pod (2-5 min, but non-urgent)
# Net effect: 0-second cold start for first scale event
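The promotion step in that controller is just a label change. A cluster-free sketch of the selection logic (pod shapes and label names are assumptions; the returned patch body is what a controller would send via the Kubernetes pod PATCH endpoint):

```python
# Hypothetical warm-pool promotion logic, kept cluster-free so it is easy
# to unit-test: given pods and their labels, pick a warm standby and build
# the label patch a controller would apply.

def pick_promotion(pods: list[dict]) -> "tuple[str, dict] | None":
    """Return (pod_name, patch_body) for the first warm standby, or None."""
    for pod in pods:
        labels = pod.get("labels", {})
        if labels.get("app") == "triton-warm-pool" and labels.get("role") == "warm":
            # Relabeling is instant: no image pull, model load, or CUDA warm-up
            return pod["name"], {"metadata": {"labels": {"role": "active"}}}
    return None  # warm pool exhausted; fall back to a normal (slow) scale-up

pods = [
    {"name": "triton-abc", "labels": {"app": "triton-inference", "role": "active"}},
    {"name": "warm-xyz", "labels": {"app": "triton-warm-pool", "role": "warm"}},
]
print(pick_promotion(pods))
# -> ('warm-xyz', {'metadata': {'labels': {'role': 'active'}}})
```

The key design choice is that the Service selector (assumed to match role: active) does the routing; promotion never restarts the pod, so the model stays loaded in GPU memory throughout.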

Strategy 2: Model Cache on Local SSD

# Cache model weights on node-local SSD to skip S3 download
# Typical cold start breakdown:
# Pull container image:     30-60s (if not cached)
# Download model from S3:   60-180s (10-100 GB model)
# Load model into GPU:      10-30s
# CUDA warm-up:             5-15s
# Total:                    2-5 minutes

# With local SSD cache:
# Pull container image:     0s (cached by DaemonSet pre-puller)
# Download model from S3:   0s (cached on local SSD)
# Load model into GPU:      10-30s
# CUDA warm-up:             5-15s
# Total:                    15-45 seconds (4-8x faster)

# DaemonSet that pre-pulls images and caches models
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: model-cache-warmer
spec:
  selector:
    matchLabels:
      app: model-cache-warmer
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
        - name: cache-warmer
          image: model-cache-warmer:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Sync models from S3 to local cache every 5 minutes
              while true; do
                aws s3 sync s3://models/production /mnt/model-cache/ \
                  --exclude "*.tmp" \
                  --size-only
                sleep 300
              done
          volumeMounts:
            - name: model-cache
              mountPath: /mnt/model-cache
      volumes:
        - name: model-cache
          hostPath:
            path: /mnt/model-cache

Strategy 3: Predictive Scaling

# Scale based on predicted traffic, not reactive metrics
import numpy as np
from datetime import datetime, timedelta

class PredictiveScaler:
    def __init__(self, history_days=14):
        # load_traffic_history / get_current_replicas are assumed helpers
        # backed by your metrics store (e.g. Prometheus range queries)
        self.history = load_traffic_history(days=history_days)

    def predict_replicas(self, lookahead_minutes=15) -> int:
        """Predict needed replicas 15 minutes from now."""
        now = datetime.utcnow()
        future = now + timedelta(minutes=lookahead_minutes)

        # Use the same time-of-day from previous weeks as the prediction
        same_time_history = []
        for days_ago in [7, 14]:
            # Look up the *future* slot, shifted back 1-2 weeks
            historical = future - timedelta(days=days_ago)
            traffic = self.history.get_traffic_at(historical)
            same_time_history.append(traffic)

        predicted_qps = np.percentile(same_time_history, 90)  # P90 of history

        # Add 20% buffer for safety
        predicted_qps *= 1.2

        # Each replica handles ~500 QPS for this model
        replicas = max(2, int(np.ceil(predicted_qps / 500)))
        return replicas

    def should_prescale(self) -> tuple[bool, int]:
        """Check if we should scale NOW for traffic 15 min from now."""
        target = self.predict_replicas(lookahead_minutes=15)
        current = get_current_replicas()

        if target > current:
            return True, target
        return False, current

# Run as a CronJob every 5 minutes
# kubectl create cronjob predictive-scaler \
#   --image=predictive-scaler:latest \
#   --schedule="*/5 * * * *"

Spot/Preemptible GPU Instances

Spot instances offer 60-90% savings on GPU costs but can be reclaimed with only two minutes' notice. With proper interruption handling, they are an excellent fit for stateless inference workloads.

| Instance Type | On-Demand $/hr | Spot $/hr | Savings | Interruption Rate |
| --- | --- | --- | --- | --- |
| AWS p4d.24xlarge (8x A100) | $32.77 | $9.83 | 70% | ~5-10% |
| AWS g5.xlarge (1x A10G) | $1.01 | $0.30 | 70% | ~5-15% |
| GCP a2-highgpu-1g (1x A100) | $3.67 | $1.10 | 70% | ~5-10% |
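Using the g5.xlarge prices from the table, the blended cost of a base-plus-burst fleet is easy to estimate (fleet sizes here are illustrative):

```python
# Blended hourly cost of a base-on-demand + burst-spot fleet,
# using the g5.xlarge prices from the table above.
ON_DEMAND = 1.01   # $/hr
SPOT = 0.30        # $/hr

def blended_hourly(base_od: int, burst_spot: int) -> float:
    return base_od * ON_DEMAND + burst_spot * SPOT

all_od = 10 * ON_DEMAND        # 10 on-demand pods
mixed = blended_hourly(2, 8)   # 2 on-demand + 8 spot pods
print(f"${all_od:.2f}/hr vs ${mixed:.2f}/hr "
      f"({(1 - mixed / all_od):.0%} saved)")
# -> $10.10/hr vs $4.42/hr (56% saved)
```

The savings grow with the burst fraction: the more of your capacity is elastic, the more of it can safely ride on spot pricing.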
# Mixed spot + on-demand strategy with Karpenter (AWS)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-inference
spec:
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["g5.xlarge", "g5.2xlarge"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
    - key: "nvidia.com/gpu"
      operator: Exists

  # Priority: prefer spot, fall back to on-demand
  weight: 10    # Higher weight = higher priority for this provisioner

  limits:
    resources:
      nvidia.com/gpu: 20

  # Node lifecycle: recycle nodes daily; remove empty nodes after 5 min
  ttlSecondsUntilExpired: 86400
  ttlSecondsAfterEmpty: 300

  providerRef:
    name: gpu-node-template

---
# Pod disruption budget to ensure availability during spot reclamation
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: triton-pdb
spec:
  minAvailable: 2    # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: triton-inference
💡 Production pattern: Run your minimum required capacity on on-demand instances and all burst capacity on spot instances. For example: 2 on-demand pods (guaranteed availability) + 0-18 spot pods (cost-effective scaling). This gives you reliability where it matters and savings where it is safe.
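Surviving the reclamation itself comes down to reacting within the two-minute notice. On AWS, a scheduled interruption appears as an instance-action document in the instance metadata service, at http://169.254.169.254/latest/meta-data/spot/instance-action (the endpoint returns 404 until a reclamation is pending). Parsing it and computing the drain budget is plain Python; the HTTP fetch and the actual drain hook are left out as environment-specific:

```python
import json
from datetime import datetime, timezone

def seconds_until_reclaim(instance_action_json: str, now: datetime) -> float:
    """How long we have to drain before the spot instance is taken back.
    Expects the AWS instance-action document, e.g.
    {"action": "terminate", "time": "2025-01-01T12:02:00Z"}."""
    action = json.loads(instance_action_json)
    deadline = datetime.strptime(
        action["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()

notice = '{"action": "terminate", "time": "2025-01-01T12:02:00Z"}'
now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_until_reclaim(notice, now))  # -> 120.0
```

A sidecar or node daemon polling this endpoint can cordon the node and stop accepting new inference requests the moment a notice appears, letting in-flight requests finish inside the two-minute window while the PodDisruptionBudget keeps the rest of the fleet serving.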

What Is Next

You now know how to scale GPU infrastructure efficiently with custom metrics, handle cold starts, and leverage spot instances. The next lesson covers A/B Testing and Canary Deployments — how to safely roll out new model versions with shadow deployments, traffic splitting, and statistical validation.