# Auto-Scaling GPU Infrastructure
GPU instances cost $1-32/hour each. Running them idle wastes money; running too few tanks latency. This lesson covers Kubernetes GPU scheduling, scale-from-zero patterns, custom scaling metrics, spot instances, and strategies to mitigate the 2-5 minute cold starts that make GPU autoscaling uniquely challenging.
## Why GPU Autoscaling Is Hard
GPU autoscaling is fundamentally different from CPU autoscaling for three reasons:
- Cold start time: A new GPU pod takes 2-5 minutes (pull container image, load model into GPU memory, warm up CUDA kernels). CPU pods start in seconds.
- Cost: A single A100 GPU instance costs ~$3/hr on AWS. Over-provisioning 10 GPUs you do not need costs $30/hr = $720/day.
- GPUs are not divisible: You cannot give a pod "0.5 GPUs" in standard Kubernetes. Each pod gets whole GPUs, leading to discrete scaling steps with large cost jumps.
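The discrete-step effect from the last point is easy to quantify. A minimal sketch, where the instance price and per-replica throughput are illustrative assumptions rather than vendor figures:

```python
# Hypothetical illustration: GPU scaling is stepwise, so cost jumps in
# whole-instance increments. Assumed numbers: $3/hr per GPU instance,
# 500 QPS capacity per replica (placeholders, not measured values).
import math

GPU_HOURLY_COST = 3.00   # assumed on-demand rate, $/hr
QPS_PER_REPLICA = 500    # assumed per-replica throughput

def replicas_needed(qps: float) -> int:
    """Whole GPUs only: round capacity up to the next integer replica."""
    return max(1, math.ceil(qps / QPS_PER_REPLICA))

def hourly_cost(qps: float) -> float:
    return replicas_needed(qps) * GPU_HOURLY_COST

# 501 QPS needs 2 full replicas, not 1.002 -- one extra request past the
# step boundary costs a whole additional GPU
for qps in (400, 501, 1400):
    print(qps, replicas_needed(qps), hourly_cost(qps))
```

Crossing a step boundary by a single request doubles the bill from one replica to two, which is why the scaling signals later in this lesson matter so much.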
## Kubernetes GPU Scheduling
Kubernetes supports GPU scheduling through the NVIDIA device plugin. GPUs are treated as extended resources that pods can request.
```yaml
# Kubernetes deployment with GPU resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton-inference
  template:
    metadata:
      labels:
        app: triton-inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8002"
    spec:
      # Schedule on GPU nodes only
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      # Topology: spread replicas across nodes so one node failure
      # does not take down multiple inference pods
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: triton-inference
                topologyKey: "kubernetes.io/hostname"
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.08-py3
          command: ["tritonserver"]
          args:
            - "--model-repository=s3://models/production"
            - "--model-control-mode=poll"
            - "--repository-poll-secs=60"
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              nvidia.com/gpu: 1  # Request 1 GPU
              memory: "16Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: 1  # Limit to 1 GPU
              memory: "32Gi"
              cpu: "8"
          # Model cache volume for faster startup
          volumeMounts:
            - name: model-cache
              mountPath: /model-cache
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          hostPath:
            path: /mnt/model-cache
            type: DirectoryOrCreate
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
```
## Custom Metrics for GPU Scaling
Standard CPU/memory-based HPA is useless for GPU workloads: an inference pod can sit at 5% CPU while its GPU is fully saturated. You need custom metrics that reflect actual inference demand.
```yaml
# Horizontal Pod Autoscaler with custom GPU metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference
  minReplicas: 2    # Minimum for high availability
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 1 min before scaling up further
      policies:
        - type: Pods
          value: 4            # Add up to 4 pods at a time
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1            # Remove 1 pod at a time
          periodSeconds: 300
  metrics:
    # Primary: inference queue depth per pod
    - type: Pods
      pods:
        metric:
          name: triton_queue_depth
        target:
          type: AverageValue
          averageValue: "10"   # Scale when avg queue > 10
    # Secondary: GPU utilization
    - type: Pods
      pods:
        metric:
          name: gpu_utilization_percent
        target:
          type: AverageValue
          averageValue: "75"   # Scale when avg GPU > 75%
    # Tertiary: P99 inference latency
    - type: Pods
      pods:
        metric:
          name: inference_p99_latency_ms
        target:
          type: AverageValue
          averageValue: "50"   # Scale when P99 > 50ms
```
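When multiple metrics are configured, the HPA evaluates each one independently and acts on the largest result. A sketch of that rule, using the `autoscaling/v2` scaling formula with hypothetical metric readings:

```python
# HPA scaling rule: for each metric,
#   desired = ceil(current_replicas * current_value / target_value)
# and the controller takes the max across metrics, clamped to min/max.
import math

def hpa_desired_replicas(current_replicas: int,
                         metrics: list[tuple[float, float]],
                         min_replicas: int = 2,
                         max_replicas: int = 20) -> int:
    """metrics is a list of (current_value, target_value) pairs."""
    desired = max(
        math.ceil(current_replicas * current / target)
        for current, target in metrics
    )
    return min(max_replicas, max(min_replicas, desired))

# With 4 pods: avg queue depth 25 (target 10), GPU util 60% (target 75),
# P99 latency 40ms (target 50). Queue depth dominates: ceil(4 * 25/10) = 10.
print(hpa_desired_replicas(4, [(25, 10), (60, 75), (40, 50)]))  # -> 10
```

This max-across-metrics behavior is why a conservative latency target will not block a scale-up driven by queue depth: the most aggressive metric always wins.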
### Exposing Custom Metrics
```yaml
# Prometheus adapter configuration for custom metrics
# prometheus-adapter-config.yaml
rules:
  - seriesQuery: 'nv_inference_pending_request_count{namespace="ml-serving"}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "nv_inference_pending_request_count"
      as: "triton_queue_depth"
    # Triton's pending-request gauge, smoothed over 2 minutes: a true
    # per-pod queue depth matching the HPA target of 10
    metricsQuery: 'avg_over_time(nv_inference_pending_request_count{<<.LabelMatchers>>}[2m])'
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace="ml-serving"}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
      as: "gpu_utilization_percent"
    metricsQuery: 'avg_over_time(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}[2m])'
```
## Scale-from-Zero with KEDA
For workloads with long idle periods (internal tools, dev environments, low-traffic models), scaling to zero saves significant costs. KEDA (Kubernetes Event-Driven Autoscaling) enables this.
```yaml
# KEDA ScaledObject for scale-from-zero GPU inference
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-scaledobject
  namespace: ml-serving
spec:
  scaleTargetRef:
    name: triton-inference
  minReplicaCount: 0    # Scale to zero!
  maxReplicaCount: 10
  cooldownPeriod: 600   # 10 min before scale-to-zero
  pollingInterval: 15
  triggers:
    # Scale based on Prometheus queue depth
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: inference_pending_requests
        query: |
          sum(nv_inference_pending_request_count{namespace="ml-serving"})
        threshold: "1"             # Scale up when any requests are pending
        activationThreshold: "0"   # Activate from zero on the first request
    # Also scale based on HTTP request rate
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: inference_request_rate
        query: |
          sum(rate(http_requests_total{namespace="ml-serving",service="triton-inference"}[2m])) * 120
        threshold: "100"           # Scale up when > 100 requests per 2 min
```
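Rough savings math for scale-to-zero on a low-traffic model. All numbers here are placeholders for illustration: one $3/hr GPU, busy two hours a day in three separate periods, each followed by the 10-minute KEDA cooldown:

```python
# Back-of-envelope comparison: always-on vs scale-to-zero for a
# low-traffic internal model. All inputs are assumed placeholders.
GPU_HOURLY_COST = 3.00                   # assumed $/hr for one GPU
busy_hours_per_day = 2.0                 # assumed actual serving time
cooldown_hours_per_day = 3 * (10 / 60)   # 3 idle-out events x 10 min cooldown

always_on_monthly = GPU_HOURLY_COST * 24 * 30
scale_to_zero_monthly = (
    GPU_HOURLY_COST * (busy_hours_per_day + cooldown_hours_per_day) * 30
)

print(round(always_on_monthly), round(scale_to_zero_monthly))  # 2160 225
```

Under these assumptions the bill drops from $2,160 to $225 a month, roughly a 90% saving, which is why scale-to-zero is worth the cold-start pain for intermittent workloads.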
## Cold Start Mitigation
A 2-5 minute cold start is unacceptable for real-time inference. Here are production-proven strategies to mitigate it:
### Strategy 1: Warm Pool (Pre-provisioned Standby)
```yaml
# Keep warm pods with the model loaded but not receiving traffic.
# The role label scheme (warm/active) is one convention for keeping
# warm pods out of the active Service's selector.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-warm-pool
spec:
  replicas: 2   # Always keep 2 warm pods ready
  selector:
    matchLabels:
      app: triton-inference
      role: warm
  template:
    metadata:
      labels:
        app: triton-inference
        role: warm   # Active Service selects role=active, so warm pods get no traffic
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.08-py3
          resources:
            requests:
              nvidia.com/gpu: 1
# Health checks pass, but the Service selector excludes these pods.
# They are NOT in the active Service until a traffic spike triggers
# promotion via a label change.
# Controller that promotes warm pods to active on a scale event:
# 1. Detect that scale-up is needed (queue depth spike)
# 2. Change a warm pod's label to "role: active" (instant - no cold start)
# 3. Start provisioning a new warm pod (2-5 min, but non-urgent)
# Net effect: 0-second cold start for the first scale event
```
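The promotion steps above can be sketched as a small decision function. This is a minimal sketch of the controller's core logic only; the threshold, label scheme, and pod names are hypothetical, and the actual label patch would go through the Kubernetes API:

```python
# Decision core of a hypothetical warm-pool controller: given the current
# queue depth and the list of warm pod names, decide what to do.
# The queue threshold of 10 mirrors the HPA target; it is an assumption.
def plan_scale_event(queue_depth: int, warm_pods: list[str],
                     queue_threshold: int = 10) -> dict:
    """Return the actions a warm-pool controller would take this cycle."""
    if queue_depth <= queue_threshold or not warm_pods:
        return {"promote": None, "replenish": False}
    # Promote the first warm pod (a label flip is near-instant, so there
    # is no cold start), then provision a replacement in the background.
    return {"promote": warm_pods[0], "replenish": True}

print(plan_scale_event(25, ["triton-warm-pool-abc12"]))
# -> {'promote': 'triton-warm-pool-abc12', 'replenish': True}
```

In a real controller the `"promote"` action would be a label patch on the pod and `"replenish"` would bump the warm-pool Deployment, but the decision logic stays this simple.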
### Strategy 2: Model Cache on Local SSD
```yaml
# Cache model weights on node-local SSD to skip the S3 download.
# Typical cold start breakdown:
#   Pull container image:    30-60s  (if not cached)
#   Download model from S3:  60-180s (10-100 GB model)
#   Load model into GPU:     10-30s
#   CUDA warm-up:            5-15s
#   Total:                   2-5 minutes
# With a local SSD cache:
#   Pull container image:    0s (cached by DaemonSet pre-puller)
#   Download model from S3:  0s (cached on local SSD)
#   Load model into GPU:     10-30s
#   CUDA warm-up:            5-15s
#   Total:                   15-45 seconds (4-8x faster)

# DaemonSet that pre-pulls images and caches models
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: model-cache-warmer
spec:
  selector:
    matchLabels:
      app: model-cache-warmer
  template:
    metadata:
      labels:
        app: model-cache-warmer
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
        - name: cache-warmer
          image: model-cache-warmer:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Sync models from S3 to local cache every 5 minutes
              while true; do
                aws s3 sync s3://models/production /mnt/model-cache/ \
                  --exclude "*.tmp" \
                  --size-only
                sleep 300
              done
          volumeMounts:
            - name: model-cache
              mountPath: /mnt/model-cache
      volumes:
        - name: model-cache
          hostPath:
            path: /mnt/model-cache
```
### Strategy 3: Predictive Scaling
```python
# Scale based on predicted traffic, not reactive metrics
import numpy as np
from datetime import datetime, timedelta

# Assumed helpers (not shown): load_traffic_history() returns an object
# with get_traffic_at(timestamp); get_current_replicas() queries the cluster.

class PredictiveScaler:
    def __init__(self, history_days=14):
        self.history = load_traffic_history(days=history_days)

    def predict_replicas(self, lookahead_minutes=15) -> int:
        """Predict needed replicas 15 minutes from now."""
        future = datetime.utcnow() + timedelta(minutes=lookahead_minutes)
        # Use the same time of day from previous weeks as the prediction
        same_time_history = []
        for days_ago in (7, 14):
            historical = future - timedelta(days=days_ago)
            same_time_history.append(self.history.get_traffic_at(historical))
        predicted_qps = np.percentile(same_time_history, 90)  # P90 of history
        # Add a 20% buffer for safety
        predicted_qps *= 1.2
        # Each replica handles ~500 QPS for this model
        return max(2, int(np.ceil(predicted_qps / 500)))

    def should_prescale(self) -> tuple[bool, int]:
        """Check if we should scale NOW for traffic 15 min from now."""
        target = self.predict_replicas(lookahead_minutes=15)
        current = get_current_replicas()
        if target > current:
            return True, target
        return False, current

# Run as a CronJob every 5 minutes:
# kubectl create cronjob predictive-scaler \
#   --image=predictive-scaler:latest \
#   --schedule="*/5 * * * *"
```
## Spot/Preemptible GPU Instances
Spot instances offer 60-90% savings on GPU costs but can be reclaimed with two minutes' notice. With proper interruption handling, they are excellent for stateless inference workloads.
| Instance Type | On-Demand $/hr | Spot $/hr | Savings | Interruption Rate |
|---|---|---|---|---|
| AWS p4d.24xlarge (8x A100) | $32.77 | $9.83 | 70% | ~5-10% |
| AWS g5.xlarge (1x A10G) | $1.01 | $0.30 | 70% | ~5-15% |
| GCP a2-highgpu-1g (1x A100) | $3.67 | $1.10 | 70% | ~5-10% |
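A quick sketch of what a mixed fleet costs, using the g5.xlarge prices from the table. The 80/20 spot/on-demand split is an arbitrary example, not a recommendation:

```python
# Hourly cost of a mixed spot + on-demand fleet. Default prices are the
# g5.xlarge figures from the table above; the split is an assumption.
def blended_hourly_cost(replicas: int, spot_fraction: float,
                        on_demand: float = 1.01, spot: float = 0.30) -> float:
    """Cost per hour with spot_fraction of replicas on spot capacity."""
    spot_count = round(replicas * spot_fraction)
    od_count = replicas - spot_count
    return spot_count * spot + od_count * on_demand

# 10 replicas at 80% spot: 8 x $0.30 + 2 x $1.01 = $4.42/hr,
# versus $10.10/hr for an all on-demand fleet
print(blended_hourly_cost(10, 0.8), blended_hourly_cost(10, 0.0))
```

Keeping a small on-demand floor (here 2 of 10 replicas) is what lets the PodDisruptionBudget below hold even when every spot node is reclaimed at once.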
```yaml
# Mixed spot + on-demand strategy with Karpenter (AWS)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-inference
spec:
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["g5.xlarge", "g5.2xlarge"]
    # Karpenter prefers spot and falls back to on-demand when spot
    # capacity is unavailable
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
    - key: "nvidia.com/gpu"
      operator: Exists
  weight: 10   # Higher weight = higher priority for this provisioner
  limits:
    resources:
      nvidia.com/gpu: 20
  # Recycle nodes daily; remove empty nodes after 5 minutes
  ttlSecondsUntilExpired: 86400
  ttlSecondsAfterEmpty: 300
  providerRef:
    name: gpu-node-template
---
# Pod disruption budget to maintain availability during spot reclamation
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: triton-pdb
spec:
  minAvailable: 2   # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: triton-inference
```
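When AWS reclaims a spot node, it publishes a warning at the instance-metadata path `/latest/meta-data/spot/instance-action` roughly two minutes before termination. A sketch of a watcher that reacts to it; the fetch function is injected so the drain logic runs without a real instance, and the drain behavior itself is a placeholder:

```python
# Sketch of a spot-interruption watcher (e.g. a sidecar). The fetch
# callable stands in for an HTTP GET against the instance metadata
# endpoint, returning the response body or None on a 404 (no notice).
import json

def check_interruption(fetch) -> bool:
    """Return True if a spot interruption notice is present."""
    body = fetch()
    if body is None:
        return False
    action = json.loads(body)
    # "terminate" and "stop" both mean the node goes away within ~2 minutes
    return action.get("action") in ("terminate", "stop")

def on_interruption():
    """Placeholder drain: a real pod would fail its readiness probe to
    stop new traffic, finish in-flight inference, then exit."""
    print("draining: failing readiness, finishing in-flight requests")

# Simulated notice, in the shape AWS returns at
# /latest/meta-data/spot/instance-action
if check_interruption(lambda: '{"action": "terminate", "time": "2025-01-01T00:00:00Z"}'):
    on_interruption()
```

Pairing a watcher like this with the PodDisruptionBudget above turns a spot reclaim into an ordinary graceful shutdown instead of dropped requests.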
## What's Next
You now know how to scale GPU infrastructure efficiently with custom metrics, handle cold starts, and leverage spot instances. The next lesson covers A/B Testing and Canary Deployments — how to safely roll out new model versions with shadow deployments, traffic splitting, and statistical validation.