
GPU Time-Slicing

Configure NVIDIA GPU time-slicing to share a single physical GPU across multiple pods, enabling cost-effective utilization for development, inference, and lightweight training workloads.

What is GPU Time-Slicing?

GPU time-slicing enables multiple containers to share a single GPU by rapidly switching between workloads, similar to how operating systems time-slice CPU cores across processes. Unlike MIG, time-slicing does not provide memory isolation — all workloads share the same GPU memory space.

💡 When to use time-slicing: Time-slicing is ideal for development environments, lightweight inference, and interactive notebooks where workloads are bursty and don't require guaranteed GPU resources. For production workloads requiring isolation, consider MIG instead.

Configuring Time-Slicing

Time-slicing is configured through the NVIDIA device plugin's ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false  # true advertises shares as nvidia.com/gpu.shared instead
        failRequestsGreaterThanOne: false  # true rejects pods requesting more than one share
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Each GPU appears as 4 virtual GPUs

With replicas: 4, a node with 2 physical GPUs will advertise 8 nvidia.com/gpu resources. Each pod still requests nvidia.com/gpu: 1 but now receives a time-sliced share.
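The capacity arithmetic above can be sketched in a few lines (plain Python; the function name is illustrative, not part of any Kubernetes API):

```python
# Advertised GPU capacity under time-slicing: each physical GPU is
# presented to the scheduler `replicas` times. Illustrative helper only.
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources the node advertises."""
    return physical_gpus * replicas

# A node with 2 physical GPUs and replicas: 4 advertises 8 schedulable
# GPUs, so up to 8 pods requesting nvidia.com/gpu: 1 can land on it.
print(advertised_gpus(2, 4))  # → 8
```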

Deployment Example

# Deploy 4 inference pods sharing 1 GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: model-server
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 1  # Gets a time-slice
        ports:
        - containerPort: 8000
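To reach the four time-sliced replicas behind a single endpoint, a standard Service can load-balance across them. This is a sketch; the Service name mirrors the Deployment and is an assumption:

```yaml
# Hypothetical Service fronting the time-sliced inference pods
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  selector:
    app: inference  # matches the Deployment's pod labels
  ports:
  - port: 8000
    targetPort: 8000
```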

Time-Slicing vs MIG Comparison

| Feature | Time-Slicing | MIG |
| --- | --- | --- |
| Memory isolation | No (shared) | Yes (partitioned) |
| Compute isolation | No (best-effort) | Yes (guaranteed) |
| GPU support | All NVIDIA GPUs | Ampere and newer data-center GPUs (e.g., A30, A100, H100) |
| Configuration | Simple ConfigMap | Node-level partitioning |
| Use case | Dev, light inference | Production, multi-tenant |
| Oversubscription | Possible (OOM risk) | Not possible |
Best practice: Monitor GPU memory usage closely when using time-slicing. Since there is no memory isolation, one pod can cause out-of-memory errors for all pods sharing the GPU. Set conservative replica counts based on your workload's memory footprint.
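One way to choose a conservative replica count is to divide the GPU's memory by each workload's measured peak footprint, leaving some headroom. A minimal sketch; the 20% headroom figure is an assumption, not an NVIDIA recommendation, and peak usage must be measured, not guessed:

```python
import math

def safe_replica_count(gpu_memory_gib: float,
                       workload_peak_gib: float,
                       headroom_fraction: float = 0.2) -> int:
    """Largest replica count whose combined peak memory fits on the GPU
    with `headroom_fraction` of memory held back. Illustrative only:
    time-slicing enforces no memory limits, so this is a planning aid."""
    usable = gpu_memory_gib * (1.0 - headroom_fraction)
    return max(1, math.floor(usable / workload_peak_gib))

# A 24 GiB GPU with workloads peaking at 4 GiB: floor(19.2 / 4) = 4
print(safe_replica_count(24, 4))  # → 4
```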