Intermediate
GPU Time-Slicing
Configure NVIDIA GPU time-slicing to share a single physical GPU across multiple pods, enabling cost-effective utilization for development, inference, and lightweight training workloads.
What is GPU Time-Slicing?
GPU time-slicing enables multiple containers to share a single GPU by rapidly switching between workloads, similar to how operating systems time-slice CPU cores across processes. Unlike MIG, time-slicing does not provide memory isolation — all workloads share the same GPU memory space.
When to use time-slicing: Time-slicing is ideal for development environments, lightweight inference, and interactive notebooks where workloads are bursty and don't require guaranteed GPU resources. For production workloads requiring isolation, consider MIG instead.
Configuring Time-Slicing
Time-slicing is configured through the NVIDIA device plugin's ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Each GPU appears as 4 virtual GPUs
```
With replicas: 4, a node with 2 physical GPUs will advertise 8 nvidia.com/gpu resources. Each pod still requests nvidia.com/gpu: 1 but now receives a time-sliced share.
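A rough sketch of rolling this out, assuming the device plugin was deployed with its standard DaemonSet name and is wired to read the ConfigMap above (exact wiring depends on your install method, e.g. the Helm chart's `config.name` value; the file name here is illustrative):

```shell
# Apply the time-slicing ConfigMap (file name is illustrative)
kubectl apply -f time-slicing-config.yaml

# Restart the device plugin so it re-reads the sharing config
# (DaemonSet name assumes the standard NVIDIA device plugin manifests)
kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n kube-system

# Verify nodes now advertise the multiplied GPU count (e.g. 8 for 2 physical GPUs)
kubectl describe nodes | grep nvidia.com/gpu
```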
Deployment Example
```yaml
# Deploy 4 inference pods sharing 1 GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
        - name: model-server
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 1  # Gets a time-slice
          ports:
            - containerPort: 8000
```
Time-Slicing vs MIG Comparison
| Feature | Time-Slicing | MIG |
|---|---|---|
| Memory isolation | No (shared) | Yes (partitioned) |
| Compute isolation | No (best-effort) | Yes (guaranteed) |
| GPU support | All NVIDIA GPUs | Ampere and later data-center GPUs (e.g. A100, A30, H100) |
| Configuration | Simple ConfigMap | Node-level partitioning |
| Use case | Dev, light inference | Production, multi-tenant |
| Oversubscription | Possible (OOM risk) | Not possible |
Best practice: Monitor GPU memory usage closely when using time-slicing. Since there is no memory isolation, one pod can cause out-of-memory errors for all pods sharing the GPU. Set conservative replica counts based on your workload's memory footprint.
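One way to pick that conservative replica count is to budget GPU memory explicitly. A minimal sketch, where the helper name, the headroom fraction, and the example sizes are illustrative rather than anything prescribed by NVIDIA:

```python
# Sketch: derive a conservative time-slicing replica count from a memory
# budget. All numbers and the helper name are illustrative assumptions.

def max_safe_replicas(gpu_memory_gib: float, per_pod_gib: float,
                      headroom_fraction: float = 0.1) -> int:
    """Largest replica count whose combined footprint fits in GPU memory
    after reserving a safety headroom (default 10%)."""
    usable = gpu_memory_gib * (1.0 - headroom_fraction)
    return max(1, int(usable // per_pod_gib))

# Example: a 40 GiB GPU shared by pods that each need ~8 GiB of GPU memory.
print(max_safe_replicas(40, 8))  # 4 replicas fit within the ~36 GiB budget
```

Since time-slicing enforces no memory limit, this budget is advisory only: a single pod that exceeds its assumed footprint can still OOM every pod on the GPU.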
Lilly Tech Systems