Intermediate

Multi-Instance GPU (MIG)

Partition NVIDIA A100, A30, and H100 GPUs into hardware-isolated instances with guaranteed compute, memory bandwidth, and fault isolation for multi-tenant Kubernetes clusters.

Understanding MIG

Multi-Instance GPU (MIG) is a hardware-level feature available on NVIDIA A100, A30, and H100 GPUs that allows a single GPU to be partitioned into up to seven independent instances. Each instance has dedicated compute units, memory, and memory bandwidth — providing true hardware isolation.

💡 Hardware isolation: Unlike time-slicing, MIG provides guaranteed resources. A workload running on a MIG instance cannot be affected by other workloads on the same GPU — no memory contention, no compute interference, and faults are isolated.

MIG Partition Profiles

The A100 80GB supports these MIG profiles:

| Profile | GPU Memory | Compute (SMs) | Max Instances |
|---------|------------|---------------|---------------|
| 1g.10gb | 10 GB      | 1/7           | 7             |
| 2g.20gb | 20 GB      | 2/7           | 3             |
| 3g.40gb | 40 GB      | 3/7           | 2             |
| 4g.40gb | 40 GB      | 4/7           | 1             |
| 7g.80gb | 80 GB      | 7/7           | 1             |
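Whether a set of profiles fits on one GPU can be checked with simple slice and memory accounting. The sketch below is illustrative only (`layout_fits` and `PROFILES` are our names, not an NVIDIA API), and it deliberately ignores the additional placement/alignment rules real MIG enforces, so it can only rule layouts out, not guarantee them:

```python
# Slices and memory (GB) per profile, taken from the A100 80GB table above.
PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}

def layout_fits(requested: list[str]) -> bool:
    """Return True if the requested profiles fit within a single GPU's
    budget of 7 compute slices and 80 GB of memory (placement rules
    not modeled)."""
    slices = sum(PROFILES[p][0] for p in requested)
    memory = sum(PROFILES[p][1] for p in requested)
    return slices <= 7 and memory <= 80

print(layout_fits(["2g.20gb"] * 3 + ["1g.10gb"]))  # 7 slices, 70 GB: fits
print(layout_fits(["3g.40gb"] * 3))                # 9 slices: does not fit
```

This is why the table caps 3g.40gb at two instances: a third would exceed the seven-slice budget.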

Enabling MIG in Kubernetes

Configure MIG using the GPU Operator's MIG manager:

# Enable MIG mode on the GPU (requires node drain)
sudo nvidia-smi -i 0 -mig 1

# Create MIG instances (example: 3 x 2g.20gb + 1 x 1g.10gb);
# -cgi accepts profile names or IDs, and -C creates the compute instances
sudo nvidia-smi mig -i 0 -cgi 2g.20gb,2g.20gb,2g.20gb,1g.10gb -C

# Verify MIG instances
nvidia-smi mig -lgi

Kubernetes MIG Strategies

The NVIDIA device plugin supports three MIG strategies:

  • none: MIG is disabled. GPUs are exposed as whole devices.
  • single: All MIG instances on a GPU must be the same profile. Resources are named nvidia.com/gpu.
  • mixed: Different MIG profiles can coexist. Resources are named by profile, e.g., nvidia.com/mig-2g.20gb.
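When the device plugin is deployed through the GPU Operator, the strategy is typically set on the ClusterPolicy resource. A minimal fragment, assuming the operator's standard `cluster-policy` object:

```yaml
# ClusterPolicy fragment selecting the MIG strategy for the device plugin
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  mig:
    strategy: mixed   # one of: none | single | mixed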

# Pod requesting a specific MIG profile
apiVersion: v1
kind: Pod
metadata:
  name: inference-small
spec:
  containers:
  - name: model
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1  # Request a 1g.10gb MIG slice
Production tip: Use the mixed MIG strategy to run different workload sizes on the same GPU. Allocate larger profiles (3g.40gb) for training and smaller profiles (1g.10gb) for inference, maximizing GPU utilization across workload types.
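With the GPU Operator, such a mixed layout can be declared as a custom mig-parted configuration and applied by labeling nodes with `nvidia.com/mig.config=<config-name>`. A sketch, where the config name `training-plus-inference` is our own choice:

```yaml
# Custom mig-parted config: one 3g.40gb slice for training plus four
# 1g.10gb slices for inference on every MIG-enabled GPU (7/7 slices, 80 GB)
version: v1
mig-configs:
  training-plus-inference:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.40gb": 1
        "1g.10gb": 4
```

The MIG manager drains GPU workloads and reconfigures instances when the node label changes, so apply new configs during a maintenance window.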