Advanced GPU Management Best Practices
Managing a GPU fleet at scale requires systematic processes for maintenance, driver lifecycle management, hardware fault handling, and cost optimization. This lesson covers the operational best practices that keep GPU infrastructure reliable and cost-effective.
GPU Fleet Maintenance
- Rolling driver updates — Update GPU drivers one node at a time, using Kubernetes cordoning and draining to avoid disrupting workloads
- DCGM diagnostics schedule — Run DCGM Level 2 diagnostics (`dcgmi diag -r 2`) on each node weekly during maintenance windows
- Firmware updates — Track GPU firmware versions and schedule updates following NVIDIA's recommended cadence
- Cooling system monitoring — Monitor ambient temperature and fan speeds; thermal issues affect entire racks
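The rolling-update bullet above can be sketched as a per-node command plan: cordon, drain, update, verify, then uncordon before touching the next node. The node names, driver version, and the `install-nvidia-driver.sh` installer script are hypothetical placeholders; only the `kubectl` and `nvidia-smi` commands are standard.

```python
# Sketch of a rolling driver update: one node at a time, cordon -> drain ->
# update -> verify -> uncordon. Node names and the installer script are
# illustrative assumptions, not a specific vendor tool.

def rolling_update_plan(nodes, driver_version):
    """Return the ordered shell commands for a one-node-at-a-time update."""
    plan = []
    for node in nodes:
        plan += [
            f"kubectl cordon {node}",
            f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data",
            f"ssh {node} 'sudo ./install-nvidia-driver.sh {driver_version}'",  # placeholder installer
            f"ssh {node} nvidia-smi",  # verify the driver loads before readmitting the node
            f"kubectl uncordon {node}",
        ]
    return plan

for cmd in rolling_update_plan(["gpu-node-01", "gpu-node-02"], "550.54.15"):
    print(cmd)
```

Generating the plan up front, rather than scripting the commands inline, makes it easy to review the exact sequence before a maintenance window begins.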
Hardware Fault Handling
| Fault Type | Detection | Response |
|---|---|---|
| Correctable ECC errors | DCGM counter threshold | Monitor trend; schedule replacement if rate increases |
| Uncorrectable ECC errors | DCGM alert, XID 48 | Immediately drain node and isolate GPU for RMA |
| NVLink failures | XID 74, bandwidth degradation | Remove from multi-GPU training pool; use for single-GPU workloads |
| Thermal throttling | Temperature exceeds 83 °C | Check cooling, reduce workload, inspect physical installation |
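The table above amounts to a small triage policy, which a fleet-health service could encode roughly as follows. The correctable-ECC rate threshold is an illustrative assumption; the XID codes and responses come from the table.

```python
# Minimal triage sketch mapping the fault table to a response.
# The correctable-ECC alerting threshold is an assumed example value.

CORRECTABLE_ECC_RATE_THRESHOLD = 100  # errors/day; illustrative assumption

def triage(xid=None, correctable_ecc_per_day=0, temp_c=0.0):
    """Return the recommended response for the highest-severity signal."""
    if xid == 48:                      # uncorrectable ECC error
        return "drain node, isolate GPU for RMA"
    if xid == 74:                      # NVLink failure
        return "remove from multi-GPU pool"
    if temp_c > 83:                    # thermal throttling
        return "check cooling, reduce workload"
    if correctable_ecc_per_day > CORRECTABLE_ECC_RATE_THRESHOLD:
        return "monitor trend, schedule replacement"
    return "healthy"

print(triage(xid=48))                       # -> drain node, isolate GPU for RMA
print(triage(correctable_ecc_per_day=250))  # -> monitor trend, schedule replacement
```

Ordering the checks by severity matters: an uncorrectable error should always win over a correctable-error trend on the same GPU.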
Cost Optimization Strategies
- Maximize utilization
Target 70%+ average GPU utilization. Use MIG and time-slicing to avoid leaving partial GPU capacity idle.
- Right-size workloads
Monitor actual GPU memory usage versus allocated. Many inference workloads can run on smaller MIG slices.
- Use spot instances
For fault-tolerant training with checkpointing, spot instances can save 60-70% on GPU costs.
- Implement chargeback
Track GPU-hours per team and implement internal chargeback to incentivize efficient usage.
- Power management
Use power capping during non-peak hours to reduce electricity costs without significantly affecting throughput.
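The spot-savings and chargeback ideas above are simple arithmetic; a minimal sketch, with all prices and GPU-hour figures made up for illustration:

```python
# Back-of-the-envelope sketch of spot savings and per-team chargeback.
# All rates and usage numbers are illustrative assumptions.

ON_DEMAND_RATE = 32.77   # $/hr for a hypothetical 8-GPU instance
SPOT_DISCOUNT = 0.65     # 65% savings, within the 60-70% range cited above

def chargeback(usage_by_team, rate_per_gpu_hour):
    """Split the GPU bill across teams by recorded GPU-hours."""
    return {team: round(hours * rate_per_gpu_hour, 2)
            for team, hours in usage_by_team.items()}

spot_rate = ON_DEMAND_RATE * (1 - SPOT_DISCOUNT)
print(f"on-demand: ${ON_DEMAND_RATE:.2f}/hr, spot: ${spot_rate:.2f}/hr")

# Per-team dollars for one month at an assumed blended $/GPU-hour rate.
bill = chargeback({"nlp": 1200, "vision": 800}, rate_per_gpu_hour=4.10)
print(bill)
```

Even a simple monthly report like this is often enough to change behavior: teams that see their own GPU-hour bill tend to release idle allocations.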
Operational Runbooks
Maintain runbooks for common GPU operational scenarios:
- GPU not detected — Check PCIe seating, driver version, kernel module loading
- CUDA out of memory — Identify process consuming memory, check for memory leaks, restart if needed
- Training performance degraded — Check NVLink health, thermal throttling, and competing workloads
- Driver crash (XID 79) — Collect diagnostic data, reset the GPU with `nvidia-smi -r`, escalate if recurring
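The "escalate if recurring" step in the XID 79 runbook needs a definition of "recurring". One sketch is to count resets per GPU inside a sliding window; the 3-resets-per-24-hours policy here is an assumed example, not an NVIDIA guideline.

```python
# Sketch of recurrence tracking for GPU resets: escalate a GPU for RMA
# once it exceeds an assumed limit of resets within a time window.

from collections import defaultdict

RESET_LIMIT = 3    # resets tolerated per window before escalating (assumption)
WINDOW_HOURS = 24

class ResetTracker:
    def __init__(self):
        self.events = defaultdict(list)  # gpu id -> reset timestamps (hours)

    def record_reset(self, gpu_id, t_hours):
        """Log a reset; return True if the GPU should be escalated."""
        window = [t for t in self.events[gpu_id] if t_hours - t < WINDOW_HOURS]
        window.append(t_hours)
        self.events[gpu_id] = window
        return len(window) > RESET_LIMIT

tracker = ResetTracker()
for t in (1, 5, 9, 12):
    print(tracker.record_reset("GPU-0", t))  # fourth reset within 24h escalates
```

Dropping timestamps that fall outside the window keeps the tracker from escalating a GPU for resets spread over weeks.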
Continue Learning
Explore ML pipeline observability to monitor the full lifecycle of your machine learning workflows.