Advanced GPU Management Best Practices
Managing a GPU fleet at scale requires systematic processes for maintenance, driver lifecycle management, hardware fault handling, and cost optimization. This lesson covers the operational best practices that keep GPU infrastructure reliable and cost-effective.
GPU Fleet Maintenance
- Rolling driver updates — Update GPU drivers one node at a time, using Kubernetes cordoning and draining to avoid disrupting workloads
- DCGM diagnostics schedule — Run DCGM Level 2 diagnostics (`dcgmi diag -r 2`) on each node weekly during maintenance windows
- Firmware updates — Track GPU firmware versions and schedule updates following NVIDIA's recommended cadence
- Cooling system monitoring — Monitor ambient temperature and fan speeds; thermal issues affect entire racks
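The rolling-update bullet above can be sketched as a per-node command plan: cordon, drain, update, verify, then uncordon before touching the next node. The node names, driver version, and the `install-nvidia-driver.sh` installer script are hypothetical placeholders; only the `kubectl` and `nvidia-smi` commands are standard.

```python
# Sketch of a rolling driver update: one node at a time, cordon -> drain ->
# update -> verify -> uncordon. Node names and the installer script are
# illustrative assumptions, not a specific vendor tool.

def rolling_update_plan(nodes, driver_version):
    """Return the ordered shell commands for a one-node-at-a-time update."""
    plan = []
    for node in nodes:
        plan += [
            f"kubectl cordon {node}",
            f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data",
            f"ssh {node} 'sudo ./install-nvidia-driver.sh {driver_version}'",  # placeholder installer
            f"ssh {node} nvidia-smi",  # verify the driver loads before readmitting the node
            f"kubectl uncordon {node}",
        ]
    return plan

for cmd in rolling_update_plan(["gpu-node-01", "gpu-node-02"], "550.54.15"):
    print(cmd)
```

Generating the plan up front, rather than scripting the commands inline, makes it easy to review the exact sequence before a maintenance window begins.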
Hardware Fault Handling
| Fault Type | Detection | Response |
|---|---|---|
| Correctable ECC errors | DCGM counter threshold | Monitor trend; schedule replacement if rate increases |
| Uncorrectable ECC errors | DCGM alert, XID 48 | Immediately drain node and isolate GPU for RMA |
| NVLink failures | XID 74, bandwidth degradation | Remove from multi-GPU training pool; use for single-GPU workloads |
| Thermal throttling | Temperature exceeds 83 °C | Check cooling, reduce workload, inspect physical installation |
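The table above amounts to a small triage policy, which a fleet-health service could encode roughly as follows. The correctable-ECC rate threshold is an illustrative assumption; the XID codes and responses come from the table.

```python
# Minimal triage sketch mapping the fault table to a response.
# The correctable-ECC alerting threshold is an assumed example value.

CORRECTABLE_ECC_RATE_THRESHOLD = 100  # errors/day; illustrative assumption

def triage(xid=None, correctable_ecc_per_day=0, temp_c=0.0):
    """Return the recommended response for the highest-severity signal."""
    if xid == 48:                      # uncorrectable ECC error
        return "drain node, isolate GPU for RMA"
    if xid == 74:                      # NVLink failure
        return "remove from multi-GPU pool"
    if temp_c > 83:                    # thermal throttling
        return "check cooling, reduce workload"
    if correctable_ecc_per_day > CORRECTABLE_ECC_RATE_THRESHOLD:
        return "monitor trend, schedule replacement"
    return "healthy"

print(triage(xid=48))                       # -> drain node, isolate GPU for RMA
print(triage(correctable_ecc_per_day=250))  # -> monitor trend, schedule replacement
```

Ordering the checks by severity matters: an uncorrectable error should always win over a correctable-error trend on the same GPU.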
Cost Optimization Strategies
- Maximize utilization
Target 70%+ average GPU utilization. Use MIG and time-slicing to avoid leaving partial GPU capacity idle.
- Right-size workloads
Monitor actual GPU memory usage versus allocated. Many inference workloads can run on smaller MIG slices.
- Use spot instances
For fault-tolerant training with checkpointing, spot instances can save 60-70% on GPU costs.
- Implement chargeback
Track GPU-hours per team and implement internal chargeback to incentivize efficient usage.
- Power management
Use power capping during non-peak hours to reduce electricity costs without significantly affecting throughput.
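The spot-savings and chargeback ideas above are simple arithmetic; a minimal sketch, with all prices and GPU-hour figures made up for illustration:

```python
# Back-of-the-envelope sketch of spot savings and per-team chargeback.
# All rates and usage numbers are illustrative assumptions.

ON_DEMAND_RATE = 32.77   # $/hr for a hypothetical 8-GPU instance
SPOT_DISCOUNT = 0.65     # 65% savings, within the 60-70% range cited above

def chargeback(usage_by_team, rate_per_gpu_hour):
    """Split the GPU bill across teams by recorded GPU-hours."""
    return {team: round(hours * rate_per_gpu_hour, 2)
            for team, hours in usage_by_team.items()}

spot_rate = ON_DEMAND_RATE * (1 - SPOT_DISCOUNT)
print(f"on-demand: ${ON_DEMAND_RATE:.2f}/hr, spot: ${spot_rate:.2f}/hr")

# Per-team dollars for one month at an assumed blended $/GPU-hour rate.
bill = chargeback({"nlp": 1200, "vision": 800}, rate_per_gpu_hour=4.10)
print(bill)
```

Even a simple monthly report like this is often enough to change behavior: teams that see their own GPU-hour bill tend to release idle allocations.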
Operational Runbooks
Maintain runbooks for common GPU operational scenarios:
- GPU not detected — Check PCIe seating, driver version, kernel module loading
- CUDA out of memory — Identify process consuming memory, check for memory leaks, restart if needed
- Training performance degraded — Check NVLink health, thermal throttling, and competing workloads
- Driver crash (XID 79) — Collect diagnostic data, reset the GPU with `nvidia-smi -r`, escalate if recurring
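The "escalate if recurring" step in the XID 79 runbook needs a definition of "recurring". One sketch is to count resets per GPU inside a sliding window; the 3-resets-per-24-hours policy here is an assumed example, not an NVIDIA guideline.

```python
# Sketch of recurrence tracking for GPU resets: escalate a GPU for RMA
# once it exceeds an assumed limit of resets within a time window.

from collections import defaultdict

RESET_LIMIT = 3    # resets tolerated per window before escalating (assumption)
WINDOW_HOURS = 24

class ResetTracker:
    def __init__(self):
        self.events = defaultdict(list)  # gpu id -> reset timestamps (hours)

    def record_reset(self, gpu_id, t_hours):
        """Log a reset; return True if the GPU should be escalated."""
        window = [t for t in self.events[gpu_id] if t_hours - t < WINDOW_HOURS]
        window.append(t_hours)
        self.events[gpu_id] = window
        return len(window) > RESET_LIMIT

tracker = ResetTracker()
for t in (1, 5, 9, 12):
    print(tracker.record_reset("GPU-0", t))  # fourth reset within 24h escalates
```

Dropping timestamps that fall outside the window keeps the tracker from escalating a GPU for resets spread over weeks.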
Continue Learning
Explore ML pipeline observability to monitor the full lifecycle of your machine learning workflows.