Advanced GPU Management Best Practices

Managing a GPU fleet at scale requires systematic processes for maintenance, driver lifecycle management, hardware fault handling, and cost optimization. This lesson covers the operational best practices that keep GPU infrastructure reliable and cost-effective.

GPU Fleet Maintenance

  • Rolling driver updates — Update GPU drivers one node at a time, using Kubernetes cordoning and draining to avoid disrupting workloads
  • DCGM diagnostics schedule — Run Level 2 diagnostics weekly on each node during maintenance windows
  • Firmware updates — Track GPU firmware versions and schedule updates with NVIDIA's recommended cadence
  • Cooling system monitoring — Monitor ambient temperature and fan speeds; thermal issues affect entire racks
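The rolling driver update described above can be sketched as an ordered, one-node-at-a-time plan. This is an illustrative sketch, not an official tool: the `kubectl` commands are standard, but `install-gpu-driver` is a hypothetical placeholder for whatever driver installer your fleet uses.

```python
# Sketch of a rolling GPU driver update: cordon -> drain -> update ->
# verify -> uncordon, one node at a time so only a single GPU node is
# out of service at any moment.

def rolling_update_plan(nodes, driver_version):
    """Yield (node, command) steps for a one-node-at-a-time driver update."""
    for node in nodes:
        yield node, f"kubectl cordon {node}"
        yield node, f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data"
        yield node, f"install-gpu-driver --version {driver_version}"  # hypothetical installer
        yield node, "nvidia-smi --query-gpu=driver_version --format=csv,noheader"
        yield node, f"kubectl uncordon {node}"

steps = list(rolling_update_plan(["gpu-node-1", "gpu-node-2"], "550.54.15"))
```

Generating the plan separately from executing it makes the procedure easy to review, log, and resume if an update fails partway through the fleet.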

Hardware Fault Handling

| Fault Type | Detection | Response |
| --- | --- | --- |
| Correctable ECC errors | DCGM counter threshold | Monitor trend; schedule replacement if rate increases |
| Uncorrectable ECC errors | DCGM alert (XID 48) | Immediately drain node and isolate GPU for RMA |
| NVLink failures | XID 74, bandwidth degradation | Remove from multi-GPU training pool; use for single-GPU workloads |
| Thermal throttling | Temperature exceeds 83 °C | Check cooling, reduce workload, inspect physical installation |
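The ECC rows of the table can be expressed as a small decision function. The responses mirror the table; the correctable-error rate threshold is an illustrative assumption (in practice it is site-specific and tuned from DCGM counter history), not NVIDIA guidance.

```python
def ecc_response(correctable_rate_per_day, uncorrectable_count,
                 rate_threshold=100.0):
    """Map ECC counter readings to the responses in the fault table.

    correctable_rate_per_day: correctable ECC errors per day (e.g. from DCGM)
    uncorrectable_count: uncorrectable (double-bit) ECC errors observed
    rate_threshold: illustrative trend threshold; tune per site
    """
    if uncorrectable_count > 0:
        return "drain-and-rma"          # uncorrectable: isolate GPU immediately
    if correctable_rate_per_day > rate_threshold:
        return "schedule-replacement"   # rising trend: plan maintenance
    return "monitor"
```

Keeping the thresholds in one place makes it straightforward to wire this logic into an alerting pipeline fed by DCGM counters.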

Cost Optimization Strategies

  1. Maximize utilization

    Target 70%+ average GPU utilization. Use MIG and time-slicing to avoid leaving partial GPU capacity idle.

  2. Right-size workloads

    Monitor actual GPU memory usage versus allocated. Many inference workloads can run on smaller MIG slices.

  3. Use spot instances

    For fault-tolerant training with checkpointing, spot instances can save 60-70% on GPU costs.

  4. Implement chargeback

    Track GPU-hours per team and implement internal chargeback to incentivize efficient usage.

  5. Power management

    Use power capping during non-peak hours to reduce electricity costs without significantly affecting throughput.
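The right-sizing step above can be sketched as picking the smallest MIG slice whose memory covers a workload's observed peak usage plus headroom. The profile list is the standard A100 40GB MIG set; the 20% headroom factor is an illustrative assumption, not an NVIDIA recommendation.

```python
# A100 40GB MIG profiles as (name, memory in GB).
MIG_PROFILES = [("1g.5gb", 5), ("2g.10gb", 10), ("3g.20gb", 20),
                ("4g.20gb", 20), ("7g.40gb", 40)]

def right_size(peak_mem_gb, headroom=1.2):
    """Return the smallest MIG profile fitting peak usage plus headroom."""
    needed = peak_mem_gb * headroom
    for name, mem in MIG_PROFILES:
        if mem >= needed:
            return name
    return None  # workload needs a full (or larger) GPU

assert right_size(3.5) == "1g.5gb"  # 4.2 GB needed fits the 5 GB slice
```

Comparing this recommendation against each workload's current allocation is a quick way to surface inference services that are holding full GPUs they do not need.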

Operational Runbooks

Maintain runbooks for common GPU operational scenarios:

  • GPU not detected — Check PCIe seating, driver version, kernel module loading
  • CUDA out of memory — Identify process consuming memory, check for memory leaks, restart if needed
  • Training performance degraded — Check NVLink health, thermal throttling, and competing workloads
  • GPU fallen off the bus (XID 79) — Collect diagnostic data, reset the GPU with nvidia-smi -r, escalate if recurring
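The runbook entries tied to XID codes can be routed automatically from kernel log lines. This sketch covers only the XIDs mentioned in this lesson and assumes the usual "NVRM: Xid (<bus-id>): <code>, ..." dmesg format.

```python
import re

# Runbook routing for the XIDs covered in this lesson.
XID_RUNBOOK = {
    48: "Uncorrectable ECC: drain node, isolate GPU for RMA",
    74: "NVLink error: remove from multi-GPU pool, run single-GPU only",
    79: "GPU fell off the bus: collect diagnostics, reset with nvidia-smi -r",
}

XID_RE = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")

def route_xid(log_line):
    """Return (bus_id, xid, runbook action) for a matching dmesg line."""
    m = XID_RE.search(log_line)
    if not m:
        return None
    bus_id, xid = m.group(1), int(m.group(2))
    return bus_id, xid, XID_RUNBOOK.get(xid, "Unknown XID: consult NVIDIA XID docs")

line = "NVRM: Xid (PCI:0000:3b:00.0): 79, pid=1234, GPU has fallen off the bus."
```

Feeding dmesg through a router like this turns a page at 3 a.m. into a link to the right runbook entry.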

Course Complete: You now have comprehensive knowledge of GPU monitoring and management, from nvidia-smi basics to enterprise DCGM deployment, Grafana dashboards, scheduling strategies, and operational best practices. Apply these skills to build and maintain efficient, reliable GPU infrastructure.

Continue Learning

Explore ML pipeline observability to monitor the full lifecycle of your machine learning workflows.

ML Pipeline Observability →