CUDA & Driver Management

Install and manage NVIDIA drivers, CUDA toolkit, cuDNN, and the NVIDIA Container Toolkit across GPU server fleets with Ansible roles.
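The roles below can be wired together in a top-level playbook; a minimal sketch (the playbook name, group name, and pinned versions are illustrative):

```yaml
# gpu-servers.yml — example playbook applying both roles (names are illustrative)
- hosts: gpu_servers
  become: true
  vars:
    nvidia_driver_version: "550"
    cuda_version: "12-4"
  roles:
    - nvidia-driver
    - nvidia-container
```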

NVIDIA Driver Installation Role

---
# roles/nvidia-driver/tasks/main.yml
# apt_key is deprecated; store the key in a keyring file and reference it
# from the repo entry with signed-by instead
- name: Add NVIDIA CUDA repository key
  get_url:
    url: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
    dest: /usr/share/keyrings/nvidia-cuda.asc
    mode: "0644"

- name: Add NVIDIA CUDA repository
  apt_repository:
    repo: "deb [signed-by=/usr/share/keyrings/nvidia-cuda.asc] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 /"
    state: present

- name: Install NVIDIA driver
  apt:
    name: "nvidia-driver-{{ nvidia_driver_version | default('550') }}"
    state: present
  notify: Reboot server

- name: Install CUDA toolkit
  apt:
    name: "cuda-toolkit-{{ cuda_version | default('12-4') }}"
    state: present

- name: Install cuDNN
  apt:
    name: "libcudnn8={{ cudnn_version | default('8.9.*') }}"
    state: present

- name: Verify GPU detection
  command: nvidia-smi
  register: nvidia_smi_output
  changed_when: false

- name: Display GPU info
  debug:
    msg: "{{ nvidia_smi_output.stdout_lines[:5] }}"
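The driver task notifies a `Reboot server` handler, which the role must also define; a minimal sketch using Ansible's built-in `reboot` module (the timeout and delay values are assumptions, tune them for your hardware):

```yaml
# roles/nvidia-driver/handlers/main.yml — sketch; timeout values are assumptions
- name: Reboot server
  reboot:
    reboot_timeout: 600    # wait up to 10 minutes for the host to return
    post_reboot_delay: 30  # give services time to settle before continuing
```

A reboot is required for the kernel to load the new NVIDIA module, so the `nvidia-smi` verification task should run only after the handler has fired (e.g. via `meta: flush_handlers`).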

Container Runtime Setup

---
# roles/nvidia-container/tasks/main.yml
- name: Install Docker
  apt:
    name: docker.io
    state: present

- name: Add NVIDIA Container Toolkit repository key
  get_url:
    url: https://nvidia.github.io/libnvidia-container/gpgkey
    dest: /usr/share/keyrings/nvidia-container-toolkit.asc
    mode: "0644"

- name: Add NVIDIA Container Toolkit repo
  apt_repository:
    # $(ARCH) is expanded by apt itself, not by Ansible
    repo: "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit.asc] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /"
    state: present

- name: Install NVIDIA Container Toolkit
  apt:
    name: nvidia-container-toolkit
    state: present

- name: Configure Docker runtime
  copy:
    dest: /etc/docker/daemon.json
    content: |
      {
        "default-runtime": "nvidia",
        "runtimes": {
          "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
          }
        }
      }
  notify: Restart Docker
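The `Restart Docker` handler referenced above also needs a definition; a minimal sketch:

```yaml
# roles/nvidia-container/handlers/main.yml — sketch
- name: Restart Docker
  service:
    name: docker
    state: restarted
```

Once Docker has restarted, a quick smoke test from the command line confirms the runtime passes the GPU through (image tag is illustrative): `docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi`.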
💡 Version pinning: Always pin driver and CUDA versions using Ansible variables. Mismatched versions across your fleet can cause hard-to-debug training failures. Test version upgrades on a single node before rolling out fleet-wide.
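One way to keep the pins in a single place is a group_vars file for the GPU hosts; a sketch (group name and version values are illustrative):

```yaml
# group_vars/gpu_servers.yml — example version pins (values are illustrative)
nvidia_driver_version: "550"
cuda_version: "12-4"
cudnn_version: "8.9.*"
```

To canary an upgrade, override these variables for a single host (e.g. in host_vars) before promoting the new versions fleet-wide.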