Intermediate
CUDA & Driver Management
Install and manage NVIDIA drivers, CUDA toolkit, cuDNN, and the NVIDIA Container Toolkit across GPU server fleets with Ansible roles.
NVIDIA Driver Installation Role
---
# roles/nvidia-driver/tasks/main.yml
- name: Add NVIDIA CUDA repository key
  apt_key:
    url: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
    state: present

- name: Add NVIDIA CUDA repository
  apt_repository:
    repo: "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 /"
    state: present

- name: Install NVIDIA driver
  apt:
    name: "nvidia-driver-{{ nvidia_driver_version | default('550') }}"
    state: present
  notify: Reboot server

- name: Install CUDA toolkit
  apt:
    name: "cuda-toolkit-{{ cuda_version | default('12-4') }}"
    state: present

- name: Install cuDNN
  apt:
    name: "libcudnn8={{ cudnn_version | default('8.9.*') }}"
    state: present

- name: Verify GPU detection
  command: nvidia-smi
  register: nvidia_smi_output
  changed_when: false

- name: Display GPU info
  debug:
    msg: "{{ nvidia_smi_output.stdout_lines[:5] }}"
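The `notify: Reboot server` in the driver task assumes a matching handler exists in the role. A minimal sketch of that handler file (not shown in the original; the timeout value is an assumption to tune for your hardware) could look like:

```yaml
---
# roles/nvidia-driver/handlers/main.yml (sketch; handler name must match the notify above)
- name: Reboot server
  reboot:
    reboot_timeout: 600   # assumed value; allow time for POST on large GPU nodes
    msg: "Rebooting to load the new NVIDIA kernel module"
```

Because handlers run at the end of the play, the `nvidia-smi` verification task will fail on a fresh install until the reboot has occurred; running the role twice, or flushing handlers before the verification step, avoids that ordering issue.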
Container Runtime Setup
---
# roles/nvidia-container/tasks/main.yml
- name: Install Docker
  apt:
    name: docker.io
    state: present

- name: Add NVIDIA Container Toolkit repo
  apt_repository:
    repo: "deb https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /"
    state: present

- name: Install NVIDIA Container Toolkit
  apt:
    name: nvidia-container-toolkit
    state: present

- name: Configure Docker runtime
  copy:
    dest: /etc/docker/daemon.json
    content: |
      {
        "default-runtime": "nvidia",
        "runtimes": {
          "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
          }
        }
      }
  notify: Restart Docker
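As with the driver role, the `notify: Restart Docker` needs a handler in the role. A minimal sketch, plus an optional smoke test that the runtime actually exposes the GPU (the CUDA image tag is illustrative; pick one matching your installed CUDA version):

```yaml
---
# roles/nvidia-container/handlers/main.yml (sketch; handler name must match the notify above)
- name: Restart Docker
  service:
    name: docker
    state: restarted

# Optional verification task for roles/nvidia-container/tasks/main.yml:
- name: Smoke-test GPU access from a container
  command: docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
  changed_when: false
```

Note that the `copy` task overwrites any existing `/etc/docker/daemon.json`; if your fleet already carries Docker settings there, merge them into the template rather than replacing the file wholesale.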
Version pinning: Always pin driver and CUDA versions using Ansible variables. Mismatched versions across your fleet can cause hard-to-debug training failures. Test version upgrades on a single node before rolling out fleet-wide.
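The pinning advice above can be expressed as group variables consumed by the two roles, with a canary rollout driven by `--limit` (hostnames and exact version strings here are illustrative, not prescriptive):

```yaml
---
# group_vars/gpu_servers.yml — pin exact versions fleet-wide (values are examples)
nvidia_driver_version: "550"
cuda_version: "12-4"
cudnn_version: "8.9.*"

# site.yml
- hosts: gpu_servers
  become: true
  roles:
    - nvidia-driver
    - nvidia-container
```

Test an upgrade on one node first, e.g. `ansible-playbook site.yml --limit gpu-canary-01`, then roll out fleet-wide once training jobs pass on the canary.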