
NVIDIA Drivers

This document describes how to install the NVIDIA CUDA drivers and NVIDIA Docker on each node that contains GPUs. To use the NVIDIA GPU Operator instead, see the Cluster Installation documentation.

If you are using DGX OS, the NVIDIA prerequisites are already installed and you may skip this document.

Perform the following steps on each machine with GPUs.

Step 1: Install the NVIDIA CUDA Toolkit

Run:

nvidia-smi

If the command succeeds, verify that the NVIDIA driver version is 410.104 or later and that the CUDA version is 10 or later.
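
The driver and CUDA versions appear in the header of the nvidia-smi output. As an optional check, you can also print just the driver version (the value returned will depend on your installation):

nvidia-smi --query-gpu=driver_version --format=csv,noheader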

If the command is not successful, you must install the CUDA Toolkit. Follow the instructions here to install it. When the installation has finished, reboot your computer.

If the machine is a DGX A100, then, depending on your operating system, you may have to install the NVIDIA Fabric Manager.

  • If the operating system is DGX OS, no installation is required.
  • For other operating systems:
    • Run: nvidia-smi and get the NVIDIA driver version (it must be 450 or later).
    • Run: sudo apt search fabricmanager to find a Fabric Manager package with the same version and install it, as shown in the sketch below.
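
For example, on an Ubuntu-based system the sequence might look as follows (the package name and version suffix are illustrative; the suffix must match the driver major version reported by nvidia-smi):

nvidia-smi --query-gpu=driver_version --format=csv,noheader
sudo apt search fabricmanager
sudo apt-get install -y nvidia-fabricmanager-450   # example package name; match your driver version
sudo systemctl enable --now nvidia-fabricmanager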

Important

NVIDIA Fabric Manager does not work on CoreOS. If you are working with OpenShift, use RHEL instead, or use the NVIDIA GPU Operator rather than installing NVIDIA software on each machine.

Step 2: Install Docker

Install Docker by following the steps here: https://docs.docker.com/engine/install/. Specifically, you can use a convenience script provided in the document:

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
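
To confirm that Docker was installed correctly, you can optionally run the standard hello-world test image:

sudo docker run hello-world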

Step 3: NVIDIA Container Toolkit (previously named NVIDIA Docker)

To install NVIDIA Docker on Debian-based distributions (such as Ubuntu), run the following:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

For RHEL-based distributions, run:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

For a detailed review of the above instructions, see the NVIDIA Container Toolkit installation instructions.
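
To verify that containers can access the GPUs, you can optionally run nvidia-smi inside a CUDA base image (the image tag below is an example; use any CUDA tag available for your environment):

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi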

Warning

Kubernetes does not currently support the NVIDIA container runtime, which is the successor of NVIDIA Docker/NVIDIA container toolkit.

Step 4: Make NVIDIA Docker the default docker runtime

Set the NVIDIA runtime as the default Docker runtime on your node. Edit the Docker daemon configuration file at /etc/docker/daemon.json and add the default-runtime key as follows:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Then run the following again:

sudo pkill -SIGHUP dockerd
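
To confirm that the change took effect, you can check the Docker daemon information; the output should list nvidia as the default runtime:

sudo docker info | grep -i runtime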

Last update: August 26, 2021