Enforcing Docker & Nvidia Container Toolkit Persistence

My previous post covered enabling the Nvidia Container Toolkit on an Ubuntu VM (Ubuntu 24.04.2 LTS), and after running it for a few weeks I noticed the following:

  • The Nvidia driver did not persist across a reboot; the GPU was not initialized until nvidia-smi was run manually
  • The Nvidia Container Toolkit seemed to disconnect the GPU from containers after 24-48 hours

Both of the above left the containers unable to use the GPU, e.g. when attempting to transcode with Jellyfin.
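A quick way to confirm both symptoms is to run nvidia-smi on the host and then inside one of the GPU containers (a minimal sketch; the container name jellyfin is an assumption, substitute whatever your compose file uses):

    # On the host: lists the GPU; if the driver has not been initialized
    # since boot, this call is what loads it.
    nvidia-smi

    # Inside the container: when the GPU has been detached, this fails
    # with "Failed to initialize NVML: Unknown Error".
    docker exec -it jellyfin nvidia-smi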

Add Kernel Parameter via GRUB

The following kernel parameter reverts the host to cgroup v1 (disabling systemd's unified cgroup hierarchy), which is the documented workaround for containers being abruptly detached from the GPU (see the nvidia-container-toolkit issue linked at the end of this post) and keeps the Nvidia devices available from boot.

  1. Check that the default Grub file exists: cat /etc/default/grub
  2. Edit the file (e.g. sudo vi /etc/default/grub) and add the parameter to the GRUB_CMDLINE_LINUX_DEFAULT line:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"
  3. After saving the file, update grub with sudo update-grub and reboot; a quick sanity check follows below.
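After the reboot, a quick sanity check (assuming a standard systemd layout) is to confirm the parameter is on the kernel command line and that the host is back on cgroup v1:

    # Should show systemd.unified_cgroup_hierarchy=0 among the boot parameters
    cat /proc/cmdline

    # Prints "tmpfs" on cgroup v1; "cgroup2fs" means the unified hierarchy
    # is still active
    stat -fc %T /sys/fs/cgroup/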

Adding Nvidia /dev:/dev to Docker Compose

I also updated the Docker Compose files for all GPU-related containers; the example below shows the runtime, devices, and resources added for Jellyfin.

Doing this ensures all Nvidia devices on the host are explicitly mapped into the containers, rather than relying solely on the Nvidia runtime to attach them.

    runtime: nvidia
    devices:
      - '/dev/nvidia-caps:/dev/nvidia-caps'
      - '/dev/nvidia0:/dev/nvidia0'
      - '/dev/nvidiactl:/dev/nvidiactl'
      - '/dev/nvidia-modeset:/dev/nvidia-modeset'
      - '/dev/nvidia-uvm:/dev/nvidia-uvm'
      - '/dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
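With the compose file updated, recreating the container and running nvidia-smi inside it confirms the devices are mapped through (a sketch; the service name jellyfin is an assumption):

    # Recreate the service so the new runtime/device settings take effect
    docker compose up -d --force-recreate jellyfin

    # Should list the GPU from inside the container
    docker compose exec jellyfin nvidia-smi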
Related links

  • CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected · Issue #666 · HaveAGitGat/Tdarr
  • NOTICE: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" · Issue #48 · NVIDIA/nvidia-container-toolkit
  • NVIDIA GPU | Jellyfin (full video hardware acceleration via NVENC)