nvidia-container-toolkit: After updating to the latest Debian package, Docker containers don't work

I upgraded nvidia-container-toolkit on my Debian 11.6 system and suddenly my NVIDIA-enabled Docker containers stopped working. I start them using docker-compose. Here’s one of the docker-compose.yml files:

version: "3"
services:
  plex:
    container_name: plex
    restart: unless-stopped
    entrypoint:
      - /init
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
      - TZ=CET
      - PLEX_CLAIM=claim-FXncUm-C8zdJzxxdBEEz
      - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
      - TERM=xterm
      - LANG=C.UTF-8
      - LC_ALL=C.UTF-8
      - CHANGE_CONFIG_DIR_OWNERSHIP=true
      - HOME=/config
    expose:
      - 1900/udp
      - 3005/tcp
      - 32400/tcp
      - 32410/udp
      - 32412/udp
      - 32413/udp
      - 32414/udp
      - 32469/tcp
      - 8324/tcp
    hostname: plextower
    image: plexinc/pms-docker:plexpass
    #    image: plexinc/pms-docker:latest
    ipc: private
    logging:
      driver: json-file
      options: {}
    dns: 10.101.100.1
    networks:
      macvlan-plexdmz:
        ipv4_address: 10.101.100.200
        aliases:
          - plextower
    volumes:
      - /mnt/cache/appdata/pms-docker:/config
      - /mnt/cache/appdata/pms-docker-transcode:/tmp
      - /mnt/user/media:/data
networks:
  macvlan-plexdmz:
    external: true

Here’s the output I get after executing docker-compose up -d:

Starting plex ... error

ERROR: for plex  Cannot start service plex: failed to create shim task: OCI runtime create failed: failed to create NVIDIA Container Runtime: failed to construct OCI spec modifier: failed to construct discoverer: failed to create Xorg discoverer: failed to locate libcuda.so: pattern libcuda.so.*.*.* not found: unknown

ERROR: for plex  Cannot start service plex: failed to create shim task: OCI runtime create failed: failed to create NVIDIA Container Runtime: failed to construct OCI spec modifier: failed to construct discoverer: failed to create Xorg discoverer: failed to locate libcuda.so: pattern libcuda.so.*.*.* not found: unknown
ERROR: Encountered errors while bringing up the project.

I have no clue whatsoever what’s wrong (although I guess the error messages make sense if you know more about this stuff than me) and Google can’t help.

Thanks for any help!

/k

Most upvoted comments

Same issue here after installing nvidia-container-toolkit=1.13.0-1 (note that I am working on a Jetson device):

docker run -it --rm --net=host --runtime nvidia -e DISPLAY=$DISPLAY -v /tmp/.X11-unix/:/tmp/.X11-unix nvcr.io/nvidia/l4t-base:r35.1.0
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: failed to create NVIDIA Container Runtime: failed to construct OCI spec modifier: failed to construct discoverer: failed to create Xorg discoverer: failed to locate libcuda.so: pattern libcuda.so.*.*.* not found: unknown.

I temporarily solved it by downgrading to 1.12.1-1:

sudo apt install --allow-downgrades nvidia-container-toolkit=1.12.1-1 nvidia-container-toolkit-base=1.12.1-1
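
As a side note, if apt keeps upgrading the packages right back, you can hold them at the pinned version until a fixed release is out (assuming apt-mark is available on your system):

sudo apt-mark hold nvidia-container-toolkit nvidia-container-toolkit-base
# later, once a fixed version is released:
sudo apt-mark unhold nvidia-container-toolkit nvidia-container-toolkit-base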

We have just published v1.13.1 of our NVIDIA Container Toolkit packages. These should include the fix for the issues you were experiencing.

Let me know if you’re still seeing problems.

For now, I will release 1.13.1 to fix the crash on Debian systems.

What would be useful is the link chain showing how libcuda.so.1 is resolved to the actual library (most likely /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.530.30.02).
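
Something along these lines should show that chain (a sketch; the library path below is just an example, use whatever path ldconfig reports):

/sbin/ldconfig -p | grep libcuda.so
readlink -f /usr/lib/x86_64-linux-gnu/libcuda.so.1   # follow the symlink chain to the real file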

I will also spin up a debian system to dig a bit further on my side, but this will be after the immediate crash is fixed.

@angudadevops the error message comes from us splitting the nvidia-container-toolkit package as part of the 1.11.0 release, but after the 1.11.0~rc1 that you have installed.

The simplest solution is to first uninstall the nvidia-container-toolkit=1.11.0~rc.1-1 package before installing the 1.13.1 version (note that 1.13.2 was released earlier this week).

Another note: the nvidia-container-runtime=3.13.0-1 package is no longer required, and has not been for a number of NVIDIA Container Toolkit releases.

My recommendation is thus:

  • remove the nvidia-container-toolkit=1.11.0~rc.1-1 package
  • remove the nvidia-container-runtime* packages (removal commands are sketched below)
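
A hedged sketch of those removal steps, assuming apt and that the installed package names match what the first command reports:

apt list --installed 2>/dev/null | grep nvidia-container
sudo apt-get remove nvidia-container-toolkit nvidia-container-runtime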

Install the NVIDIA Container Toolkit:

sudo apt-get install nvidia-container-toolkit=1.13.2-1 nvidia-container-toolkit-base=1.13.2-1 libnvidia-container-tools=1.13.2-1 libnvidia-container1=1.13.2-1

@obarisk we have not yet published the packages to the CUDA Downloads Repos for debian due to some internal tooling that needs to change. You can use the steps described in our docs to install the packages from our GitHub-Pages repositories though.

@obarisk ok. Thanks for the confirmation. The issue you’re seeing is because you’re using the ubuntu packages. The only functional difference between the Ubuntu and Debian packages is the config file that refers to /sbin/ldconfig.real instead of /sbin/ldconfig. Installing the debian11 packages should address this.

@obarisk which package did you install?

The issue is that the ldconfig entry in the /etc/nvidia-container-runtime/config.toml seems incorrect for your distribution. Please replace /sbin/ldconfig.real with /sbin/ldconfig and try again.
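
For example, a one-line edit with sed (this assumes the stock config location; the .bak suffix keeps a backup of the original file):

sudo sed -i.bak 's|/sbin/ldconfig.real|/sbin/ldconfig|' /etc/nvidia-container-runtime/config.toml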

We have work in progress to generate distribution-specific configs instead of relying on hard-coded values.

Great. Thanks for the confirmation. I will get the release out later this week or early next week.

@elezar worked great for me on Debian 11/Proxmox – thanks!

The MR has a short sha of 2136266d. To extract the packages built in this pipeline run:

$ docker run --rm -ti -v $(pwd):$(pwd) -u $(id -u):$(id -g) registry.gitlab.com/nvidia/container-toolkit/container-toolkit/staging/container-toolkit:2136266d-packaging cp -r /artifacts/packages/ubuntu18.04/amd64 $(pwd)/nvct-packages/

This will create an nvct-packages folder in the current folder:

$ tree nvct-packages
nvct-packages
├── libnvidia-container-dev_1.13.1-1_amd64.deb
├── libnvidia-container-tools_1.13.1-1_amd64.deb
├── libnvidia-container1-dbg_1.13.1-1_amd64.deb
├── libnvidia-container1_1.13.1-1_amd64.deb
├── nvidia-container-runtime_3.13.0-1_all.deb
├── nvidia-container-toolkit-base_1.13.1-1_amd64.deb
├── nvidia-container-toolkit-operator-extensions_1.13.1-1_amd64.deb
├── nvidia-container-toolkit_1.13.1-1_amd64.deb
└── nvidia-docker2_2.13.0-1_all.deb

The nvidia-container-toolkit-base_1.13.1-1_amd64.deb is the package that contains the nvidia-container-runtime binary with the fix for this issue.

Note that this is an ubuntu18.04 package. This is compatible with all newer Debian-based systems, but may require a modification to the /etc/nvidia-container-runtime/config.toml.

On a Debian system the file should contain:

ldconfig = "@/sbin/ldconfig"

and not

ldconfig = "@/sbin/ldconfig.real"

I have been able to reproduce the behaviour on a test system and will continue working on improving things for Debian-based systems.

If you want to build the packages yourself, you should be able to run:

./scripts/build-all-components.sh debian10-amd64

which will build all packages (as above) in the dist/debian10/amd64 folder. Note that the debian10 packages are compatible with debian11.
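
To install the locally built packages, something like the following should work (a sketch; the exact filenames depend on the version that was built):

cd dist/debian10/amd64
sudo apt-get install ./libnvidia-container1_*.deb ./libnvidia-container-tools_*.deb ./nvidia-container-toolkit-base_*.deb ./nvidia-container-toolkit_*.deb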

Hi @klausagnoletti. Thanks for the update. Could you let me know where libcuda.so.* is located on your system?

It would also be useful to know where the NVIDIA Xorg libraries are located (libglxserver_nvidia.so.*).
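
Something like the following should list both (a sketch; the search root may differ depending on how the driver was packaged):

find /usr/lib -name 'libcuda.so*' 2>/dev/null
find /usr/lib -name 'libglxserver_nvidia.so*' 2>/dev/null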

Regardless, I will push out a patch release that handles this error more gracefully.