libnvidia-container: Cannot set up cuda-compat driver when there's an alternatives symlink
The host has CUDA 10.2 and the 440 driver installed, and I'm trying to use CUDA 11.1 with the 455 driver (via cuda-compat-11-1) in Docker.

Surprisingly, I found that with cuda-toolkit-11-1 installed, the cuda-compat-11-1 driver cannot be loaded and nvidia-smi shows 10.2 instead of 11.1. With all of cuda-toolkit-11-1's dependencies installed but not the package itself, nvidia-smi still shows 11.1, so I tracked the bug down to the update-alternatives calls in the packaging scriptlets of cuda-toolkit-11-1:
% rpm --scripts -qlp ~/Downloads/cuda-toolkit-11-1-11.1.1-1.x86_64.rpm
warning: /Users/xkszltl/Downloads/cuda-toolkit-11-1-11.1.1-1.x86_64.rpm: Header V3 RSA/SHA512 Signature, key ID 7fa2af80: NOKEY
postuninstall scriptlet (using /bin/sh):
update-alternatives --remove cuda /usr/local/cuda-11.1
update-alternatives --remove cuda-11 /usr/local/cuda-11.1
posttrans scriptlet (using /bin/sh):
update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-11.1 11
update-alternatives --install /usr/local/cuda-11 cuda-11 /usr/local/cuda-11.1 11
(contains no files)
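Those scriptlets leave behind a two-hop symlink chain. On a machine with the package installed, it should look roughly like this (a sketch of the expected readlink output, which the ln repro below mimics):

# readlink /usr/local/cuda
/etc/alternatives/cuda
# readlink /etc/alternatives/cuda
/usr/local/cuda-11.1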
Here's a Dockerfile repro that mimics the update-alternatives behavior with ln:
FROM nvidia/cuda:11.1-base-centos7
RUN ln -sfT /usr/local/cuda-11.1 /etc/alternatives/cuda
RUN ln -sfT /etc/alternatives/cuda /usr/local/cuda
The results differ with and without the 3rd line.

Without the 3rd line:
# cat Dockerfile && sudo docker build --pull --no-cache -t cuda_jump . && sudo docker run --rm -it --gpus all cuda_jump nvidia-smi
FROM nvidia/cuda:11.1-base-centos7
RUN ln -sfT /usr/local/cuda-11.1 /etc/alternatives/cuda
# RUN ln -sfT /etc/alternatives/cuda /usr/local/cuda
Sending build context to Docker daemon 6.868MB
Step 1/2 : FROM nvidia/cuda:11.1-base-centos7
11.1-base-centos7: Pulling from nvidia/cuda
Digest: sha256:759a04c1d9e59cc894889b4edae4684b07ac2f7d20214edf7cf7a43a80f35d22
Status: Image is up to date for nvidia/cuda:11.1-base-centos7
---> 165de1193617
Step 2/2 : RUN ln -sfT /usr/local/cuda-11.1 /etc/alternatives/cuda
---> Running in ca0b189bce0f
Removing intermediate container ca0b189bce0f
---> fcc95533128a
Successfully built fcc95533128a
Successfully tagged cuda_jump:latest
Wed Nov 4 19:04:23 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00001446:00:00.0 Off | 0 |
| N/A 41C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00002232:00:00.0 Off | 0 |
| N/A 42C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 0000317B:00:00.0 Off | 0 |
| N/A 40C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00004A18:00:00.0 Off | 0 |
| N/A 42C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 000059F4:00:00.0 Off | 0 |
| N/A 40C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00007C99:00:00.0 Off | 0 |
| N/A 40C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 0000B32A:00:00.0 Off | 0 |
| N/A 39C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 0000F250:00:00.0 Off | 0 |
| N/A 40C P0 46W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
With the 3rd line:
# cat Dockerfile && sudo docker build --pull --no-cache -t cuda_jump . && sudo docker run --rm -it --gpus all cuda_jump nvidia-smi
FROM nvidia/cuda:11.1-base-centos7
RUN ln -sfT /usr/local/cuda-11.1 /etc/alternatives/cuda
RUN ln -sfT /etc/alternatives/cuda /usr/local/cuda
Sending build context to Docker daemon 6.868MB
Step 1/3 : FROM nvidia/cuda:11.1-base-centos7
11.1-base-centos7: Pulling from nvidia/cuda
Digest: sha256:759a04c1d9e59cc894889b4edae4684b07ac2f7d20214edf7cf7a43a80f35d22
Status: Image is up to date for nvidia/cuda:11.1-base-centos7
---> 165de1193617
Step 2/3 : RUN ln -sfT /usr/local/cuda-11.1 /etc/alternatives/cuda
---> Running in 7894f6ff3d85
Removing intermediate container 7894f6ff3d85
---> feb25628ec57
Step 3/3 : RUN ln -sfT /etc/alternatives/cuda /usr/local/cuda
---> Running in e5792cbfc1b1
Removing intermediate container e5792cbfc1b1
---> 9de15a3f151c
Successfully built 9de15a3f151c
Successfully tagged cuda_jump:latest
Wed Nov 4 19:03:02 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00001446:00:00.0 Off | 0 |
| N/A 40C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00002232:00:00.0 Off | 0 |
| N/A 42C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 0000317B:00:00.0 Off | 0 |
| N/A 40C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00004A18:00:00.0 Off | 0 |
| N/A 42C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 000059F4:00:00.0 Off | 0 |
| N/A 40C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00007C99:00:00.0 Off | 0 |
| N/A 40C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 0000B32A:00:00.0 Off | 0 |
| N/A 39C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 0000F250:00:00.0 Off | 0 |
| N/A 40C P0 46W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 22 (11 by maintainers)
Commits related to this issue
- Use rel symlink for CUDA due to nvidia docker issue: https://github.com/NVIDIA/libnvidia-container/issues/117 — committed to xkszltl/Roaster by xkszltl 4 years ago
I think there is some confusion.

libnvidia-container doesn't look at the container image at all. It takes the nvidia driver libraries available on the host and bind-mounts them into the container. The set of driver libraries it bind-mounts are all of the form libnvidia-*.so.<driver_version>. Once bind-mounted, it runs ldconfig over these libraries to autogenerate the libnvidia-*.so.1 symlinks to them inside the container.

As far as I understand it, you have both CUDA 10.2 and the nvidia 440 driver installed on the host. The CUDA installation on the host is actually irrelevant, as libnvidia-container will only inject libraries from the 440 driver. The full set of (possible) libraries can be seen here (https://github.com/NVIDIA/libnvidia-container/blob/master/src/nvc_info.c#L75).

With the container image you are using (i.e. nvidia/cuda:11.1-base-centos7), the libraries you will get injected are the utility_libs and the compute_libs from that list. This happens because the container image sets the environment variable NVIDIA_DRIVER_CAPABILITIES=compute,utility, as described here: https://github.com/NVIDIA/nvidia-container-runtime#nvidia_driver_capabilities
If you install something inside the image that "overrides" these bind-mounted libraries from the host, then that is something completely out of libnvidia-container's control. It only takes what it sees on the host and injects it into the container – it does nothing with what is installed inside the container image.

From what I understand, you are trying to install cuda-toolkit-11-1 and cuda-compat-11-1 inside the container (the latter of which will attempt to override the bind-mounted libraries for libcuda and libnvidia-ptxjitcompiler).

In theory, this should be possible, so long as you always run the container on a host with a 440+ driver installed (so you get the prerequisite 440 driver libraries bind-mounted into your container) and you ensure that the compat libraries "override" the ones injected from the 440 driver.
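One way to sanity-check whether that override actually took effect is to ask the dynamic linker inside the running container what it resolves. A sketch, assuming the cuda_jump image built from the repro above (the exact list printed depends on what is installed in the image):

# sudo docker run --rm --gpus all cuda_jump sh -c 'ldconfig -p | grep -E "libcuda|libnvidia-ptxjitcompiler"'
# sudo docker run --rm --gpus all cuda_jump readlink -f /usr/local/cuda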
However, it is not up to libnvidia-container to ensure that this "override" is done properly inside the container. libnvidia-container simply injects what it sees on the host, and nothing more.

libnvidia-container v1.3.1 has now been released. Please try it out and confirm that it resolves your problem.

The solution is actually this:
Oh right, yes, libnvidia-container is the one who runs ldconfig from within the container.

And I actually misspoke before. I apologise. libnvidia-container does, in fact, inspect the container image for the compat libraries. It doesn't look at the image for any other libraries, but it special-cases the compat libraries so that it can pull them up into standard library paths and make sure they are loaded when it makes its call to ldconfig.

I had forgotten about this detail, and apologise again for the confusion.
What this means, however, is that the issue you are reporting here is, in fact, a bug in libnvidia-container.

libnvidia-container is hard-coded to look under /usr/local/cuda/compat/ for the compatibility libraries. So long as this path only points through relative symlinks, libnvidia-container is able to resolve the libs contained underneath it. This is true, for example, when the symlink setup is /usr/local/cuda --> ./cuda-11.1.

However, libnvidia-container is not able to resolve absolute symlinks inside the container (e.g. /usr/local/cuda --> /etc/alternatives/cuda), because it doesn't actually inspect the symlink and prepend the rootfs of wherever docker has unpacked the container image. This results in libnvidia-container thinking that /usr/local/cuda --> /etc/alternatives/cuda is a broken symlink and stopping the search for the compat libraries down that path.

The relevant code is here: https://github.com/NVIDIA/libnvidia-container/blob/master/src/nvc_container.c#L195
This ultimately results in libnvidia-container creating a blank file for the compat lib, rather than copying it over.

You can see this in your (broken) example with:

vs. the working example with:

Also, manually running ldconfig in the broken example will result in:

I don't have a good fix for this off of the top of my head, but at least we've gotten to the root of the problem, and identified that it is in fact an issue with libnvidia-container.

Thanks for pushing me to explain this in detail, or I never would have gotten to the root cause.

We will work on a fix for this in the next release.
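Until that release, a workaround consistent with the relative-symlink behavior described above (and with the commit referenced in this issue) is to re-point /usr/local/cuda at a relative target after installing the toolkit. A sketch for the Dockerfile:

# after cuda-toolkit-11-1 has set up the absolute alternatives link, overwrite it with a
# relative one so that libnvidia-container can follow it down to /usr/local/cuda/compat/
RUN ln -sfT cuda-11.1 /usr/local/cuda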