kata-containers: failed to detect NVIDIA devices in kata container
Description of problem
I followed this doc to try to use the NVIDIA driver in a Kata container with pass-through mode, but the nvidia kernel module cannot be loaded because the prestart hook fails with errors. Prestart hook command log:
-- WARNING, the following logs are for debugging purposes only --
I0506 06:40:20.504472 119 nvc.c:376] initializing library context (version=1.9.0, build=5e135c17d6dbae861ec343e9a8d3a0d2af758a4f)
I0506 06:40:20.504520 119 nvc.c:350] using root /
I0506 06:40:20.504523 119 nvc.c:351] using ldcache /etc/ld.so.cache
I0506 06:40:20.504526 119 nvc.c:352] using unprivileged user 65534:65534
I0506 06:40:20.504537 119 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0506 06:40:20.504587 119 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0506 06:40:20.505563 119 nvc.c:258] failed to detect NVIDIA devices
I0506 06:40:20.505715 123 nvc.c:278] loading kernel module nvidia
E0506 06:40:20.506795 123 nvc.c:280] could not load kernel module nvidia
I0506 06:40:20.506799 123 nvc.c:296] loading kernel module nvidia_uvm
E0506 06:40:20.507786 123 nvc.c:298] could not load kernel module nvidia_uvm
I0506 06:40:20.507790 123 nvc.c:305] loading kernel module nvidia_modeset
E0506 06:40:20.508728 123 nvc.c:307] could not load kernel module nvidia_modeset
I0506 06:40:20.508904 124 rpc.c:71] starting driver rpc service
I0506 06:40:20.518913 119 rpc.c:135] driver rpc service terminated with signal 15
I0506 06:40:20.518945 119 nvc.c:430] shutting down library context
I installed the GPU driver and nvidia-container-toolkit correctly in the guest OS and enabled guest_hook_path in Kata's configuration.toml: guest_hook_path = "/usr/share/oci/hooks".
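As a sanity check: Kata only picks up guest hooks from a subdirectory named after the hook type under guest_hook_path, and the hook must be executable. Assuming that layout (the script filename below is illustrative):
# run inside the guest image; prestart hooks live under <guest_hook_path>/prestart/
ls -l /usr/share/oci/hooks/prestart/
chmod +x /usr/share/oci/hooks/prestart/nvidia-container-toolkit-hook.sh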
Here is the prestart script:
/usr/bin/nvidia-container-toolkit -config="/etc/nvidia-container-runtime/config.toml" prestart
nvidia-container-runtime config:
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
accept-nvidia-visible-devices-as-volume-mounts = true
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
debug = "/tmp/nvidia-container-toolkit.log"
ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
#ldconfig = "@/sbin/ldconfig"
[nvidia-container-runtime]
experimental = true
debug = "/tmp/nvidia-container-runtime.log"
Running nvidia-smi in the container:
root@kata-79cf45d755-6jzc2:/# nvidia-smi
bash: nvidia-smi: command not found
No NVIDIA module or driver files can be found in the container either.
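Consistent with the failed hook, spot checks for injected driver files in the container come back empty, e.g.:
# all of these are missing in the container when the hook has failed
ls /dev/nvidia* 2>/dev/null
ls /usr/bin/nvidia-smi /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 2>/dev/null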
Running nvidia-smi -L in the VM:
root@kata-79cf45d755-6jzc2:/tmp# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-a6f07a02-ad51-6e97-025e-8114c2606f4e)
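So the driver itself is healthy in the guest; the loaded modules and device nodes can be confirmed directly:
# run in the guest VM
lsmod | grep nvidia
ls -l /dev/nvidiactl /dev/nvidia0
cat /proc/driver/nvidia/version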
To load the modules directly in the VM, I ran the following command and got the log below:
nvidia-container-cli --load-kmods --debug=/tmp/zxt-dd.log configure --ldconfig="@/sbin/ldconfig" --device=all --compute --utility --pid=105 /run/kata-containers/45b3f6083a2e812428bfa3a4ec0af1e94ad3a6e0840c2bad551ff380a6a682fe/rootfs
-- WARNING, the following logs are for debugging purposes only --
I0506 10:14:19.929104 218 nvc.c:376] initializing library context (version=1.9.0, build=5e135c17d6dbae861ec343e9a8d3a0d2af758a4f)
I0506 10:14:19.929237 218 nvc.c:350] using root /
I0506 10:14:19.929247 218 nvc.c:351] using ldcache /etc/ld.so.cache
I0506 10:14:19.929256 218 nvc.c:352] using unprivileged user 65534:65534
I0506 10:14:19.929287 218 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0506 10:14:19.929435 218 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0506 10:14:19.934837 219 nvc.c:278] loading kernel module nvidia
I0506 10:14:19.935190 219 nvc.c:282] running mknod for /dev/nvidiactl
I0506 10:14:19.935272 219 nvc.c:286] running mknod for /dev/nvidia0
I0506 10:14:19.935320 219 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0506 10:14:19.948310 219 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0506 10:14:19.948482 219 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0506 10:14:19.951529 219 nvc.c:296] loading kernel module nvidia_uvm
I0506 10:14:19.951556 219 nvc.c:300] running mknod for /dev/nvidia-uvm
I0506 10:14:19.951671 219 nvc.c:305] loading kernel module nvidia_modeset
I0506 10:14:19.951686 219 nvc.c:309] running mknod for /dev/nvidia-modeset
I0506 10:14:19.952163 220 rpc.c:71] starting driver rpc service
I0506 10:14:20.617708 224 rpc.c:71] starting nvcgo rpc service
I0506 10:14:20.620537 218 nvc_container.c:240] configuring container with 'compute utility supervised'
I0506 10:14:20.620673 218 nvc.c:430] shutting down library context
I0506 10:14:20.620792 224 rpc.c:95] terminating nvcgo rpc service
I0506 10:14:20.621045 218 rpc.c:135] nvcgo rpc service terminated successfully
I0506 10:14:20.708399 220 rpc.c:95] terminating driver rpc service
I0506 10:14:20.708505 218 rpc.c:135] driver rpc service terminated successfully
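After this manual run succeeds, the injected user-space driver files should be visible under the container rootfs used in the command above, e.g.:
# same rootfs path as passed to 'configure' above
ls /run/kata-containers/45b3f6083a2e812428bfa3a4ec0af1e94ad3a6e0840c2bad551ff380a6a682fe/rootfs/usr/bin/nvidia-smi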
Expected result
The NVIDIA driver can be used inside the Kata container.
Actual result
The NVIDIA driver cannot be used inside the Kata container.
Further information
root@kata-79cf45d755-6jzc2:/tmp# uname -a
Linux kata-79cf45d755-6jzc2 5.10.25-nvidia-gpu #1 SMP Wed Apr 27 19:44:45 CST 2022 x86_64 GNU/Linux
I created the Kata pod by applying a Kubernetes deployment YAML that uses nvidia-device-plugin, rather than with the ctr command.
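For reference, checks like these (node name is illustrative) confirm the device-plugin path, i.e. that the node advertises the GPU and the pod requested it:
kubectl describe node <gpu-node> | grep nvidia.com/gpu
kubectl get pod kata-79cf45d755-6jzc2 -o jsonpath='{.spec.containers[0].resources}'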
About this issue
- State: open
- Created 2 years ago
- Comments: 57 (22 by maintainers)
You need to patch kata-runtime. I am working on this (PR coming soon); the GPU is not working correctly at this point. You would need to build a new kata-runtime with the following changes. Can you please try with (as documented):
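The referenced changes are not reproduced here, but as a rough sketch, rebuilding the runtime from a patched checkout follows the usual upstream repo layout:
# build and install a patched kata-runtime; the branch carrying the fix is not shown here
git clone https://github.com/kata-containers/kata-containers
cd kata-containers/src/runtime
make && sudo make install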
A prestart hook has more hidden arguments passed by the runtime, not only the prestart argument. The hook needs to know where the container artifacts (json, rootfs, etc.) are. Update the hook execution file and try again:
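A minimal sketch of such a hook wrapper, assuming the runtime writes the OCI container state to the hook's stdin and may append extra arguments (exact flags may differ per toolkit version):
#!/bin/bash
# exec preserves stdin (the OCI state JSON) and "$@" forwards whatever
# arguments the runtime passes, instead of hard-coding only 'prestart'
exec /usr/bin/nvidia-container-toolkit \
    -config=/etc/nvidia-container-runtime/config.toml \
    -debug "$@"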
Another point: you need to use a container that activates the hook. Try with the image I posted here.
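Given accept-nvidia-visible-devices-as-volume-mounts = true in the config above, "activating" the hook means the container must request GPUs either through the NVIDIA_VISIBLE_DEVICES environment variable or through a volume mounted under /var/run/nvidia-container-devices. A quick in-container check (default paths assumed):
env | grep NVIDIA_VISIBLE_DEVICES
ls /var/run/nvidia-container-devices 2>/dev/null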