k8s-device-plugin: Getting nvidia-device-plugin container CrashLoopBackOff | version v0.14.0 | container runtime: containerd
The nvidia-device-plugin container is going into CrashLoopBackOff. I am using k8s-device-plugin version v0.14.0 with containerd as the container runtime. The same setup works fine when the container runtime is dockerd.
Pod error log:
I0524 08:28:03.907585 1 main.go:256] Retreiving plugins.
W0524 08:28:03.908010 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0524 08:28:03.908084 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0524 08:28:03.908113 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0524 08:28:03.908121 1 factory.go:115] Incompatible platform detected
E0524 08:28:03.908130 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0524 08:28:03.908136 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0524 08:28:03.908142 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0524 08:28:03.908149 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0524 08:28:03.915664 1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
nvidia-smi output:
sh-4.2$ nvidia-smi
Wed May 24 08:57:00 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 25C P8 9W / 70W| 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
About this issue
- State: open
- Created a year ago
- Comments: 33 (7 by maintainers)
Manual RuntimeClass creation in the Kubernetes cluster helped me.
Manifest: see the sketch below.
Docs: https://kubernetes.io/docs/concepts/containers/runtime-class/
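A minimal RuntimeClass sketch along the lines of the linked docs; the handler name nvidia is an assumption and must match the runtime name configured for containerd by the NVIDIA Container Toolkit:

```yaml
# RuntimeClass that lets pods opt in to the nvidia runtime via runtimeClassName: nvidia
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the runtime name in the containerd config
```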
@simsicon
My daemonset: see the sketch below.
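Not the commenter's exact manifest, but an abbreviated sketch of the v0.14.x nvidia-device-plugin DaemonSet with runtimeClassName: nvidia added (tolerations, security context, and other fields trimmed for brevity):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia          # start the plugin itself under the nvidia runtime
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all                    # make all GPUs visible to the plugin container
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```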
I believe I have figured it out, at least in my case.
Most of the tutorials out there suggest a k3d template instead of the k3s template. I thought that was wrong and assumed that the k3s service should “detect” the nvidia container runtime. It does, but it does not make it the default runtime.
This template seems to work: https://github.com/skirsten/k3s/blob/f78a66b44e2ecbef64122be99a9aa9118a49d7e9/pkg/agent/templates/templates_linux.go#L10
But a simpler solution, in case you don’t want to force every pod to use the nvidia runtime, is to add
runtimeClassName: nvidia
to https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml, and after that everything starts to work just fine.

Are you ALSO specifying the nvidia runtime class for the device plugin containers?

@zachfi since there is no
plugins."io.containerd.grpc.v1.cri".containerd.default_runtime_name = "nvidia"
entry, the nvidia runtime is not the default runtime. As such you would also need to launch the Device Plugin specifying runtimeClassName: nvidia. This ensures that the containers of the device plugin are started using the nvidia-container-runtime, injecting the required devices. (Please also confirm that NVIDIA_VISIBLE_DEVICES=all is set in this container too.)

The issue that the benchmark container is seeing is because the device plugin is not reporting any nvidia.com/gpu resources to the Kubelet, and as such the pod cannot be scheduled.
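To illustrate the scheduling point: once the device plugin comes up under the nvidia runtime class and advertises nvidia.com/gpu, a pod that requests the resource can be scheduled. A minimal test-pod sketch (the image name is just an example; runtimeClassName is needed because nvidia is not the default runtime here):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  runtimeClassName: nvidia       # opt in to the nvidia runtime
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.1.0-base-ubuntu22.04   # example CUDA image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1        # the resource advertised by the device plugin
```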