gpu-operator: Failed to get sandbox runtime: no runtime for nvidia is configured
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn’t apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
- Are you running Kubernetes v1.13+?
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- Do you have i2c_core and ipmi_msghandler loaded on the nodes?
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
1. Issue or feature description
nov 02 18:00:58 beck containerd[10237]: time="2022-11-02T18:00:58.738797825+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:gpu-feature-discovery-qfjgk,Uid:02c7d4ad-db02-4145-846b-616a94416008,Namespace:gpu-operator,Attempt:2,} failed, error" error="failed to get sandbox runtime: no runtime for \"nvidia\" is configured"
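This error means containerd's CRI plugin has no runtime handler named "nvidia" in the config it actually loaded. A quick way to check what is in the live config is sketched below; the expected entry shown in the comments is the one the container-toolkit typically writes, and the BinaryName path is the operator's default, so it may differ on your nodes.

```sh
# Inspect the merged containerd config for an "nvidia" runtime handler.
# The error above means this block is missing from the loaded config.
containerd config dump | grep -A 4 'runtimes\.nvidia'

# Expected output looks roughly like:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#       BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
```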
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
- kubernetes pods status: kubectl get pods --all-namespaces
- kubernetes daemonset status: kubectl get ds --all-namespaces
- If a pod/ds is in an error state or pending state: kubectl describe pod -n NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state: kubectl logs -n NAMESPACE POD_NAME
- Output of running a container on the GPU machine: docker run -it alpine echo foo
- Docker configuration file: cat /etc/docker/daemon.json
- Docker runtime configuration: docker info | grep runtime
- NVIDIA shared directory: ls -la /run/nvidia
- NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
- NVIDIA driver directory: ls -la /run/nvidia/driver
- kubelet logs: journalctl -u kubelet > kubelet.logs
(base) beck@beck:/$ ls -la /run/nvidia/
total 4
drwxr-xr-x 4 root root 100 nov 2 18:48 .
drwxr-xr-x 39 root root 1140 nov 2 18:47 ..
drwxr-xr-x 2 root root 40 nov 2 17:59 driver
-rw-r--r-- 1 root root 7 nov 2 18:48 toolkit.pid
drwxr-xr-x 2 root root 80 nov 2 18:48 validations
Driver folder is empty:
(base) beck@beck:/$ ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 2 root root 40 nov 2 17:59 .
drwxr-xr-x 4 root root 80 nov 2 18:48 ..
@denissabramovs this is a wild guess: are you using containerd 1.6.9? I believe we had problems with this version and the operator. We downgraded to containerd 1.6.8 and things started working again.
Update: I followed the instructions here to install containerd, and I believe the critical part is enabling the systemd cgroup driver. Since doing this, I am able to schedule the pods and workloads.
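For anyone following along, a minimal sketch of that change, assuming the stock /etc/containerd/config.toml with the v2 CRI config layout and that SystemdCgroup is currently set to false (the same edit the upstream Kubernetes container runtime docs describe):

```sh
# Enable the systemd cgroup driver for the runc runtime and restart containerd.
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd

# Verify the change took effect in the loaded config:
containerd config dump | grep SystemdCgroup
```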
There is another issue with containerd: https://github.com/containerd/containerd/issues/7843
If containerd is restarted (version 1.6.9 and above), most pods are restarted. Together with the nvidia-container-toolkit pod this ends in an endless restart loop: the toolkit tries to restart containerd, which restarts the toolkit and driver pods, and everything loops again. There is a fix for containerd, but it may not have landed everywhere yet.
@tuxtof, I think you are hitting exactly this issue.
If you aren't able to reproduce it, please ping me and I'll try to reproduce it locally again. Then we could catch the issue and possibly put a patch together. In any case, thank you guys.
Totally unrelated to the gpu operator, but this fixed my problem with getting the spin wasm shim working on a Rocky 8 cluster. Many thanks!
@msherm2 did you configure the container-toolkit correctly for RKE2 as documented here?
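For reference, the RKE2 setup from those docs boils down to pointing the container-toolkit at RKE2's containerd config and socket. A rough sketch with Helm values follows; the paths are the ones commonly documented for RKE2 and should be verified against the linked docs for your version.

```sh
# Install/upgrade the operator with the toolkit pointed at RKE2's containerd.
# Paths below are the usual RKE2 defaults; adjust if your agent config differs.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set "toolkit.env[0].name=CONTAINERD_CONFIG" \
  --set "toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl" \
  --set "toolkit.env[1].name=CONTAINERD_SOCKET" \
  --set "toolkit.env[1].value=/run/k3s/containerd/containerd.sock" \
  --set "toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS" \
  --set "toolkit.env[2].value=nvidia"
```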
Hi @denissabramovs @wjentner. We just released v22.9.1. This includes the workaround mentioned above for resolving the containerd issues. Please give it a try and let us know if there are any issues.
Issue diagnosed and workaround MR can be found here: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/568
Oh wow! @wjentner, you were actually right. I re-enabled the above-mentioned toolkit and, after the downgrade, it finished without problems and all pods are up and running now!
You can also disable the toolkit by running kubectl edit clusterpolicy and setting toolkit.enabled=false. It looks like you already have nvidia-container-runtime configured on the host and the containerd config updated manually?
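The same change can be made non-interactively; a sketch assuming the default ClusterPolicy resource name cluster-policy created by the Helm chart:

```sh
# Disable the container-toolkit component via the ClusterPolicy.
# "cluster-policy" is the default name; confirm with `kubectl get clusterpolicy`.
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"toolkit":{"enabled":false}}}'
```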