gpu-operator: Failed to get sandbox runtime: no runtime for nvidia is configured
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn’t apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
- Are you running Kubernetes v1.13+?
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- Do you have i2c_core and ipmi_msghandler loaded on the nodes?
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
1. Issue or feature description
nov 02 18:00:58 beck containerd[10237]: time="2022-11-02T18:00:58.738797825+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:gpu-feature-discovery-qfjgk,Uid:02c7d4ad-db02-4145-846b-616a94416008,Namespace:gpu-operator,Attempt:2,} failed, error" error="failed to get sandbox runtime: no runtime for \"nvidia\" is configured"
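This error means containerd's CRI plugin has no runtime handler named "nvidia" in the config it actually loaded. A quick way to check what is in the live config is sketched below; the expected entry shown in the comments is the one the container-toolkit typically writes, and the BinaryName path is the operator's default, so it may differ on your nodes.

```sh
# Inspect the merged containerd config for an "nvidia" runtime handler.
# The error above means this block is missing from the loaded config.
containerd config dump | grep -A 4 'runtimes\.nvidia'

# Expected output looks roughly like:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#       BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
```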
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
- kubernetes pods status: kubectl get pods --all-namespaces
- kubernetes daemonset status: kubectl get ds --all-namespaces
- If a pod/ds is in an error state or pending state: kubectl describe pod -n NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state: kubectl logs -n NAMESPACE POD_NAME
- Output of running a container on the GPU machine: docker run -it alpine echo foo
- Docker configuration file: cat /etc/docker/daemon.json
- Docker runtime configuration: docker info | grep runtime
- NVIDIA shared directory: ls -la /run/nvidia
- NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
- NVIDIA driver directory: ls -la /run/nvidia/driver
- kubelet logs: journalctl -u kubelet > kubelet.logs
(base) beck@beck:/$ ls -la /run/nvidia/
total 4
drwxr-xr-x 4 root root 100 nov 2 18:48 .
drwxr-xr-x 39 root root 1140 nov 2 18:47 ..
drwxr-xr-x 2 root root 40 nov 2 17:59 driver
-rw-r--r-- 1 root root 7 nov 2 18:48 toolkit.pid
drwxr-xr-x 2 root root 80 nov 2 18:48 validations
Driver folder is empty:
(base) beck@beck:/$ ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 2 root root 40 nov 2 17:59 .
drwxr-xr-x 4 root root 80 nov 2 18:48 ..
@denissabramovs this is a wild guess: are you using containerd 1.6.9? I believe we had problems with this version and the operator. We downgraded to containerd 1.6.8 and things started working again.
Update: I followed the instructions here to install containerd, and I believe the critical part is enabling the systemd cgroup driver. Since doing this, I am able to schedule the pods and workloads.
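For anyone following along, a minimal sketch of that change, assuming the stock /etc/containerd/config.toml with the v2 CRI config layout and that SystemdCgroup is currently set to false (the same edit the upstream Kubernetes container runtime docs describe):

```sh
# Enable the systemd cgroup driver for the runc runtime and restart containerd.
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd

# Verify the change took effect in the loaded config:
containerd config dump | grep SystemdCgroup
```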
There is another issue with containerd: https://github.com/containerd/containerd/issues/7843
If containerd is restarted (version 1.6.9 and above), most pods are restarted. Together with the nvidia-container-toolkit pod this ends in an endless restart loop: the toolkit tries to restart containerd, which restarts the toolkit and driver pods, and everything loops again. There is a fix for containerd, but it may not have landed everywhere yet.
@tuxtof, I think you are hitting exactly this issue.
If you aren't able to reproduce it, please ping me and I'll try to reproduce it locally again. Then we could catch the issue and possibly put a patch together. In any case, thank you guys.
Totally unrelated to the gpu operator, but this fixed my problem with getting the spin wasm shim working on a Rocky 8 cluster. Many thanks!
@msherm2 did you configure the container-toolkit correctly for RKE2 as documented here?
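For reference, the RKE2 setup from those docs boils down to pointing the container-toolkit at RKE2's containerd config and socket. A rough sketch with Helm values follows; the paths are the ones commonly documented for RKE2 and should be verified against the linked docs for your version.

```sh
# Install/upgrade the operator with the toolkit pointed at RKE2's containerd.
# Paths below are the usual RKE2 defaults; adjust if your agent config differs.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set "toolkit.env[0].name=CONTAINERD_CONFIG" \
  --set "toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl" \
  --set "toolkit.env[1].name=CONTAINERD_SOCKET" \
  --set "toolkit.env[1].value=/run/k3s/containerd/containerd.sock" \
  --set "toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS" \
  --set "toolkit.env[2].value=nvidia"
```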
Hi @denissabramovs @wjentner. We just released v22.9.1. This includes the workaround mentioned above for resolving the containerd issues. Please give it a try and let us know if there are any issues.
Issue diagnosed and workaround MR can be found here: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/568
Oh wow! @wjentner, you were actually right. I re-enabled the above-mentioned toolkit and, after the downgrade, it finished without problems and all pods are up and running now!
You can also disable the toolkit by running kubectl edit clusterpolicy and setting toolkit.enabled=false. It looks like you already have nvidia-container-runtime configured on the host and the containerd config updated manually?
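The same change can be made non-interactively; a sketch assuming the default ClusterPolicy resource name cluster-policy created by the Helm chart:

```sh
# Disable the container-toolkit component via the ClusterPolicy.
# "cluster-policy" is the default name; confirm with `kubectl get clusterpolicy`.
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"toolkit":{"enabled":false}}}'
```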