gpu-operator: nvidia-device-plugin-daemonset toolkit validation fails with containerd
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
- Are you running Kubernetes v1.13+?
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- Do you have i2c_core and ipmi_msghandler loaded on the nodes? (verification commands below)
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
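To verify the last two checklist items, something like the following can be used (a minimal sketch; it assumes shell access to the GPU node and a default gpu-operator install):

# On the GPU node: check that the required kernel modules are loaded
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# From the cluster: confirm the ClusterPolicy resource exists and was applied
kubectl describe clusterpolicies --all-namespaces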
1. Issue or feature description
I am trying to use the GPU operator in a Kubernetes cluster created with Cluster API for Azure. After installing the operator, I ran into an issue where the nvidia-device-plugin-daemonset fails to come up: its init container, which runs a validation pod, keeps crashing. On further inspection, I noticed it was failing with ImageInspectError. The event log:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned gpu-operator-resources/nvidia-device-plugin-daemonset-f99md to cl-gpu-md-0-f4gm6
Warning InspectFailed 10m (x3 over 10m) kubelet Failed to inspect image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2": rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Warning Failed 10m (x3 over 10m) kubelet Error: ImageInspectError
Normal Pulling 9m57s kubelet Pulling image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"
Normal Pulled 9m53s kubelet Successfully pulled image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"
Normal Created 9m8s (x4 over 9m53s) kubelet Created container toolkit-validation
Normal Started 9m8s (x4 over 9m53s) kubelet Started container toolkit-validation
Normal Pulled 9m8s (x3 over 9m52s) kubelet Container image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" already present on machine
Warning BackOff 10s (x45 over 9m51s) kubelet Back-off restarting failed container
PS: I’m using containerd for container management. The VM type is Azure NCv3 series.
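Since the ImageInspectError comes from the kubelet failing to reach /run/containerd/containerd.sock, a quick sanity check is whether containerd itself is healthy on the worker node. A sketch of what I ran, assuming the default socket path and that crictl is installed (adjust paths for your setup):

# Check the containerd service and its recent logs on the worker node
systemctl status containerd
journalctl -u containerd -n 50 --no-pager

# Confirm the CRI endpoint answers on the socket the kubelet is dialing
crictl --runtime-endpoint unix:///run/containerd/containerd.sock info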
2. Steps to reproduce the issue
- Create a k8s cluster with one worker node that has an NVIDIA GPU
- Once the nodes are ready, install the NVIDIA GPU operator with:
helm install --wait --generate-name nvidia/gpu-operator --set operator.defaultRuntime=containerd
- Observe the gpu-operator-resources pods (see the watch commands below):
kubectl -n gpu-operator-resources get pods
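For the last step, this is roughly what I watch. The label selector and init-container name are assumptions based on the daemonset and event log above; adjust them to the labels actually set on your pods:

# Watch the operator-managed pods come up (or crash-loop)
kubectl -n gpu-operator-resources get pods -w

# Inspect the failing device-plugin pod and its toolkit-validation init container
kubectl -n gpu-operator-resources describe pod -l app=nvidia-device-plugin-daemonset
kubectl -n gpu-operator-resources logs -l app=nvidia-device-plugin-daemonset -c toolkit-validation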
@shysank will double check with @klueska and confirm on this.
No, after the operator is uninstalled, the toolkit will reset the default runtime back to runc in config.toml. You can confirm this has happened and restart containerd just to be sure it takes effect.
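A minimal way to confirm that on the node, assuming containerd's config lives at the default /etc/containerd/config.toml:

# Check which runtime containerd will use by default after the operator is removed
grep -A2 'default_runtime_name' /etc/containerd/config.toml

# Restart containerd so the reverted config takes effect, then verify the service is up
sudo systemctl restart containerd
sudo systemctl status containerd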