gpu-operator: nvidia-device-plugin-daemonset toolkit validation fails with containerd

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRI-O (>= 1.13)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)
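
The items above can be checked on a node with a few commands; a rough sketch (the module names and CRD check come from the checklist itself, the version commands are generic):

  # Kubernetes and container runtime versions
  kubectl version
  containerd --version    # or: docker version / crio version

  # Kernel modules required on GPU nodes
  lsmod | grep -e i2c_core -e ipmi_msghandler

  # Verify the ClusterPolicy CRD was applied
  kubectl describe clusterpolicies --all-namespaces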

1. Issue or feature description

I am trying to use the GPU operator in a Kubernetes cluster created using cluster-api for Azure. After installing the operator, I ran into an issue where the nvidia-device-plugin-daemonset fails to come up: it crashes in the init container that runs a toolkit validation pod. On further inspection, I noticed it was failing with ImageInspectError. The event log:

Events:
  Type     Reason         Age                   From               Message
  ----     ------         ----                  ----               -------
  Normal   Scheduled      10m                   default-scheduler  Successfully assigned gpu-operator-resources/nvidia-device-plugin-daemonset-f99md to cl-gpu-md-0-f4gm6
  Warning  InspectFailed  10m (x3 over 10m)     kubelet            Failed to inspect image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2": rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
  Warning  Failed         10m (x3 over 10m)     kubelet            Error: ImageInspectError
  Normal   Pulling        9m57s                 kubelet            Pulling image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"
  Normal   Pulled         9m53s                 kubelet            Successfully pulled image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"
  Normal   Created        9m8s (x4 over 9m53s)  kubelet            Created container toolkit-validation
  Normal   Started        9m8s (x4 over 9m53s)  kubelet            Started container toolkit-validation
  Normal   Pulled         9m8s (x3 over 9m52s)  kubelet            Container image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" already present on machine
  Warning  BackOff        10s (x45 over 9m51s)  kubelet            Back-off restarting failed container

PS: I’m using containerd for container management
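
The InspectFailed event suggests the kubelet could not reach containerd's socket at that moment. A hypothetical debugging sketch for checking runtime health on the node, using the socket path from the error above:

  # Is containerd running, and is its socket present?
  systemctl status containerd
  ls -l /run/containerd/containerd.sock

  # Query the runtime over the same endpoint the kubelet uses
  crictl --runtime-endpoint unix:///run/containerd/containerd.sock info

  # Look at containerd logs around the failure window
  journalctl -u containerd --since "15 minutes ago"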

The VM type is the Azure NCv3 series.

2. Steps to reproduce the issue

  1. Create a k8s cluster with one worker node that has an NVIDIA GPU
  2. Once the nodes are ready, install the NVIDIA GPU operator using helm install --wait --generate-name nvidia/gpu-operator --set operator.defaultRuntime=containerd (expanded in the sketch after this list)
  3. Observe kubectl -n gpu-operator-resources get pods
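
Expanded, steps 2 and 3 look roughly like this (the helm repo add URL is an assumption about how the chart was fetched; the rest is taken from the steps above):

  # Add the NVIDIA chart repository (assumed; skip if already configured)
  helm repo add nvidia https://nvidia.github.io/gpu-operator
  helm repo update

  # Install the operator, declaring containerd as the default runtime
  helm install --wait --generate-name nvidia/gpu-operator \
    --set operator.defaultRuntime=containerd

  # Watch the operator-managed pods come up (or crash-loop, as above)
  kubectl -n gpu-operator-resources get pods -w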


Most upvoted comments

@shysank will double-check with @klueska and confirm.

Do you mean setting --set operator.defaultRuntime=runc?

No, after the operator is uninstalled, the toolkit resets the default runtime back to runc in config.toml. You can confirm this has happened and restart containerd just to be sure it takes effect.
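
A sketch of how to confirm that reset on the node (the config path is containerd's default location and may differ on your install):

  # Check which runtime containerd now treats as the default
  grep -n 'default_runtime_name' /etc/containerd/config.toml
  # expect: default_runtime_name = "runc"

  # Restart containerd so the restored config takes effect
  sudo systemctl restart containerd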