k8s-device-plugin: Getting nvidia-device-plugin container CrashLoopBackOff | version v0.14.0 | container runtime : containerd

Getting a nvidia-device-plugin container CrashLoopBackOff error. Using k8s-device-plugin version v0.14.0 with containerd as the container runtime. The same setup works fine with dockerd as the container runtime.

Pod ErrorLog:

I0524 08:28:03.907585       1 main.go:256] Retreiving plugins.
W0524 08:28:03.908010       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0524 08:28:03.908084       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0524 08:28:03.908113       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0524 08:28:03.908121       1 factory.go:115] Incompatible platform detected
E0524 08:28:03.908130       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0524 08:28:03.908136       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0524 08:28:03.908142       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0524 08:28:03.908149       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0524 08:28:03.915664       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
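
The E-level lines are the important ones: libnvidia-ml.so.1 is not visible inside the plugin container, which means the container was not started through the nvidia runtime. On a plain containerd node that wiring is normally done with the NVIDIA Container Toolkit's configure command, roughly as follows (a sketch, assuming the toolkit is already installed on the GPU node; on k3s/RKE2 the config path differs, see the comments further down):

# assumes the NVIDIA Container Toolkit (nvidia-ctk) is already installed on the GPU node
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# check that the nvidia runtime entry was written (stock containerd path; k3s keeps its own copy)
grep -A3 'runtimes.nvidia' /etc/containerd/config.toml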

nvidia-smi output:

sh-4.2$ nvidia-smi
Wed May 24 08:57:00 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

About this issue

  • State: open
  • Created a year ago
  • Comments: 33 (7 by maintainers)

Most upvoted comments

Manually creating a RuntimeClass in the Kubernetes cluster helped me.

Manifest:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia 
handler: nvidia 

Docs: https://kubernetes.io/docs/concepts/containers/runtime-class/
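
The handler: nvidia value only works if containerd itself knows a runtime with that name. On a node configured with the NVIDIA Container Toolkit, the corresponding config.toml section typically looks like this (a sketch; the BinaryName path is the toolkit default and may differ on your distribution):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    # path to the NVIDIA wrapper around runc that injects the GPU devices and driver libraries
    BinaryName = "/usr/bin/nvidia-container-runtime"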

@simsicon

My daemonset:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvdp-nvidia-device-plugin
  namespace: nvidia-device-plugin
  labels:
    app.kubernetes.io/instance: nvdp
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: nvidia-device-plugin
    app.kubernetes.io/version: 0.14.0
    helm.sh/chart: nvidia-device-plugin-0.14.0
  annotations:
    deprecated.daemonset.template.generation: '1'
    meta.helm.sh/release-name: nvdp
    meta.helm.sh/release-namespace: nvidia-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: nvdp
      app.kubernetes.io/name: nvidia-device-plugin
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: nvdp
        app.kubernetes.io/name: nvidia-device-plugin
    spec:
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
            type: ''
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
          env:
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/gpu-node
                    operator: In
                    values:
                      - 'true'
      schedulerName: default-scheduler
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        - key: node.kubernetes.io/gpu-node
          operator: Equal
          value: 'true'
          effect: NoSchedule
      priorityClassName: system-node-critical
      runtimeClassName: nvidia
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0

I believe I have figured it out. At least in my case.

If config.toml is modified directly, it will be overwritten the next time the k3s service restarts; to avoid this, use a config.toml.tmpl file instead. Sample template: https://github.com/skirsten/k3s/blob/f78a66b44e2ecbef64122be99a9aa9118a49d7e9/pkg/agent/templates/templates_linux.go#L10, and add default_runtime_name = "nvidia" (see the excerpt below).
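
On k3s the rendered file lives at /var/lib/rancher/k3s/agent/etc/containerd/config.toml and is regenerated from config.toml.tmpl on every service restart. The line the template needs to end up rendering is roughly this (a sketch; the section name is the containerd CRI one quoted later in this thread and may differ on newer containerd/k3s releases):

# /var/lib/rancher/k3s/agent/etc/containerd/config.toml (generated from config.toml.tmpl)
[plugins."io.containerd.grpc.v1.cri".containerd]
  # make the nvidia runtime the default for every container on this node
  default_runtime_name = "nvidia"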

Most of the tutorials out there suggest a k3d template instead of the k3s template. I thought that was wrong and assumed that the k3s service should “detect” the nvidia container runtime. It does, but it does not make it the default one.

This template seems to work: https://github.com/skirsten/k3s/blob/f78a66b44e2ecbef64122be99a9aa9118a49d7e9/pkg/agent/templates/templates_linux.go#L10

But a simpler solution, in case you don't want to force every pod to use the nvidia runtime, is to add runtimeClassName: nvidia to the DaemonSet in https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml (see the excerpt below); after that, everything starts working just fine.
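
For reference, the change amounts to a single field in the pod template spec of that DaemonSet (an illustrative excerpt, not the full manifest; only runtimeClassName is added, the other fields are as shipped upstream):

# excerpt: pod template spec of nvidia-device-plugin.yml with the added field
spec:
  template:
    spec:
      runtimeClassName: nvidia        # <- the added line
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1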

Are you ALSO specifying the nvidia runtime class for the device plugin containers?

@zachfi since there is no plugins."io.containerd.grpc.v1.cri".containerd.default_runtime_name = "nvidia" entry, the nvidia runtime is not the default runtime. As such, you would also need to launch the device plugin with runtimeClassName: nvidia specified. This ensures that the containers of the device plugin are started using the nvidia-container-runtime, which injects the required devices. (Please also confirm that NVIDIA_VISIBLE_DEVICES=all is set in this container too; see the sketch below.)
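
A minimal sketch of that env entry in the device plugin container spec, alongside the MIG variable already present in the DaemonSet posted above:

env:
  - name: NVIDIA_VISIBLE_DEVICES     # expose all GPUs inside the plugin container
    value: all
  - name: NVIDIA_MIG_MONITOR_DEVICES # already present in the DaemonSet above
    value: all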

The issue the benchmark container is seeing is that the device plugin is not reporting any nvidia.com/gpu resources to the Kubelet, and as such the pod cannot be scheduled.