gpu-operator: CUDA validators crashlooping while other cuda containers run fine

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn’t apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description:

The CUDA validator pods in my cluster are in Init:CrashLoopBackOff while other CUDA vector-addition workloads run completely fine.

$ kubectl get po 
NAME                                       READY   STATUS                  RESTARTS        AGE
gpu-feature-discovery-cqjqx                1/1     Running                 0               28m
gpu-feature-discovery-x4qh2                1/1     Running                 0               28m
gpu-operator-77787587cf-cxnzl              1/1     Running                 0               28m
nvidia-container-toolkit-daemonset-bk28j   1/1     Running                 0               28m
nvidia-container-toolkit-daemonset-qvftc   1/1     Running                 0               28m
nvidia-cuda-validator-ccn65                0/1     Init:CrashLoopBackOff   5 (23s ago)     3m29s
nvidia-cuda-validator-p7sgd                0/1     Init:CrashLoopBackOff   5 (16s ago)     3m18s
nvidia-dcgm-exporter-p5wrc                 1/1     Running                 0               28m
nvidia-dcgm-exporter-rvnz6                 1/1     Running                 0               28m
nvidia-device-plugin-daemonset-2bfmt       1/1     Running                 0               28m
nvidia-device-plugin-daemonset-pvphw       1/1     Running                 0               28m
nvidia-operator-validator-5qfx2            0/1     Init:2/4                5 (4m55s ago)   28m
nvidia-operator-validator-nc9n6            0/1     Init:2/4                5 (4m40s ago)   28m

On closer inspection, it's the vectorAdd run that's giving us an issue. Pulling the previous logs of the failing init container, we get:

$ kubectl logs nvidia-cuda-validator-ccn65 -c cuda-validation --previous 
Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!
[Vector addition of 50000 elements]
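
As a side note, one quick way to rule out the GPU being held in an exclusive compute mode or by a stray process (a common cause of this particular error) is the query below; this is only a debugging sketch, not something verified here:

$ nvidia-smi --query-gpu=index,name,compute_mode --format=csv
$ nvidia-smi    # also check the Processes table for anything already holding the GPU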

However, I am able to deploy other CUDA vectorAdd workloads, for example:

$ kubectl get po -n e2e-gpu-workload 
NAME                READY   STATUS             RESTARTS         AGE
cuda-vector-add     0/1     Completed          0                29m
cuda-vector-add-2   0/1     CrashLoopBackOff   10 (2m19s ago)   14m
cuda-vector-add-3   0/1     CrashLoopBackOff   6 (4m19s ago)    11m
cuda-vector-add-4   0/1     Completed          0                9m34s

Taking another look and printing out the container images for the completed and crashlooping pods, we see:

$ kubectl get po -n e2e-gpu-workload -o custom-columns=CONTAINER:.spec.containers[0].name,IMAGE:.spec.containers[0].image
CONTAINER        IMAGE
cuda-vectoradd   nvidia/samples:vectoradd-cuda11.2.1
cuda-vectoradd   nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.11.1
cuda-vectoradd   docker.io/anjia0532/cuda-vector-add:v0.1
cuda-vectoradd   docker.io/anjia0532/cuda-vector-add:v0.1

When we look at the logs of the completed pod, everything works fine:

$ kubectl logs -n e2e-gpu-workload cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Expanding the definition of the working pod, we get:

$ kubectl get po -n e2e-gpu-workload cuda-vector-add -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 9cbccd5849cd9f8a0b1670b84392deacfac4a55eaa05541ef83ab66df24de089
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
  creationTimestamp: "2022-08-10T22:59:38Z"
  name: cuda-vector-add
  namespace: e2e-gpu-workload
  resourceVersion: "16028"
  uid: 17064311-b8a5-4ff3-bd4c-c3a9665d2ec4
spec:
  containers:
  - image: nvidia/samples:vectoradd-cuda11.2.1
    imagePullPolicy: IfNotPresent
    name: cuda-vectoradd
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-d7zbx
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeSelector:
    nvidia.com/gpu.count: "1"

2. Steps to reproduce the issue

I installed the operator with the following command:

helm install --wait --generate-name  --set nfd.enabled=false --set driver.enabled=false --set toolkit.version=1.6.0-centos7  nvidia/gpu-operator

In my setup, I installed the NVIDIA device driver and CUDA on the host using the runfile, and I am using the operator to install the container runtime.
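
Since the cluster nodes run containerd, one way to double-check that the operator's toolkit actually registered the nvidia runtime is sketched below (it assumes the default containerd config path managed by the toolkit daemonset):

$ kubectl get runtimeclass nvidia
# on the node:
$ grep -B1 -A5 nvidia /etc/containerd/config.toml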

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces

  • kubernetes daemonset status: kubectl get ds --all-namespaces

  • If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

  • If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • Output of running a container on the GPU machine: docker run -it alpine echo foo

  • Docker configuration file: cat /etc/docker/daemon.json

  • Docker runtime configuration: docker info | grep runtime

  • NVIDIA shared directory: ls -la /run/nvidia

  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • NVIDIA driver directory: ls -la /run/nvidia/driver

  • kubelet logs journalctl -u kubelet > kubelet.logs
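
For convenience, the items above can be gathered in one pass with something like the following sketch (it just runs the commands listed; adjust pod names and paths to your environment):

$ kubectl get pods --all-namespaces > pods.txt
$ kubectl get ds --all-namespaces > daemonsets.txt
$ kubectl describe pod nvidia-cuda-validator-ccn65 > validator-describe.txt
$ kubectl logs nvidia-cuda-validator-ccn65 -c cuda-validation --previous > validator-logs.txt
$ cat /etc/docker/daemon.json        # only if the nodes run Docker
$ docker info | grep -i runtime      # only if the nodes run Docker
$ ls -la /run/nvidia /usr/local/nvidia/toolkit /run/nvidia/driver
$ journalctl -u kubelet > kubelet.logs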

When I run nvidia-smi on the host, I get the following, indicating that the driver and CUDA are installed correctly.

[root@ip-10-0-101-7 bin]# nvidia-smi
Wed Aug 10 23:40:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I can also run a CUDA container on the host directly:

[root@ip-10-0-101-7 bin]# ctr run --rm --gpus 0 -t docker.io/nvidia/samples:vectoradd-cuda11.2.1 add2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
[root@ip-10-0-101-7 bin]# ls -la /usr/local/nvidia/toolkit
total 8548
drwxr-xr-x. 3 root root    4096 Aug 10 23:04 .
drwxr-xr-x. 3 root root      21 Aug 10 23:04 ..
drwxr-xr-x. 3 root root      38 Aug 10 23:04 .config
lrwxrwxrwx. 1 root root      28 Aug 10 23:04 libnvidia-container.so.1 -> libnvidia-container.so.1.4.0
-rwxr-xr-x. 1 root root  179192 Aug 10 23:04 libnvidia-container.so.1.4.0
-rwxr-xr-x. 1 root root     154 Aug 10 23:04 nvidia-container-cli
-rwxr-xr-x. 1 root root   43024 Aug 10 23:04 nvidia-container-cli.real
-rwxr-xr-x. 1 root root     342 Aug 10 23:04 nvidia-container-runtime
-rwxr-xr-x. 1 root root     350 Aug 10 23:04 nvidia-container-runtime-experimental
-rwxr-xr-x. 1 root root 3991000 Aug 10 23:04 nvidia-container-runtime.experimental
lrwxrwxrwx. 1 root root      24 Aug 10 23:04 nvidia-container-runtime-hook -> nvidia-container-toolkit
-rwxr-xr-x. 1 root root 2359384 Aug 10 23:04 nvidia-container-runtime.real
-rwxr-xr-x. 1 root root     198 Aug 10 23:04 nvidia-container-toolkit
-rwxr-xr-x. 1 root root 2147896 Aug 10 23:04 nvidia-container-toolkit.real
[root@ip-10-0-101-7 bin]#

Any guidance would be greatly appreciated, and please let me know how I can help.

Thank you.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 33 (14 by maintainers)

Most upvoted comments

I can confirm that updating to the latest NVIDIA GPU driver version (515.x) has corrected this issue. I apologize for the red herring.
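
For anyone checking the same thing, a quick way to confirm which driver version each node ended up with after an upgrade (assuming GPU Feature Discovery is labelling the nodes, as it is in this cluster) is to read the driver labels it publishes:

$ kubectl get nodes -L nvidia.com/cuda.driver.major,nvidia.com/cuda.driver.minor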

@shivamerla it looks like the suggestion you gave in #391 worked. I was able to get my validator image working by removing the compat libs:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 684414b3c939cf65e1f48c78bd7d908231a52e51d97d4ce460613644c5bf2336
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
  creationTimestamp: "2022-08-16T21:24:14Z"
  generateName: nvidia-cuda-validator-
  labels:
    app: nvidia-cuda-validator
  name: nvidia-cuda-validator-kwhd7
  namespace: default
  ownerReferences:
  - apiVersion: nvidia.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterPolicy
    name: cluster-policy
    uid: 31f3f689-d3b4-4e6f-a839-74d1ee9baf69
  resourceVersion: "92351"
  uid: 7763960b-1d58-4499-b71c-6bc5c4efefd4
spec:
  containers:
  - args:
    - echo cuda workload validation is successful
    command:
    - sh
    - -c
    image: docker.io/faiq/gpu-operator-validator:v1.11.0
    imagePullPolicy: IfNotPresent
    name: nvidia-cuda-validator
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-b8f59
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - args:
    - vectorAdd
    command:
    - sh
    - -c
    image: docker.io/faiq/gpu-operator-validator:v1.11.0
    imagePullPolicy: IfNotPresent
    name: cuda-validation
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-b8f59
      readOnly: true
  nodeName: ip-10-0-113-252.us-west-2.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: nvidia-operator-validator
  serviceAccountName: nvidia-operator-validator
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-b8f59
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-08-16T21:24:16Z"
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-08-16T21:24:14Z"
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-08-16T21:24:14Z"
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-08-16T21:24:14Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://784a53092f791b414e708bc9ec22d86e5a92a3988a86910d28fda8f06b7e250a
    image: docker.io/faiq/gpu-operator-validator:v1.11.0
    imageID: docker.io/faiq/gpu-operator-validator@sha256:0ec2a74daaaea3717a95d9399cbc6e9486c38f60e9b54b826c7031e3c2d52dfa
    lastState: {}
    name: nvidia-cuda-validator
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://784a53092f791b414e708bc9ec22d86e5a92a3988a86910d28fda8f06b7e250a
        exitCode: 0
        finishedAt: "2022-08-16T21:24:16Z"
        reason: Completed
        startedAt: "2022-08-16T21:24:16Z"
  hostIP: 10.0.113.252
  initContainerStatuses:
  - containerID: containerd://e25ffce27e57e170ac19cb12c390dce72f92400f04603d57ba2e54757dbbcc84
    image: docker.io/faiq/gpu-operator-validator:v1.11.0
    imageID: docker.io/faiq/gpu-operator-validator@sha256:0ec2a74daaaea3717a95d9399cbc6e9486c38f60e9b54b826c7031e3c2d52dfa
    lastState: {}
    name: cuda-validation
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://e25ffce27e57e170ac19cb12c390dce72f92400f04603d57ba2e54757dbbcc84
        exitCode: 0
        finishedAt: "2022-08-16T21:24:16Z"
        reason: Completed
        startedAt: "2022-08-16T21:24:15Z"
  phase: Succeeded
  podIP: 192.168.20.19
  podIPs:
  - ip: 192.168.20.19
  qosClass: BestEffort
  startTime: "2022-08-16T21:24:14Z"

Removing the compat libs worked

$ docker run --entrypoint=/bin/bash -it faiq/gpu-operator-validator:v1.11.0 
[root@eb0f8ba184e2 /]# ls -R /usr/local/cuda-11.6/
/usr/local/cuda-11.6/:
cuda-11.6  lib64  targets

/usr/local/cuda-11.6/targets:
x86_64-linux

/usr/local/cuda-11.6/targets/x86_64-linux:
lib

/usr/local/cuda-11.6/targets/x86_64-linux/lib:
libOpenCL.so.1	libOpenCL.so.1.0  libOpenCL.so.1.0.0  libcudart.so.11.0  libcudart.so.11.6.55
[root@eb0f8ba184e2 /]# 
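
For anyone wanting to reproduce the workaround, the rebuild boils down to stripping the forward-compat libraries from the upstream validator image. A minimal sketch, assuming the compat libraries live under /usr/local/cuda-*/compat as in the standard CUDA base images, and using a placeholder image tag:

$ cat <<'EOF' > Dockerfile
# strip the CUDA forward-compat libraries from the upstream validator image
FROM nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.11.1
RUN rm -rf /usr/local/cuda-*/compat
EOF
$ docker build -t <your-registry>/gpu-operator-validator:nocompat .

The operator can then be pointed at the rebuilt image through the chart's validator image values (check the chart's values.yaml for the exact keys).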

What is the timeline for the next release?

  • OS Version: centos-7.9
  • GPU Operator Version:
  • Driver Pre-installed: yes
  • Container-Toolkit Pre-installed: no
  • Container-Toolkit Version: nvcr.io/nvidia/k8s/container-toolkit:v1.10.0-centos7
  • GPU Type: Tesla K80

Installed drivers:

[root@ip-10-0-115-55 bin]# nvidia-smi
Fri Aug 12 17:27:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |