gpu-operator: CUDA validators crashlooping while other cuda containers run fine
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn’t apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
- Are you running Kubernetes v1.13+?
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- Do you have i2c_core and ipmi_msghandler loaded on the nodes?
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
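For reference, the module and CRD items in the checklist can be verified with something like the following sketch (the clusterpolicies.nvidia.com CRD name is an assumption based on the operator's defaults):
# check that the kernel modules from the checklist are loaded on the GPU node
lsmod | grep -E 'i2c_core|ipmi_msghandler'
# check that the operator CRD and the ClusterPolicy resource exist
kubectl get crd clusterpolicies.nvidia.com
kubectl describe clusterpolicies --all-namespaces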
1. Issue or feature description:
The CUDA validators in my cluster are in Init:CrashLoopBackOff while other CUDA vector-add workloads run completely fine.
$ kubectl get po
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-cqjqx 1/1 Running 0 28m
gpu-feature-discovery-x4qh2 1/1 Running 0 28m
gpu-operator-77787587cf-cxnzl 1/1 Running 0 28m
nvidia-container-toolkit-daemonset-bk28j 1/1 Running 0 28m
nvidia-container-toolkit-daemonset-qvftc 1/1 Running 0 28m
nvidia-cuda-validator-ccn65 0/1 Init:CrashLoopBackOff 5 (23s ago) 3m29s
nvidia-cuda-validator-p7sgd 0/1 Init:CrashLoopBackOff 5 (16s ago) 3m18s
nvidia-dcgm-exporter-p5wrc 1/1 Running 0 28m
nvidia-dcgm-exporter-rvnz6 1/1 Running 0 28m
nvidia-device-plugin-daemonset-2bfmt 1/1 Running 0 28m
nvidia-device-plugin-daemonset-pvphw 1/1 Running 0 28m
nvidia-operator-validator-5qfx2 0/1 Init:2/4 5 (4m55s ago) 28m
nvidia-operator-validator-nc9n6 0/1 Init:2/4 5 (4m40s ago) 28m
On closer inspection, it's the vector add that's giving us an issue. Getting the logs, we see:
$ kubectl logs nvidia-cuda-validator-ccn65 -c cuda-validation --previous
Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!
[Vector addition of 50000 elements]
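As a side note, "all CUDA-capable devices are busy or unavailable" can also come from an Exclusive compute mode setting or from a stray process still holding the GPU; a couple of standard nvidia-smi checks, offered here only as a diagnostic sketch:
# show the compute mode of each GPU (Exclusive_Process can trigger this error
# when a second process touches the device)
nvidia-smi -q -d COMPUTE | grep -i "compute mode"
# list any processes currently holding GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv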
However, I am able to deploy other CUDA vector-add workloads, for example:
$ kubectl get po -n e2e-gpu-workload
NAME READY STATUS RESTARTS AGE
cuda-vector-add 0/1 Completed 0 29m
cuda-vector-add-2 0/1 CrashLoopBackOff 10 (2m19s ago) 14m
cuda-vector-add-3 0/1 CrashLoopBackOff 6 (4m19s ago) 11m
cuda-vector-add-4 0/1 Completed 0 9m34s
Taking another look and printing out the container images for the running and non-running pods:
$ kubectl get po -n e2e-gpu-workload -o custom-columns=CONTAINER:.spec.containers[0].name,IMAGE:.spec.containers[0].image
CONTAINER IMAGE
cuda-vectoradd nvidia/samples:vectoradd-cuda11.2.1
cuda-vectoradd nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.11.1
cuda-vectoradd docker.io/anjia0532/cuda-vector-add:v0.1
cuda-vectoradd docker.io/anjia0532/cuda-vector-add:v0.1
When we look at the logs of the completed pod, everything works fine:
$ kubectl logs -n e2e-gpu-workload cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Expanding the definition of the working pod, we get:
$ kubectl get po -n e2e-gpu-workload cuda-vector-add -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 9cbccd5849cd9f8a0b1670b84392deacfac4a55eaa05541ef83ab66df24de089
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
  creationTimestamp: "2022-08-10T22:59:38Z"
  name: cuda-vector-add
  namespace: e2e-gpu-workload
  resourceVersion: "16028"
  uid: 17064311-b8a5-4ff3-bd4c-c3a9665d2ec4
spec:
  containers:
  - image: nvidia/samples:vectoradd-cuda11.2.1
    imagePullPolicy: IfNotPresent
    name: cuda-vectoradd
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-d7zbx
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeSelector:
    nvidia.com/gpu.count: "1"
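Worth noting: the working pod above never requests nvidia.com/gpu; it only pins itself to a GPU node via the nodeSelector and relies on the default runtime exposing the device. A more conventional variant that asks the device plugin for the GPU explicitly would look roughly like the sketch below (the pod name and restart policy are illustrative; only the image is taken from above):
# hypothetical pod requesting the GPU through the device plugin instead of
# relying on the nodeSelector plus the default runtime
kubectl apply -n e2e-gpu-workload -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add-explicit
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvidia/samples:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF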
2. Steps to reproduce the issue
I installed the operator with the following command:
helm install --wait --generate-name --set nfd.enabled=false --set driver.enabled=false --set toolkit.version=1.6.0-centos7 nvidia/gpu-operator
In my setup I installed the NVIDIA device driver and CUDA on the host using the runfile, and I am using the operator to install the container runtime.
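Given the pinned toolkit.version, it may also be worth confirming which toolkit image the operator actually rolled out and that containerd picked up the nvidia runtime; a sketch (daemonset name taken from the pod listing above, containerd config path assumed to be the default):
# which container-toolkit image did the operator deploy?
kubectl get ds nvidia-container-toolkit-daemonset \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
# on the node: was the nvidia runtime wired into containerd? (path may differ)
grep -A3 -i nvidia /etc/containerd/config.toml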
3. Information to attach (optional if deemed irrelevant)
- kubernetes pods status: kubectl get pods --all-namespaces
- kubernetes daemonset status: kubectl get ds --all-namespaces
- If a pod/ds is in an error state or pending state: kubectl describe pod -n NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state: kubectl logs -n NAMESPACE POD_NAME
- Output of running a container on the GPU machine: docker run -it alpine echo foo
- Docker configuration file: cat /etc/docker/daemon.json
- Docker runtime configuration: docker info | grep runtime
- NVIDIA shared directory: ls -la /run/nvidia
- NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
- NVIDIA driver directory: ls -la /run/nvidia/driver
- kubelet logs: journalctl -u kubelet > kubelet.logs
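Since several of these were requested, here is a small sketch that collects them in one pass (it assumes kubectl, docker and journalctl are all available on the same GPU node and reuses the failing pod name from above; adjust as needed):
# gather the requested debug information into one directory
mkdir -p gpu-operator-debug && cd gpu-operator-debug
kubectl get pods --all-namespaces -o wide > pods.txt
kubectl get ds --all-namespaces > daemonsets.txt
kubectl describe pod nvidia-cuda-validator-ccn65 > cuda-validator-describe.txt
kubectl logs nvidia-cuda-validator-ccn65 -c cuda-validation --previous > cuda-validator-logs.txt 2>&1
docker info 2>/dev/null | grep -i runtime > docker-runtime.txt
cat /etc/docker/daemon.json > docker-daemon.json 2>/dev/null
ls -la /run/nvidia /usr/local/nvidia/toolkit /run/nvidia/driver > nvidia-dirs.txt 2>&1
journalctl -u kubelet > kubelet.logs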
When I run nvidia-smi on the host I get the following, indicating that the driver and CUDA are installed correctly.
[root@ip-10-0-101-7 bin]# nvidia-smi
Wed Aug 10 23:40:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 35C P8 30W / 149W | 0MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I can also run a CUDA container on the host:
[root@ip-10-0-101-7 bin]# ctr run --rm --gpus 0 -t docker.io/nvidia/samples:vectoradd-cuda11.2.1 add2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
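To isolate whether the failure is specific to the validator image rather than the cluster plumbing, the same ctr invocation could be pointed at the validator image from the pod listing above. This is only a sketch: it assumes the vectorAdd binary that the init container appears to run is on the image's PATH.
# pull and run the operator's validator image the same way as the sample above;
# invoking "vectorAdd" directly is an assumption about the image contents
ctr image pull nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.11.1
ctr run --rm --gpus 0 -t nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.11.1 validator-smoke-test vectorAdd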
[root@ip-10-0-101-7 bin]# ls -la /usr/local/nvidia/toolkit
total 8548
drwxr-xr-x. 3 root root 4096 Aug 10 23:04 .
drwxr-xr-x. 3 root root 21 Aug 10 23:04 ..
drwxr-xr-x. 3 root root 38 Aug 10 23:04 .config
lrwxrwxrwx. 1 root root 28 Aug 10 23:04 libnvidia-container.so.1 -> libnvidia-container.so.1.4.0
-rwxr-xr-x. 1 root root 179192 Aug 10 23:04 libnvidia-container.so.1.4.0
-rwxr-xr-x. 1 root root 154 Aug 10 23:04 nvidia-container-cli
-rwxr-xr-x. 1 root root 43024 Aug 10 23:04 nvidia-container-cli.real
-rwxr-xr-x. 1 root root 342 Aug 10 23:04 nvidia-container-runtime
-rwxr-xr-x. 1 root root 350 Aug 10 23:04 nvidia-container-runtime-experimental
-rwxr-xr-x. 1 root root 3991000 Aug 10 23:04 nvidia-container-runtime.experimental
lrwxrwxrwx. 1 root root 24 Aug 10 23:04 nvidia-container-runtime-hook -> nvidia-container-toolkit
-rwxr-xr-x. 1 root root 2359384 Aug 10 23:04 nvidia-container-runtime.real
-rwxr-xr-x. 1 root root 198 Aug 10 23:04 nvidia-container-toolkit
-rwxr-xr-x. 1 root root 2147896 Aug 10 23:04 nvidia-container-toolkit.real
[root@ip-10-0-101-7 bin]#
Any guidance would be greatly appreciated, and please let me know how I can help.
Thank you.
I can confirm that updating to the latest NVIDIA GPU driver version (515.x) has corrected this issue. I apologize for the red herring.
@shivamerla it looks like the suggestion you gave in #391 worked; I was able to get my validator image to work by removing the compat libs.
Removing the compat libs worked. What is the timeline for the next release?
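For anyone landing here later, a rough sketch of what "removing the compat libs" can look like in practice: rebuild the validator image without the CUDA forward-compatibility libraries and point the operator at it. The /usr/local/cuda/compat path, the registry name, and the validator.* chart values are assumptions, not what was actually done in this thread.
# rebuild the validator image without the CUDA forward-compatibility libraries
cat > Dockerfile <<'EOF'
FROM nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.11.1
RUN rm -rf /usr/local/cuda/compat
EOF
docker build -t my-registry.example.com/gpu-operator-validator:v1.11.1-nocompat .
docker push my-registry.example.com/gpu-operator-validator:v1.11.1-nocompat
# then point the operator at the rebuilt image via the chart's validator.* values
# (value names assumed from the chart defaults)
helm upgrade <release> nvidia/gpu-operator --reuse-values \
  --set validator.repository=my-registry.example.com \
  --set validator.image=gpu-operator-validator \
  --set validator.version=v1.11.1-nocompat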
OS Version: centos-7.9
GPU Operator Version:
Driver Pre-installed: yes
Container-Toolkit Pre-installed: no
Container-Toolkit Version: nvcr.io/nvidia/k8s/container-toolkit:v1.10.0-centos7
GPU Type: Tesla K80
installed drivers