gpu-operator: Redeploy of Nvidia GPU operator fails upon upgrade to OpenShift 4.7.9
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn’t apply to you and add more information where it makes sense.
### 1. Quick Debug Checklist
- [no] Are you running on an Ubuntu 18.04 node?
- [yes] Are you running Kubernetes v1.13+?
- [yes] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [no] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? (see the verification sketch below this checklist)
- [yes] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
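For the last two items, a quick verification sketch on OpenShift; the node name is a placeholder, and the ClusterPolicy CRD name below is an assumption based on the checklist wording:

```sh
# Check that the kernel modules are loaded on a worker node
oc debug node/<node-name> -- chroot /host lsmod | grep -E 'i2c_core|ipmi_msghandler'

# Confirm the ClusterPolicy CRD exists and describe the instances
oc get crd clusterpolicies.nvidia.com
kubectl describe clusterpolicies --all-namespaces
```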
### 1. Issue or feature description
- We checked the pods in the gpu-operator-resources namespace and found them stuck in the Init state, with some pods in CrashLoopBackOff.
- The logs of a pod in CrashLoopBackOff show that it tries to determine the OpenShift version by querying the API at 172.30.0.1:443. From a debug pod, a curl to 172.30.0.1:443 failed due to a proxy issue: the curl output showed that the NO_PROXY variable did not include the service CIDR, so the request was not bypassing the cluster proxy. The cluster proxy object itself was correct. (The checks we ran are sketched after this list.)
- Checking the DaemonSet behind those pods showed that the proxy environment variables were not picked up correctly. After adding 172.30.0.1 to the proxy-related environment variables (NO_PROXY) in that container spec, the clusterversion check against 172.30.0.1:443 succeeded, but the pod then failed with errors related to the NVIDIA repository, which are application specific.
- We also checked the YAML of the other pods stuck in the Init state; they define init containers. At the node level we looked at the logs of those init containers with crictl logs, but those errors were again application specific.
- I suggested checking this problem with NVIDIA first, since most of the pods are stuck in the Init phase with no confirmed reason or error. If NVIDIA indicates something needs to be checked on the OCP side, we can look into it once you provide their analysis and the exact error messages.
- For now, as you are checking with NVIDIA, I am keeping the case status as waiting on customer. Please update us if you need help from the OpenShift side.
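For completeness, a minimal sketch of the proxy checks described in the list above, assuming cluster access with `oc`; the node and DaemonSet names are placeholders:

```sh
# Cluster-wide proxy configuration (httpProxy, httpsProxy, noProxy)
oc get proxy/cluster -o yaml

# From a debug shell on an affected node, test the API service IP;
# with -v, curl shows whether the request goes through the proxy
oc debug node/<node-name> -- chroot /host curl -kv https://172.30.0.1:443/version

# Proxy-related environment variables actually rendered into the DaemonSet
oc -n gpu-operator-resources get ds <daemonset-name> \
  -o jsonpath='{.spec.template.spec.containers[*].env}'
```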
### 2. Steps to reproduce the issue

### 3. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status: `kubectl get pods --all-namespaces` (attached: allpods.txt, nvidiapods.txt)
- [x] kubernetes daemonset status: `kubectl get ds --all-namespaces` (attached: allds.txt)
- [ ] If a pod/ds is in an error state or pending state: `kubectl describe pod -n NAMESPACE POD_NAME`
- [ ] If a pod/ds is in an error state or pending state: `kubectl logs -n NAMESPACE POD_NAME`
- [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- [ ] Docker configuration file: `cat /etc/docker/daemon.json`
- [ ] Docker runtime configuration: `docker info | grep runtime`
- [ ] NVIDIA shared directory: `ls -la /run/nvidia`
- [ ] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
- [ ] NVIDIA driver directory: `ls -la /run/nvidia/driver`
- [ ] kubelet logs: `journalctl -u kubelet > kubelet.logs`
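If it is easier to gather everything in one pass, here is a rough convenience sketch (not an official must-gather; the gpu-operator-resources namespace and the output file names are assumptions):

```sh
#!/bin/sh
# Collect GPU-operator-related state into ./gpu-debug/ for attaching to the case
OUT=gpu-debug
mkdir -p "$OUT"

kubectl get pods --all-namespaces -o wide          > "$OUT/allpods.txt"
kubectl get ds --all-namespaces                    > "$OUT/allds.txt"
kubectl get pods -n gpu-operator-resources -o wide > "$OUT/nvidiapods.txt"

# Describe and collect logs for every pod in the operand namespace
for pod in $(kubectl get pods -n gpu-operator-resources -o name); do
  name=${pod#pod/}
  kubectl describe -n gpu-operator-resources "$pod"   > "$OUT/describe-$name.txt"
  kubectl logs -n gpu-operator-resources --all-containers=true "$pod" \
    > "$OUT/logs-$name.txt" 2>&1
done
```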
You can let them know that these are public registries and no login is required: https://ngc.nvidia.com/catalog/containers/nvidia:driver
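For example, an anonymous pull of the driver image should work; podman and the `<driver-version>-rhcos<x.y>` tag scheme below are assumptions, so substitute the versions that match the cluster:

```sh
# No NGC login should be required; nvcr.io/nvidia/driver is public
podman pull nvcr.io/nvidia/driver:<driver-version>-rhcos4.7
```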
cc: @zvonkok @kpouget for help with RH ticket: 02949372
@kpouget
- kind: ClusterServiceVersion
  name: gpu-operator-certified.v1.7.0
  namespace: openshift-operators
- apiVersion: operators.coreos.com/v1
  kind: OperatorCondition
  name: gpu-operator-certified.v1.7.0
  namespace: openshift-operators
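A quick way to inspect the state of those two related objects (assuming the `oc` CLI; names and namespace are taken from the snippet above):

```sh
# Install/upgrade status of the operator bundle
oc -n openshift-operators get csv gpu-operator-certified.v1.7.0 -o yaml

# Conditions the operator reports about itself
oc -n openshift-operators get operatorcondition gpu-operator-certified.v1.7.0 -o yaml
```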