gpu-operator: Redeploy of Nvidia GPU operator fails upon upgrade to OpenShift 4.7.9
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn’t apply to you and add more information where it makes sense.
### 1. Quick Debug Checklist
- [no] Are you running on an Ubuntu 18.04 node?
- [yes] Are you running Kubernetes v1.13+?
- [yes] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [no] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? (see the verification sketch below this checklist)
- [yes] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
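For the last two items, a quick verification sketch on OpenShift; the node name is a placeholder, and the ClusterPolicy CRD name below is an assumption based on the checklist wording:

```sh
# Check that the kernel modules are loaded on a worker node
oc debug node/<node-name> -- chroot /host lsmod | grep -E 'i2c_core|ipmi_msghandler'

# Confirm the ClusterPolicy CRD exists and describe the instances
oc get crd clusterpolicies.nvidia.com
kubectl describe clusterpolicies --all-namespaces
```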
### 1. Issue or feature description
- We checked the pods in the gpu-operator-resources namespace and found them stuck in the Init state, with some pods in CrashLoopBackOff.
- The logs of a pod in CrashLoopBackOff show that it tries to determine the OpenShift version by querying the API at 172.30.0.1:443. From a debug pod, a curl to 172.30.0.1:443 failed due to a proxy issue: the curl output showed that the NO_PROXY variable did not include the service CIDR, so the request was not bypassing the cluster proxy. The cluster proxy object itself was correct. (The checks we ran are sketched after this list.)
- Checking the DaemonSet behind those pods showed that the proxy environment variables were not picked up correctly. After adding 172.30.0.1 to the proxy-related environment variables (NO_PROXY) in that container spec, the clusterversion check against 172.30.0.1:443 succeeded, but the pod then failed with errors related to the NVIDIA repository, which are application specific.
- We also checked the YAML of the other pods stuck in the Init state; they define init containers. At the node level we looked at the logs of those init containers with crictl logs, but those errors were again application specific.
- I suggested checking this problem with NVIDIA first, since most of the pods are stuck in the Init phase with no confirmed reason or error. If NVIDIA indicates something needs to be checked on the OCP side, we can look into it once you provide their analysis and the exact error messages.
- For now, as you are checking with NVIDIA, I am keeping the case status as waiting on customer. Please update us if you need help from the OpenShift side.
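For completeness, a minimal sketch of the proxy checks described in the list above, assuming cluster access with `oc`; the node and DaemonSet names are placeholders:

```sh
# Cluster-wide proxy configuration (httpProxy, httpsProxy, noProxy)
oc get proxy/cluster -o yaml

# From a debug shell on an affected node, test the API service IP;
# with -v, curl shows whether the request goes through the proxy
oc debug node/<node-name> -- chroot /host curl -kv https://172.30.0.1:443/version

# Proxy-related environment variables actually rendered into the DaemonSet
oc -n gpu-operator-resources get ds <daemonset-name> \
  -o jsonpath='{.spec.template.spec.containers[*].env}'
```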
### 2. Steps to reproduce the issue

### 3. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status: `kubectl get pods --all-namespaces` (attached: allpods.txt, nvidiapods.txt)
- [x] kubernetes daemonset status: `kubectl get ds --all-namespaces` (attached: allds.txt)
- [ ] If a pod/ds is in an error state or pending state: `kubectl describe pod -n NAMESPACE POD_NAME`
- [ ] If a pod/ds is in an error state or pending state: `kubectl logs -n NAMESPACE POD_NAME`
- [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- [ ] Docker configuration file: `cat /etc/docker/daemon.json`
- [ ] Docker runtime configuration: `docker info | grep runtime`
- [ ] NVIDIA shared directory: `ls -la /run/nvidia`
- [ ] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
- [ ] NVIDIA driver directory: `ls -la /run/nvidia/driver`
- [ ] kubelet logs: `journalctl -u kubelet > kubelet.logs`
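If it is easier to gather everything in one pass, here is a rough convenience sketch (not an official must-gather; the gpu-operator-resources namespace and the output file names are assumptions):

```sh
#!/bin/sh
# Collect GPU-operator-related state into ./gpu-debug/ for attaching to the case
OUT=gpu-debug
mkdir -p "$OUT"

kubectl get pods --all-namespaces -o wide          > "$OUT/allpods.txt"
kubectl get ds --all-namespaces                    > "$OUT/allds.txt"
kubectl get pods -n gpu-operator-resources -o wide > "$OUT/nvidiapods.txt"

# Describe and collect logs for every pod in the operand namespace
for pod in $(kubectl get pods -n gpu-operator-resources -o name); do
  name=${pod#pod/}
  kubectl describe -n gpu-operator-resources "$pod"   > "$OUT/describe-$name.txt"
  kubectl logs -n gpu-operator-resources --all-containers=true "$pod" \
    > "$OUT/logs-$name.txt" 2>&1
done
```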
You can let them know that these are public registries and no login is required: https://ngc.nvidia.com/catalog/containers/nvidia:driver
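For example, an anonymous pull of the driver image should work; podman and the `<driver-version>-rhcos<x.y>` tag scheme below are assumptions, so substitute the versions that match the cluster:

```sh
# No NGC login should be required; nvcr.io/nvidia/driver is public
podman pull nvcr.io/nvidia/driver:<driver-version>-rhcos4.7
```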
cc: @zvonkok @kpouget for help with RH ticket: 02949372
@kpouget
- kind: ClusterServiceVersion
  name: gpu-operator-certified.v1.7.0
  namespace: openshift-operators
- apiVersion: operators.coreos.com/v1
  kind: OperatorCondition
  name: gpu-operator-certified.v1.7.0
  namespace: openshift-operators
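A quick way to inspect the state of those two related objects (assuming the `oc` CLI; names and namespace are taken from the snippet above):

```sh
# Install/upgrade status of the operator bundle
oc -n openshift-operators get csv gpu-operator-certified.v1.7.0 -o yaml

# Conditions the operator reports about itself
oc -n openshift-operators get operatorcondition gpu-operator-certified.v1.7.0 -o yaml
```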