gpu-operator: Failing to install NVIDIA GPU Operator v1.8.2 in OCP
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
- Are you running Kubernetes v1.13+?
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
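For the kernel-module item above, a quick check on each node might look like the following (a minimal sketch; on OpenShift you would typically run it from a debug pod started with `oc debug node/<node-name>`):

```sh
# Verify that the required kernel modules are loaded
# (module names taken from the checklist above)
lsmod | grep -E 'i2c_core|ipmi_msghandler'
```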
1. Issue or feature description
I am running OpenShift (version 4.6.48) on AWS and am trying to install version 1.8.2 of the NVIDIA GPU Operator following the instructions at Installing the NVIDIA GPU Operator — NVIDIA Cloud Native Technologies documentation. However, step 8 fails: no InstallPlan is returned by the following command:
oc get installplan -n nvidia-gpu-operator
I need to install v1.8.2 (or later) of the NVIDIA GPU Operator because of the bug reported against v1.8.0 at Appendix — NVIDIA Cloud Native Technologies documentation, which says: “GPU Operator v1.8.0 does not work well on RedHat OpenShift when a cluster-wide Proxy object is configured and causes constant restarts of driver container. This will be fixed in an upcoming patch release v1.8.2.”
Can you tell me how to debug why the installplan is not being created?
Thanks in advance for your help, Keith
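For anyone hitting the same wall, these are the standard OLM objects I would inspect first (a sketch; the subscription name below is the one used in the NVIDIA install docs and may differ in your cluster):

```sh
# Was the Subscription created, and what does its status say?
oc get subscription -n nvidia-gpu-operator
oc describe subscription gpu-operator-certified -n nvidia-gpu-operator

# Is the catalog source the Subscription points at healthy?
oc get catalogsource -n openshift-marketplace

# Did OLM get as far as creating a ClusterServiceVersion?
oc get csv -n nvidia-gpu-operator
```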
2. Steps to reproduce the issue
Steps 1 through 8 of https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html#installing-the-nvidia-gpu-operator-using-the-cli
3. Information to attach (optional if deemed irrelevant)
- kubernetes pods status: `kubectl get pods --all-namespaces`
- kubernetes daemonset status: `kubectl get ds --all-namespaces`
- If a pod/ds is in an error or pending state: `kubectl describe pod -n NAMESPACE POD_NAME`
- If a pod/ds is in an error or pending state: `kubectl logs -n NAMESPACE POD_NAME`
- Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- Docker configuration file: `cat /etc/docker/daemon.json`
- Docker runtime configuration: `docker info | grep runtime`
- NVIDIA shared directory: `ls -la /run/nvidia`
- NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
- NVIDIA driver directory: `ls -la /run/nvidia/driver`
- kubelet logs: `journalctl -u kubelet > kubelet.logs`
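If it helps, the cluster-side items above can be bundled into a small collection script (an illustrative sketch, not an official tool; it assumes `kubectl` is logged in with sufficient privileges, and the output file names are arbitrary):

```sh
#!/bin/sh
# Gather the cluster-side debug information listed above.
kubectl get pods --all-namespaces > pods.txt
kubectl get ds --all-namespaces > daemonsets.txt
kubectl get installplan,subscription,csv -n nvidia-gpu-operator -o yaml > olm-state.yaml
journalctl -u kubelet > kubelet.logs   # run this one on the node itself
```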
We solved the problem with @smithbk: the Catalog Operator was blocked by another operator that was failing to install, so the GPU Operator subscription never progressed 😕
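For future readers: a blockage like this can usually be spotted by looking at all Subscriptions/InstallPlans and at the Catalog Operator's logs (a sketch assuming the default OCP 4.x layout, where catalog-operator runs in the openshift-operator-lifecycle-manager namespace):

```sh
# Look for other operators stuck in a failed or pending state
oc get subscriptions --all-namespaces
oc get installplans --all-namespaces

# Check the Catalog Operator logs for resolution errors
oc logs -n openshift-operator-lifecycle-manager deployment/catalog-operator | tail -n 50
```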
@kpouget Yes, a debug session would be great. I’ll send an invite soon.
Here is the info you requested.