gpu-operator: Failing to install NVIDIA GPU Operator v1.8.2 in OCP

1. Quick Debug Checklist

Are you running on an Ubuntu 18.04 node?
Are you running Kubernetes v1.13+?
Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
Do you have i2c_core and ipmi_msghandler loaded on the nodes?
Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description

I am running OpenShift 4.6.48 on AWS and am trying to install version 1.8.2 of the NVIDIA GPU Operator by following the instructions in “Installing the NVIDIA GPU Operator” (NVIDIA Cloud Native Technologies documentation). However, step 8 fails: the following command returns no installplan:

oc get installplan -n nvidia-gpu-operator

I need to install v1.8.2 (or later) of the NVIDIA GPU Operator because of the v1.8.0 bug reported in the Appendix of the NVIDIA Cloud Native Technologies documentation, which says: “GPU Operator v1.8.0 does not work well on RedHat OpenShift when a cluster-wide Proxy object is configured and causes constant restarts of driver container. This will be fixed in an upcoming patch release v1.8.2.”

Can you tell me how to debug why the installplan is not being created?

Thanks in advance for your help, Keith

2. Steps to reproduce the issue

Steps 1 through 8 from (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html#installing-the-nvidia-gpu-operator-using-the-cli)

3. Information to attach (optional if deemed irrelevant)

kubernetes pods status: kubectl get pods --all-namespaces
kubernetes daemonset status: kubectl get ds --all-namespaces
If a pod/ds is in an error state or pending state: kubectl describe pod -n NAMESPACE POD_NAME
If a pod/ds is in an error state or pending state: kubectl logs -n NAMESPACE POD_NAME
Output of running a container on the GPU machine: docker run -it alpine echo foo
Docker configuration file: cat /etc/docker/daemon.json
Docker runtime configuration: docker info | grep runtime
NVIDIA shared directory: ls -la /run/nvidia
NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
NVIDIA driver directory: ls -la /run/nvidia/driver
kubelet logs: journalctl -u kubelet > kubelet.logs

About this issue

State: closed
Created 2 years ago
Comments: 28 (16 by maintainers)

Most upvoted comments

We solved the problem with @smithbk: the Catalog Operator was blocked by another operator that was failing to install, so the GPU Operator subscription never progressed 😕

kpouget on Apr 5, 2022
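
The thread does not record which commands were used to find the blocking operator. As a rough sketch of one way to look for such a blocker, the helper below filters a three-column listing (namespace, name, phase) of ClusterServiceVersions and prints every entry whose phase is not Succeeded. The function name non_succeeded_csvs is made up for this illustration, and checking CSV phases is an assumption about where the failure would show up; a stuck operator may instead appear in subscription conditions or in the catalog-operator logs.

```shell
#!/usr/bin/env bash
# Sketch only: list ClusterServiceVersions (CSVs) whose phase is not "Succeeded".
# Expects three whitespace-separated columns on stdin: NAMESPACE NAME PHASE.
# (non_succeeded_csvs is a hypothetical helper name, not part of oc or OLM.)
non_succeeded_csvs() {
  awk 'NF >= 3 && $3 != "Succeeded" { print $1 "/" $2 ": " $3 }'
}

# Possible usage against a live cluster (requires a logged-in `oc`):
#   oc get csv --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase \
#     | non_succeeded_csvs
#
# Related places to look (also assumptions about where the blocker surfaces):
#   oc get subscriptions.operators.coreos.com --all-namespaces
#   oc logs -n openshift-operator-lifecycle-manager deployment/catalog-operator
```

Any line this prints names an operator that has not finished installing and is worth inspecting with oc describe before retrying the GPU Operator subscription.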

@kpouget Yes, a debug session would be great. I’ll send an invite soon.

Here is the info you requested.

$ oc get operatorgroups -n openshift-operators
NAME               AGE
global-operators   453d
$ oc describe subscription.operators.coreos.com/gpu-operator-certified -n openshift-operators
Name:         gpu-operator-certified
Namespace:    openshift-operators
Labels:       operators.coreos.com/gpu-operator-certified.openshift-operators=
Annotations:  <none>
API Version:  operators.coreos.com/v1alpha1
Kind:         Subscription
Metadata:
  Creation Timestamp:  2022-03-31T16:46:37Z
  Generation:          1
  Managed Fields:
    API Version:  operators.coreos.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:channel:
        f:installPlanApproval:
        f:name:
        f:source:
        f:sourceNamespace:
        f:startingCSV:
    Manager:      Mozilla
    Operation:    Update
    Time:         2022-03-31T16:46:37Z
    API Version:  operators.coreos.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:operators.coreos.com/gpu-operator-certified.openshift-operators:
    Manager:      olm
    Operation:    Update
    Time:         2022-03-31T16:46:37Z
    API Version:  operators.coreos.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:catalogHealth:
        f:conditions:
        f:lastUpdated:
    Manager:         catalog
    Operation:       Update
    Time:            2022-03-31T16:46:38Z
  Resource Version:  2269824991
  Self Link:         /apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions/gpu-operator-certified
  UID:               bc8ea336-3279-45cb-838f-4fe7d400982a
Spec:
  Channel:                v1.7
  Install Plan Approval:  Automatic
  Name:                   gpu-operator-certified
  Source:                 certified-operators
  Source Namespace:       openshift-marketplace
  Starting CSV:           gpu-operator-certified.v1.7.1
Status:
  Catalog Health:
    Catalog Source Ref:
      API Version:       operators.coreos.com/v1alpha1
      Kind:              CatalogSource
      Name:              certified-operators
      Namespace:         openshift-marketplace
      Resource Version:  2269721282
      UID:               7837283d-62f7-4f1b-8912-f19f5d7ffe63
    Healthy:             true
    Last Updated:        2022-03-31T16:46:39Z
    Catalog Source Ref:
      API Version:       operators.coreos.com/v1alpha1
      Kind:              CatalogSource
      Name:              community-operators
      Namespace:         openshift-marketplace
      Resource Version:  2269247723
      UID:               e170f748-bba5-4539-ab27-c344fef84098
    Healthy:             true
    Last Updated:        2022-03-31T16:46:39Z
    Catalog Source Ref:
      API Version:       operators.coreos.com/v1alpha1
      Kind:              CatalogSource
      Name:              ibm-operator-catalog
      Namespace:         openshift-marketplace
      Resource Version:  2219599298
      UID:               e36dfbbd-2f21-4ea9-bc50-e18a9b477568
    Healthy:             true
    Last Updated:        2022-03-31T16:46:39Z
    Catalog Source Ref:
      API Version:       operators.coreos.com/v1alpha1
      Kind:              CatalogSource
      Name:              opencloud-operators
      Namespace:         openshift-marketplace
      Resource Version:  2219604854
      UID:               95de8c0f-e519-45fa-bb3f-79fb358518d4
    Healthy:             true
    Last Updated:        2022-03-31T16:46:39Z
    Catalog Source Ref:
      API Version:       operators.coreos.com/v1alpha1
      Kind:              CatalogSource
      Name:              redhat-marketplace
      Namespace:         openshift-marketplace
      Resource Version:  2269464219
      UID:               c03ac814-7ca8-448f-8099-474132fe801e
    Healthy:             true
    Last Updated:        2022-03-31T16:46:39Z
    Catalog Source Ref:
      API Version:       operators.coreos.com/v1alpha1
      Kind:              CatalogSource
      Name:              redhat-operators
      Namespace:         openshift-marketplace
      Resource Version:  2269469807
      UID:               c2680887-55c0-46ee-855a-b9e8725937d4
    Healthy:             true
    Last Updated:        2022-03-31T16:46:39Z
  Conditions:
    Last Transition Time:  2022-03-31T16:46:39Z
    Message:               all available catalogsources are healthy
    Reason:                AllCatalogSourcesHealthy
    Status:                False
    Type:                  CatalogSourcesUnhealthy
  Last Updated:            2022-03-31T16:46:39Z
Events:                    <none>

smithbk on Apr 5, 2022
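
The describe output above shows a subscription in openshift-operators on channel v1.7 with starting CSV gpu-operator-certified.v1.7.1, while the issue targets v1.8.2 and queries the nvidia-gpu-operator namespace. The thread does not say whether that difference mattered (the reported root cause was another operator blocking the Catalog Operator), but when reading long describe output it can help to pull out just those fields. A minimal sketch, assuming the layout of oc describe subscription shown above; subscription_summary is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch only: summarise the namespace, channel and starting CSV from
# `oc describe subscription ...` text supplied on stdin.
# Relies on the indentation seen in the output above: top-level fields start
# in column 0 and Spec fields are indented by two spaces.
subscription_summary() {
  awk '
    /^Namespace:/      { print "namespace=" $2 }
    /^  Channel:/      { print "channel=" $2 }
    /^  Starting CSV:/ { print "startingCSV=" $3 }
  '
}

# Possible usage (requires a logged-in `oc`):
#   oc describe subscription.operators.coreos.com/gpu-operator-certified \
#     -n openshift-operators | subscription_summary
```

Comparing the printed channel and starting CSV with the version you intend to install is a quick sanity check before digging into the Catalog Operator itself.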