gpu-operator: Nvidia container toolkit daemonset pod fails with ErrImagePull
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn’t apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
- Are you running Kubernetes v1.13+?
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- Do you have i2c_core and ipmi_msghandler loaded on the nodes? (see the check below)
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
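A quick way to run the last two checks, assuming shell access to a node for the kernel module check:

```sh
# On the node: confirm the required kernel modules are loaded
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# Against the cluster: confirm the ClusterPolicy CRD and CR are present
kubectl describe clusterpolicies --all-namespaces
```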
1. Issue or feature description
We are using GPU operator v1.6.2 in one of our E2E tests in cluster-api-provider-aws. It was working two days ago, but it has now started failing: the nvidia-container-toolkit-daemonset pod fails to come up with the error below:
Failed to pull image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": failed to resolve reference "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59: not found
Has anything changed recently that could be causing this issue?
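One way to confirm the failure independently of the cluster is to ask the registry for the pinned digest directly (a sketch; assumes a Docker CLI with manifest support and reuses the digest from the error above):

```sh
# Query nvcr.io for the pinned digest; a "not found"/"no such manifest" error confirms
# the image was removed or re-pushed upstream rather than a node-side pull problem
docker manifest inspect nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
```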
2. Steps to reproduce the issue
Reference manifest used.
For the plugin-validator error: it needs an available GPU on the node, so it can fail if other pods are already consuming the GPUs. If you want to disable plugin validation, you can set it as below via the validator component in the ClusterPolicy.
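The original snippet is not reproduced here; as a sketch of what such a setting can look like, recent gpu-operator releases expose a WITH_WORKLOAD env var on the validator's plugin section (the field path, env var, and default ClusterPolicy name cluster-policy are assumptions to verify against your release's values.yaml):

```sh
# Hypothetical: turn off the workload part of plugin validation on the live ClusterPolicy
kubectl patch clusterpolicy cluster-policy --type merge -p \
  '{"spec":{"validator":{"plugin":{"env":[{"name":"WITH_WORKLOAD","value":"false"}]}}}}'
```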
To get the templates for each release, you can run helm template against the chart, e.g.:
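A minimal sketch, assuming the chart is pulled from the NVIDIA NGC helm repository (repo URL and chart name are the publicly documented ones, but verify them for your setup):

```sh
# Add the NVIDIA helm repo and render the full manifest set for a given release
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm template gpu-operator nvidia/gpu-operator --version v1.11.1 --include-crds \
  > gpu-operator-v1.11.1.yaml
```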
Below is the helm-rendered template for v1.11.1, for reference.
Basically, all the manifests you have here need to be updated to the latest version (the Roles, the NFD manifests, the CRD, and the CR built from values.yaml).
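To see the CR defaults that values.yaml carries for a given release (same assumed chart coordinates as above):

```sh
# Print the chart's default values for v1.11.1; the ClusterPolicy CR is built from these
helm show values nvidia/gpu-operator --version v1.11.1
```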