kops: Unable to install NVIDIA GPU operator inside the cluster
/kind bug
1. What kops version are you running? The command kops version, will display
this information.
1.22.1
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:16:05Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using? aws
4. What commands did you run? What is the simplest way to reproduce this issue?
helm install --wait --generate-name \
nvidia/gpu-operator \
--set operator.defaultRuntime="containerd"
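For reference, the chart normally comes from NVIDIA's Helm repository, so a repo-add step usually precedes the install above. A minimal sketch (the repository URL is the one NVIDIA documents and is an assumption here, not part of the original report):

# Add and refresh the NVIDIA Helm repository before installing the gpu-operator chart
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update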
5. What happened after the commands executed? After creating the cluster with the following command:
export NODE_SIZE="t3.2xlarge"
export MASTER_SIZE="t3.medium"
export ZONES="us-west-2a"
kops create cluster <cluster_name> \
  --node-count 2 \
  --zones $ZONES \
  --node-size $NODE_SIZE \
  --master-size $MASTER_SIZE \
  --master-zones $ZONES
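Creating the cluster spec is only half of the workflow; a hedged sketch of the usual follow-up, assuming the same <cluster_name> placeholder, to actually apply and validate the cluster:

# Apply the generated spec to AWS and wait until all nodes and system pods report healthy
kops update cluster <cluster_name> --yes
kops validate cluster <cluster_name> --wait 10m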
I tried to add the NVIDIA GPU operator from the linked instructions. Only the GPU operator and node-feature-discovery pods come up; the rest are stuck.
default gpu-operator-1636665352-node-feature-discovery-master-647f9fn9t 1/1 Running 0 2m24s
default gpu-operator-1636665352-node-feature-discovery-worker-hjmdx 1/1 Running 0 2m24s
default gpu-operator-1636665352-node-feature-discovery-worker-v7kmp 1/1 Running 0 2m24s
default gpu-operator-1636665352-node-feature-discovery-worker-xw4n9 1/1 Running 0 2m24s
default gpu-operator-599764446c-mwvcm 1/1 Running 0 2m24s
gpu-operator-resources gpu-feature-discovery-l5qgk 0/1 Terminating 0 19s
gpu-operator-resources nvidia-dcgm-exporter-29fp5 0/1 Terminating 0 19s
gpu-operator-resources nvidia-dcgm-t286v 0/1 Terminating 0 19s
gpu-operator-resources nvidia-device-plugin-daemonset-x7gkb 0/1 Terminating 0 19s
gpu-operator-resources nvidia-driver-daemonset-69987 0/1 Init:0/1 1 2m7s
gpu-operator-resources nvidia-driver-daemonset-t2scl 0/1 Init:0/1 2 2m7s
kube-system dns-controller-c459588c4-vj82n 1/1 Running 0 8m37s
kube-system etcd-manager-events-ip-172-20-58-46.us-west-2.compute.internal 1/1 Running 0 8m43s
kube-system etcd-manager-main-ip-172-20-58-46.us-west-2.compute.internal 1/1 Running 0 8m38s
kube-system kops-controller-5lmgs 1/1 Running 0 8m37s
kube-system kube-apiserver-ip-172-20-58-46.us-west-2.compute.internal 2/2 Running 0 8m12s
kube-system kube-controller-manager-ip-172-20-58-46.us-west-2.compute.internal 1/1 Running 3 9m16s
kube-system kube-dns-696cb84c7-8sr5l 2/3 Running 0 4s
kube-system kube-dns-696cb84c7-fwx2l 3/3 Terminating 0 110s
kube-system kube-dns-696cb84c7-wk2bh 3/3 Running 0 54s
kube-system kube-dns-autoscaler-55f8f75459-pg4tz 1/1 Running 0 54s
kube-system kube-proxy-ip-172-20-45-43.us-west-2.compute.internal 1/1 Running 0 6m41s
kube-system kube-proxy-ip-172-20-58-46.us-west-2.compute.internal 1/1 Running 0 8m19s
kube-system kube-proxy-ip-172-20-63-91.us-west-2.compute.internal 1/1 Running 0 6m58s
kube-system kube-scheduler-ip-172-20-58-46.us-west-2.compute.internal 1/1 Running 0 8m32s
I see the following events, including warnings, in the namespace:
3m22s Normal Pulling pod/nvidia-operator-validator-xmh7b Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2"
3m5s Normal Pulled pod/nvidia-operator-validator-xmh7b Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" in 16.456875783s
3m5s Warning Failed pod/nvidia-operator-validator-xmh7b Error: cannot find volume "driver-install-path" to mount into container "driver-validation"
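The commands below are one way to pull the same diagnostics; the pod name is the one from the event listing above, and the namespace assumes the operator's default gpu-operator-resources:

# Sort recent events by time and inspect the validator pod that reported the missing volume
kubectl -n gpu-operator-resources get events --sort-by=.lastTimestamp
kubectl -n gpu-operator-resources describe pod nvidia-operator-validator-xmh7b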
Then I manually edited the containerd config.toml to match what was given in the linked reference. The driver and container-toolkit daemonsets now run and the feature-discovery pods are scheduled, but the rest stay stuck in Init (inspection commands follow the listing below).
gpu-operator-resources gpu-feature-discovery-2lrn6 0/1 Init:0/1 0 7m10s
gpu-operator-resources gpu-feature-discovery-4lnd8 0/1 Init:0/1 0 7m11s
gpu-operator-resources nvidia-container-toolkit-daemonset-r9929 1/1 Running 0 7m12s
gpu-operator-resources nvidia-container-toolkit-daemonset-vl9sx 1/1 Running 0 7m11s
gpu-operator-resources nvidia-dcgm-9qr7x 0/1 Init:0/1 0 7m10s
gpu-operator-resources nvidia-dcgm-exporter-c2csj 0/1 Init:0/1 0 7m12s
gpu-operator-resources nvidia-dcgm-exporter-qbgcq 0/1 Init:0/1 0 7m11s
gpu-operator-resources nvidia-dcgm-kttcn 0/1 Init:0/1 0 7m12s
gpu-operator-resources nvidia-device-plugin-daemonset-c9h2s 0/1 Init:0/1 0 7m12s
gpu-operator-resources nvidia-device-plugin-daemonset-pt9xg 0/1 Init:0/1 0 7m11s
gpu-operator-resources nvidia-driver-daemonset-pgc6r 1/1 Running 0 7m12s
gpu-operator-resources nvidia-driver-daemonset-vbgct 1/1 Running 0 7m11s
gpu-operator-resources nvidia-operator-validator-4vpfl 0/1 Init:0/4 0 7m11s
gpu-operator-resources nvidia-operator-validator-72xf7 0/1 Init:0/4 0 7m12s
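To see which init step is blocking, describing one of the stuck validators and tailing the driver daemonset is usually enough; a sketch using names from the listing above:

# Show init-container status for a stuck validator and tail the driver install logs
kubectl -n gpu-operator-resources describe pod nvidia-operator-validator-4vpfl
kubectl -n gpu-operator-resources logs ds/nvidia-driver-daemonset --tail=50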
6. What did you expect to happen?
I expected the GPU operator to bring up all of its pods, like this:
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default gpu-operator-1636674406-node-feature-discovery-master-6cb7g2k9z 1/1 Running 0 5m30s
default gpu-operator-1636674406-node-feature-discovery-worker-kmbxg 1/1 Running 0 5m30s
default gpu-operator-599764446c-vtml2 1/1 Running 0 5m30s
gpu-operator-resources gpu-feature-discovery-bxhc8 1/1 Running 0 5m13s
gpu-operator-resources nvidia-container-toolkit-daemonset-khlrr 1/1 Running 0 5m13s
gpu-operator-resources nvidia-cuda-validator-hczvc 0/1 Completed 0 73s
gpu-operator-resources nvidia-dcgm-exporter-r9dd5 1/1 Running 0 5m13s
gpu-operator-resources nvidia-dcgm-x6hjs 1/1 Running 0 5m13s
gpu-operator-resources nvidia-device-plugin-daemonset-96fg9 1/1 Running 0 5m13s
gpu-operator-resources nvidia-device-plugin-validator-5hrhm 0/1 Completed 0 12s
gpu-operator-resources nvidia-driver-daemonset-j6knj 1/1 Running 0 5m13s
gpu-operator-resources nvidia-operator-validator-72q5f 1/1 Running 0 5m13s
kube-system calico-kube-controllers-85c867d48-rjsbp 1/1 Running 0 7m32s
kube-system calico-node-f4p7n 1/1 Running 0 7m32s
kube-system coredns-74ff55c5b-frnfn 1/1 Running 0 7m38s
kube-system coredns-74ff55c5b-rb9kp 1/1 Running 0 7m38s
kube-system etcd-ip-172-31-10-61 1/1 Running 0 7m47s
kube-system kube-apiserver-ip-172-31-10-61 1/1 Running 0 7m47s
kube-system kube-controller-manager-ip-172-31-10-61 1/1 Running 0 7m47s
kube-system kube-proxy-fccbv 1/1 Running 0 7m38s
kube-system kube-scheduler-ip-172-31-10-61 1/1 Running 0 7m47s
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
kind: Cluster
metadata:
  creationTimestamp: "2021-11-12T02:48:52Z"
  generation: 1
  name: abc.k8s.local
spec:
  api:
    loadBalancer:
      class: Classic
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://abc-k8s/abc.k8s.local
  containerRuntime: containerd
  containerd:
    nvidiaGPU:
      enabled: true
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.22.2
  masterInternalName: api.internal.abc.k8s.local
  masterPublicName: api.abc.k8s.local
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-west-2a
    type: Public
    zone: us-west-2a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-11-12T02:48:53Z"
  labels:
    kops.k8s.io/cluster: abc.k8s.local
  name: master-us-west-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211015
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: required
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a
  role: Master
  subnets:
  - us-west-2a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-11-12T02:48:53Z"
  labels:
    kops.k8s.io/cluster: abc.k8s.local
  name: nodes-us-west-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211015
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3.2xlarge
  maxSize: 2
  minSize: 2
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2a
  role: Node
  subnets:
  - us-west-2a
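Since the manifest enables containerd's nvidiaGPU integration, two node-level checks help separate runtime problems from hardware problems; a sketch, assuming SSH access to a worker node:

# Confirm an nvidia runtime stanza is present in the effective containerd configuration
sudo containerd config dump | grep -A 3 'runtimes.nvidia'
# Confirm the node actually exposes an NVIDIA GPU; t3 instances do not have one
# (only GPU instance families such as g4dn or p3 do)
nvidia-smi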
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?
About this issue
- State: closed
- Created 3 years ago
- Comments: 22 (9 by maintainers)
From the issue comments: "Yes, I removed that. The GPU test is also working."
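For context, the "GPU test" mentioned above is typically a small CUDA sample pod that requests nvidia.com/gpu; a minimal sketch (the pod name and the sample image tag are assumptions based on NVIDIA's published samples, not taken from this issue):

# Run a CUDA vector-add sample against one GPU, then read its log; expect "Test PASSED"
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Give scheduling and the image pull a moment before reading the result
sleep 60
kubectl logs gpu-smoke-test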