kops: Unable to install NVIDIA GPU operator inside the cluster
/kind bug
1. What kops version are you running? The command kops version, will display
this information.
1.22.1
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:16:05Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using? aws
4. What commands did you run? What is the simplest way to reproduce this issue?
helm install --wait --generate-name \
nvidia/gpu-operator \
--set operator.defaultRuntime="containerd"
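For reference, the chart normally comes from NVIDIA's Helm repository, so a repo-add step usually precedes the install above. A minimal sketch (the repository URL is the one NVIDIA documents and is an assumption here, not part of the original report):

# Add and refresh the NVIDIA Helm repository before installing the gpu-operator chart
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update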
5. What happened after the commands executed? After creating the cluster with the following command:
export NODE_SIZE="t3.2xlarge"
export MASTER_SIZE="t3.medium"
export ZONES="us-west-2a"
kops create cluster <cluster_name> \
  --node-count 2 \
  --zones $ZONES \
  --node-size $NODE_SIZE \
  --master-size $MASTER_SIZE \
  --master-zones $ZONES
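Creating the cluster spec is only half of the workflow; a hedged sketch of the usual follow-up, assuming the same <cluster_name> placeholder, to actually apply and validate the cluster:

# Apply the generated spec to AWS and wait until all nodes and system pods report healthy
kops update cluster <cluster_name> --yes
kops validate cluster <cluster_name> --wait 10m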
I tried to add the NVIDIA GPU operator from the linked instructions. Only the GPU operator and node-feature-discovery pods come up; the rest are stuck.
default gpu-operator-1636665352-node-feature-discovery-master-647f9fn9t 1/1 Running 0 2m24s
default gpu-operator-1636665352-node-feature-discovery-worker-hjmdx 1/1 Running 0 2m24s
default gpu-operator-1636665352-node-feature-discovery-worker-v7kmp 1/1 Running 0 2m24s
default gpu-operator-1636665352-node-feature-discovery-worker-xw4n9 1/1 Running 0 2m24s
default gpu-operator-599764446c-mwvcm 1/1 Running 0 2m24s
gpu-operator-resources gpu-feature-discovery-l5qgk 0/1 Terminating 0 19s
gpu-operator-resources nvidia-dcgm-exporter-29fp5 0/1 Terminating 0 19s
gpu-operator-resources nvidia-dcgm-t286v 0/1 Terminating 0 19s
gpu-operator-resources nvidia-device-plugin-daemonset-x7gkb 0/1 Terminating 0 19s
gpu-operator-resources nvidia-driver-daemonset-69987 0/1 Init:0/1 1 2m7s
gpu-operator-resources nvidia-driver-daemonset-t2scl 0/1 Init:0/1 2 2m7s
kube-system dns-controller-c459588c4-vj82n 1/1 Running 0 8m37s
kube-system etcd-manager-events-ip-172-20-58-46.us-west-2.compute.internal 1/1 Running 0 8m43s
kube-system etcd-manager-main-ip-172-20-58-46.us-west-2.compute.internal 1/1 Running 0 8m38s
kube-system kops-controller-5lmgs 1/1 Running 0 8m37s
kube-system kube-apiserver-ip-172-20-58-46.us-west-2.compute.internal 2/2 Running 0 8m12s
kube-system kube-controller-manager-ip-172-20-58-46.us-west-2.compute.internal 1/1 Running 3 9m16s
kube-system kube-dns-696cb84c7-8sr5l 2/3 Running 0 4s
kube-system kube-dns-696cb84c7-fwx2l 3/3 Terminating 0 110s
kube-system kube-dns-696cb84c7-wk2bh 3/3 Running 0 54s
kube-system kube-dns-autoscaler-55f8f75459-pg4tz 1/1 Running 0 54s
kube-system kube-proxy-ip-172-20-45-43.us-west-2.compute.internal 1/1 Running 0 6m41s
kube-system kube-proxy-ip-172-20-58-46.us-west-2.compute.internal 1/1 Running 0 8m19s
kube-system kube-proxy-ip-172-20-63-91.us-west-2.compute.internal 1/1 Running 0 6m58s
kube-system kube-scheduler-ip-172-20-58-46.us-west-2.compute.internal 1/1 Running 0 8m32s
I see the following events, including warnings, in the namespace:
3m22s Normal Pulling pod/nvidia-operator-validator-xmh7b Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2"
3m5s Normal Pulled pod/nvidia-operator-validator-xmh7b Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" in 16.456875783s
3m5s Warning Failed pod/nvidia-operator-validator-xmh7b Error: cannot find volume "driver-install-path" to mount into container "driver-validation"
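The commands below are one way to pull the same diagnostics; the pod name is the one from the event listing above, and the namespace assumes the operator's default gpu-operator-resources:

# Sort recent events by time and inspect the validator pod that reported the missing volume
kubectl -n gpu-operator-resources get events --sort-by=.lastTimestamp
kubectl -n gpu-operator-resources describe pod nvidia-operator-validator-xmh7b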
Then I manually edited the containerd config.toml to match what was given in the linked reference. The driver and container-toolkit daemonsets now run and the feature-discovery pods are scheduled, but the rest stay stuck in Init (inspection commands follow the listing below).
gpu-operator-resources gpu-feature-discovery-2lrn6 0/1 Init:0/1 0 7m10s
gpu-operator-resources gpu-feature-discovery-4lnd8 0/1 Init:0/1 0 7m11s
gpu-operator-resources nvidia-container-toolkit-daemonset-r9929 1/1 Running 0 7m12s
gpu-operator-resources nvidia-container-toolkit-daemonset-vl9sx 1/1 Running 0 7m11s
gpu-operator-resources nvidia-dcgm-9qr7x 0/1 Init:0/1 0 7m10s
gpu-operator-resources nvidia-dcgm-exporter-c2csj 0/1 Init:0/1 0 7m12s
gpu-operator-resources nvidia-dcgm-exporter-qbgcq 0/1 Init:0/1 0 7m11s
gpu-operator-resources nvidia-dcgm-kttcn 0/1 Init:0/1 0 7m12s
gpu-operator-resources nvidia-device-plugin-daemonset-c9h2s 0/1 Init:0/1 0 7m12s
gpu-operator-resources nvidia-device-plugin-daemonset-pt9xg 0/1 Init:0/1 0 7m11s
gpu-operator-resources nvidia-driver-daemonset-pgc6r 1/1 Running 0 7m12s
gpu-operator-resources nvidia-driver-daemonset-vbgct 1/1 Running 0 7m11s
gpu-operator-resources nvidia-operator-validator-4vpfl 0/1 Init:0/4 0 7m11s
gpu-operator-resources nvidia-operator-validator-72xf7 0/1 Init:0/4 0 7m12s
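To see which init step is blocking, describing one of the stuck validators and tailing the driver daemonset is usually enough; a sketch using names from the listing above:

# Show init-container status for a stuck validator and tail the driver install logs
kubectl -n gpu-operator-resources describe pod nvidia-operator-validator-4vpfl
kubectl -n gpu-operator-resources logs ds/nvidia-driver-daemonset --tail=50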
6. What did you expect to happen?
I expected the GPU operator to bring up all of its pods, like this:
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default gpu-operator-1636674406-node-feature-discovery-master-6cb7g2k9z 1/1 Running 0 5m30s
default gpu-operator-1636674406-node-feature-discovery-worker-kmbxg 1/1 Running 0 5m30s
default gpu-operator-599764446c-vtml2 1/1 Running 0 5m30s
gpu-operator-resources gpu-feature-discovery-bxhc8 1/1 Running 0 5m13s
gpu-operator-resources nvidia-container-toolkit-daemonset-khlrr 1/1 Running 0 5m13s
gpu-operator-resources nvidia-cuda-validator-hczvc 0/1 Completed 0 73s
gpu-operator-resources nvidia-dcgm-exporter-r9dd5 1/1 Running 0 5m13s
gpu-operator-resources nvidia-dcgm-x6hjs 1/1 Running 0 5m13s
gpu-operator-resources nvidia-device-plugin-daemonset-96fg9 1/1 Running 0 5m13s
gpu-operator-resources nvidia-device-plugin-validator-5hrhm 0/1 Completed 0 12s
gpu-operator-resources nvidia-driver-daemonset-j6knj 1/1 Running 0 5m13s
gpu-operator-resources nvidia-operator-validator-72q5f 1/1 Running 0 5m13s
kube-system calico-kube-controllers-85c867d48-rjsbp 1/1 Running 0 7m32s
kube-system calico-node-f4p7n 1/1 Running 0 7m32s
kube-system coredns-74ff55c5b-frnfn 1/1 Running 0 7m38s
kube-system coredns-74ff55c5b-rb9kp 1/1 Running 0 7m38s
kube-system etcd-ip-172-31-10-61 1/1 Running 0 7m47s
kube-system kube-apiserver-ip-172-31-10-61 1/1 Running 0 7m47s
kube-system kube-controller-manager-ip-172-31-10-61 1/1 Running 0 7m47s
kube-system kube-proxy-fccbv 1/1 Running 0 7m38s
kube-system kube-scheduler-ip-172-31-10-61 1/1 Running 0 7m47s
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
kind: Cluster
metadata:
  creationTimestamp: "2021-11-12T02:48:52Z"
  generation: 1
  name: abc.k8s.local
spec:
  api:
    loadBalancer:
      class: Classic
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://abc-k8s/abc.k8s.local
  containerRuntime: containerd
  containerd:
    nvidiaGPU:
      enabled: true
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.22.2
  masterInternalName: api.internal.abc.k8s.local
  masterPublicName: api.abc.k8s.local
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-west-2a
    type: Public
    zone: us-west-2a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-11-12T02:48:53Z"
  labels:
    kops.k8s.io/cluster: abc.k8s.local
  name: master-us-west-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211015
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: required
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a
  role: Master
  subnets:
  - us-west-2a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-11-12T02:48:53Z"
  labels:
    kops.k8s.io/cluster: abc.k8s.local
  name: nodes-us-west-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211015
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3.2xlarge
  maxSize: 2
  minSize: 2
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2a
  role: Node
  subnets:
  - us-west-2a
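Since the manifest enables containerd's nvidiaGPU integration, two node-level checks help separate runtime problems from hardware problems; a sketch, assuming SSH access to a worker node:

# Confirm an nvidia runtime stanza is present in the effective containerd configuration
sudo containerd config dump | grep -A 3 'runtimes.nvidia'
# Confirm the node actually exposes an NVIDIA GPU; t3 instances do not have one
# (only GPU instance families such as g4dn or p3 do)
nvidia-smi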
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?
About this issue
- State: closed
- Created 3 years ago
- Comments: 22 (9 by maintainers)
From the issue comments: "Yes, I removed that. The GPU test is also working."
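For context, the "GPU test" mentioned above is typically a small CUDA sample pod that requests nvidia.com/gpu; a minimal sketch (the pod name and the sample image tag are assumptions based on NVIDIA's published samples, not taken from this issue):

# Run a CUDA vector-add sample against one GPU, then read its log; expect "Test PASSED"
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Give scheduling and the image pull a moment before reading the result
sleep 60
kubectl logs gpu-smoke-test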