kops: GPU bootstrap method not setting capacity
I’m using kops to deploy on AWS.
After updating the cluster, I see a few more lib files under /usr/lib so it seems like the bootstrap container did run.
However, the p2.xlarge instance still doesn’t have the capacity set:
{
"name": "ip-1-2-3-4.us-west-2.compute.internal",
"selfLink": "/api/v1/nodesip-1-2-3-4.us-west-2.compute.internal",
"uid": "xxx",
"resourceVersion": "104430",
"creationTimestamp": "2017-05-04T00:19:15Z",
"labels": {
"beta.kubernetes.io/arch": "amd64",
"beta.kubernetes.io/instance-type": "p2.xlarge",
"beta.kubernetes.io/os": "linux",
"failure-domain.beta.kubernetes.io/region": "us-west-2",
"failure-domain.beta.kubernetes.io/zone": "us-west-2c",
"kubernetes.io/hostname": "ip-1-2-3-4.us-west-2.compute.internal",
"kubernetes.io/role": "node",
"node-role.kubernetes.io/node": ""
},
"annotations": {
"node.alpha.kubernetes.io/ttl": "0",
"volumes.kubernetes.io/controller-managed-attach-detach": "true"
},
"Status": {
"Capacity": {
"alpha.kubernetes.io/nvidia-gpu": "0",
"cpu": "4",
"memory": "62884272Ki",
"pods": "110"
},
"Allocatable": {
"alpha.kubernetes.io/nvidia-gpu": "0",
"cpu": "4",
"memory": "62781872Ki",
"pods": "110"
},
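To re-check the GPU count after each cluster change without eyeballing the whole node object, a small helper along these lines might do. It is only a sketch: the resource name `alpha.kubernetes.io/nvidia-gpu` comes from the output above, while `gpu_counts`, `_get_ci`, and `sample_node` are hypothetical names made up for illustration. Keys are matched case-insensitively because the paste above shows `Status`/`Capacity` capitalized.

```python
import json

# Resource name taken from the node output above.
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def _get_ci(mapping, key):
    """Look up a key case-insensitively; return {} when it is missing."""
    for k, v in mapping.items():
        if k.lower() == key.lower():
            return v
    return {}


def gpu_counts(node):
    """Return (capacity, allocatable) GPU counts for one node object."""
    status = _get_ci(node, "status")
    capacity = _get_ci(status, "capacity")
    allocatable = _get_ci(status, "allocatable")
    return (
        int(capacity.get(GPU_RESOURCE, 0)),
        int(allocatable.get(GPU_RESOURCE, 0)),
    )


# Trimmed-down copy of the node object shown above (hypothetical sample).
sample_node = json.loads("""
{
  "Status": {
    "Capacity": {"alpha.kubernetes.io/nvidia-gpu": "0", "cpu": "4"},
    "Allocatable": {"alpha.kubernetes.io/nvidia-gpu": "0", "cpu": "4"}
  }
}
""")

if __name__ == "__main__":
    cap, alloc = gpu_counts(sample_node)
    print(f"GPU capacity={cap}, allocatable={alloc}")
```

A node whose GPU was detected would be expected to report a non-zero count for both values.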
In case this only gets applied at startup, I’ve tried terminating all the VMs in my cluster. No dice.
I’ve also tried running kops edit ig ... on the GPU node’s instance group to add the label alpha.kubernetes.io/nvidia-gpu-name="Tesla K80", then cycling the GPU node (terminating it and letting it restart). Again, no dice.
Although I ran kops update cluster ... and kops rolling-update cluster ..., I’m not sure the Accelerators:true setting is taking effect.
If I look at my k8s API server pod, I see the startup command is:
/usr/local/bin/kube-apiserver --address=127.0.0.1 --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,ResourceQuota --allow-privileged=true --anonymous-auth=false --apiserver-count=1 --authorization-mode=AlwaysAllow --basic-auth-file=/srv/kubernetes/basic_auth.csv --client-ca-file=/srv/kubernetes/ca.crt --cloud-provider=aws --etcd-servers-overrides=/events#http://127.0.0.1:4002 --etcd-servers=http://127.0.0.1:4001 --insecure-port=8080 --kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP,LegacyHostIP --secure-port=443 --service-cluster-ip-range=100.64.0.0/13 --storage-backend=etcd2 --tls-cert-file=/srv/kubernetes/server.cert --tls-private-key-file=/srv/kubernetes/server.key --token-auth-file=/srv/kubernetes/known_tokens.csv --v=2 1>>/var/log/kube-apiserver.log 2>&1
which doesn’t have the --feature-gates flag. So perhaps it’s not actually getting set?
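A quick way to confirm whether a feature gate made it onto a component's command line is to parse the flags rather than scan the long string by eye. The sketch below assumes the usual `--feature-gates=Key=value,...` form; `parse_flags`, `feature_gates`, `apiserver_cmd`, and `hypothetical_cmd` are illustrative names, and the second command line is invented for comparison, not taken from a real cluster. The same check could be pointed at the kubelet command line on the GPU node, since the kubelet is the component that reports node capacity.

```python
import shlex


def parse_flags(command_line):
    """Map each --flag=value token to its value; bare --flag tokens map to ""."""
    flags = {}
    for token in shlex.split(command_line):
        if token.startswith("--"):
            name, _, value = token[2:].partition("=")
            flags[name] = value
    return flags


def feature_gates(command_line):
    """Return the feature gates as a dict; empty if --feature-gates is absent."""
    raw = parse_flags(command_line).get("feature-gates", "")
    gates = {}
    for pair in raw.split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            gates[key] = value.lower() == "true"
    return gates


# Shortened copy of the kube-apiserver command quoted above.
apiserver_cmd = (
    "/usr/local/bin/kube-apiserver --address=127.0.0.1 --allow-privileged=true "
    "--cloud-provider=aws --etcd-servers-overrides=/events#http://127.0.0.1:4002 "
    "--v=2 1>>/var/log/kube-apiserver.log 2>&1"
)

# Hypothetical command line with the gate set, for comparison only.
hypothetical_cmd = "/usr/local/bin/kubelet --feature-gates=Accelerators=true --v=2"

if __name__ == "__main__":
    print(feature_gates(apiserver_cmd))
    print(feature_gates(hypothetical_cmd))
```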
I’m running client/server:
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
and kops:
$ kops version
Version 1.6.0-beta.1 (git-77f222d)
About this issue
- State: closed
- Created 7 years ago
- Comments: 34 (13 by maintainers)
awesome it worked
Just to help others, as I ran into the same problem: the best solution for now is to create a custom AMI for GPU instance groups unless it is handled in kops properly (currently GPU detection is perhaps only valid for p2 instances, and there is the race).
and kubelet detects capacity properly when new igs are created
Thanks @diwu1989 - your details in the other ticket allowed me to get this up and running.
That said … I am looking forward to when kops can support this seamlessly.