kops: GPU bootstrap method not setting capacity

I’ve followed the docs here

I’m using kops to deploy on AWS.

After updating the cluster, I see a few more lib files under /usr/lib so it seems like the bootstrap container did run.

However, the p2.xlarge instance still doesn’t have the capacity set:

        {
            "name": "ip-1-2-3-4.us-west-2.compute.internal",
            "selfLink": "/api/v1/nodesip-1-2-3-4.us-west-2.compute.internal",
            "uid": "xxx",
            "resourceVersion": "104430",
            "creationTimestamp": "2017-05-04T00:19:15Z",
            "labels": {
                "beta.kubernetes.io/arch": "amd64",
                "beta.kubernetes.io/instance-type": "p2.xlarge",
                "beta.kubernetes.io/os": "linux",
                "failure-domain.beta.kubernetes.io/region": "us-west-2",
                "failure-domain.beta.kubernetes.io/zone": "us-west-2c",
                "kubernetes.io/hostname": "ip-1-2-3-4.us-west-2.compute.internal",
                "kubernetes.io/role": "node",
                "node-role.kubernetes.io/node": ""
            },
            "annotations": {
                "node.alpha.kubernetes.io/ttl": "0",
                "volumes.kubernetes.io/controller-managed-attach-detach": "true"
            },
            "Status": {
                "Capacity": {
                    "alpha.kubernetes.io/nvidia-gpu": "0",
                    "cpu": "4",
                    "memory": "62884272Ki",
                    "pods": "110"
                },
                "Allocatable": {
                    "alpha.kubernetes.io/nvidia-gpu": "0",
                    "cpu": "4",
                    "memory": "62781872Ki",
                    "pods": "110"
                },

In case this only gets applied at startup, I’ve tried terminating all the VMs in my cluster … no dice.
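To re-check the reported GPU count after each restart without reading the whole node dump, something like the following could help. It is a rough text-based sketch, not a proper JSON parser: `gpu_capacity` is a hypothetical helper name, and it prints only the first `alpha.kubernetes.io/nvidia-gpu` value it finds on stdin.

```shell
# Hypothetical helper: print the first alpha.kubernetes.io/nvidia-gpu value
# found in node JSON read from stdin. Prints nothing if the key is absent.
# Assumes pretty-printed JSON with one key per line, as in the dump above.
gpu_capacity() {
  sed -n 's|.*"alpha\.kubernetes\.io/nvidia-gpu"[[:space:]]*:[[:space:]]*"\([0-9][0-9]*\)".*|\1|p' \
    | head -n 1
}

# Example usage (the node name is a placeholder):
#   kubectl get node ip-1-2-3-4.us-west-2.compute.internal -o json | gpu_capacity
```

A value of 0, as in the output above, means the node still advertises no GPUs.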

I’ve also tried doing kops edit ig ... for the gpu node to add the label alpha.kubernetes.io/nvidia-gpu-name="Tesla K80" and cycling the gpu node (terminating/allowing restart), again no dice.

Although I ran kops update cluster... and kops rolling-update cluster ..., I’m not sure the Accelerators:true setting is taking effect.

If I look at my k8s api server pod … I see the startup command is …

      /usr/local/bin/kube-apiserver --address=127.0.0.1 --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,ResourceQuota --allow-privileged=true --anonymous-auth=false --apiserver-count=1 --authorization-mode=AlwaysAllow --basic-auth-file=/srv/kubernetes/basic_auth.csv --client-ca-file=/srv/kubernetes/ca.crt --cloud-provider=aws --etcd-servers-overrides=/events#http://127.0.0.1:4002 --etcd-servers=http://127.0.0.1:4001 --insecure-port=8080 --kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP,LegacyHostIP --secure-port=443 --service-cluster-ip-range=100.64.0.0/13 --storage-backend=etcd2 --tls-cert-file=/srv/kubernetes/server.cert --tls-private-key-file=/srv/kubernetes/server.key --token-auth-file=/srv/kubernetes/known_tokens.csv --v=2 1>>/var/log/kube-apiserver.log 2>&1

which doesn’t have the feature gates flag. So perhaps it’s not actually getting set?
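One way to check this more directly is to test a component's command line for the flag. The sketch below assumes the gate is passed as `--feature-gates=Accelerators=true`; `has_feature_gate` is a hypothetical helper name. Since the kubelet is the component that reports node capacity, it may be worth running the check against the kubelet on the GPU node as well as the API server.

```shell
# Hypothetical helper: succeed if the command line in $1 enables gate $2
# via a --feature-gates=Name=true,... argument.
has_feature_gate() {
  printf '%s\n' "$1" \
    | tr ' ' '\n' \
    | grep -E -q "^--feature-gates=(.*,)?$2=true(,.*)?\$"
}

# Example usage on a node (assumes a Linux ps that supports -C):
#   if has_feature_gate "$(ps -o args= -C kubelet)" Accelerators; then
#     echo "Accelerators gate is set on kubelet"
#   else
#     echo "Accelerators gate is NOT set on kubelet"
#   fi
```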

I’m running client/server:

Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

and kops:

$ kops version
Version 1.6.0-beta.1 (git-77f222d)

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 34 (13 by maintainers)

Most upvoted comments

awesome it worked

Just to help others, as I ran into the same problem: the best solution for now is to create a custom AMI for GPU instance groups until this is handled properly in kops (currently GPU detection is perhaps only valid for p2 instances, and there is the race).

nvidia-smi -pm 1 || true
nvidia-smi -acp 0 || true
nvidia-smi --auto-boost-default=0 || true
nvidia-smi --auto-boost-permission=0 || true
nvidia-modprobe -u -c=0 -m || true

and kubelet detects capacity properly when new igs are created
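To confirm the commands above had an effect before kubelet starts, a quick check is to count the GPU device nodes. This is a sketch under an assumption: that kubelet of this era discovers GPUs by looking for `/dev/nvidiaN` device nodes at startup, so a count of 0 at that moment would explain a capacity of 0. `count_gpu_devices` is a hypothetical helper name.

```shell
# Hypothetical helper: count device nodes named nvidia<digit>... in a
# directory (default /dev). nvidiactl and nvidia-uvm are not counted.
count_gpu_devices() {
  dev_dir=${1:-/dev}
  n=0
  for f in "$dev_dir"/nvidia[0-9]*; do
    # An unmatched glob stays literal, so confirm the path really exists.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Example usage:
#   count_gpu_devices        # checks /dev
```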

Thanks @diwu1989 - your details in the other ticket allowed me to get this up and running.

That said … I am looking forward to when kops can support this seamlessly.