kops: Unable to update cluster from 1.20 -> 1.21, error running task "BootstrapClientTask/BootstrapClient" failed to verify token (received status code 403 from STS)

/kind bug

1. What kops version are you running? (The command kops version will display this information.) 1.21.1

2. What Kubernetes version are you running? (kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.) 1.21.4

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue? Updated an existing cluster: kops 1.20.2 -> 1.21.1 and k8s 1.20.9 -> 1.21.4.

5. What happened after the commands executed? The worker nodes never join the cluster (the control plane starts and runs with no issues). The kops-configuration.service on every worker node ends up failing, unable to get past the errors below:

Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: I0910 18:10:54.522051    1371 service.go:360] Enabling service "systemd-timesyncd"
Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: W0910 18:10:54.689371    1371 executor.go:139] error running task "BootstrapClientTask/BootstrapClient" (9m30s remaining to succeed): bootstrap returned status code 403: failed to verify token: received status code 403 from STS: <ErrorResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: I0910 18:10:54.689404    1371 executor.go:111] Tasks: 73 done / 81 total; 1 can run
Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: I0910 18:10:54.689426    1371 executor.go:186] Executing task "BootstrapClientTask/BootstrapClient": BootstrapClientTask
Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: W0910 18:10:54.963891    1371 executor.go:139] error running task "BootstrapClientTask/BootstrapClient" (9m29s remaining to succeed): bootstrap returned status code 403: failed to verify token: received status code 403 from STS: <ErrorResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: I0910 18:10:54.963917    1371 executor.go:155] No progress made, sleeping before retrying 1 task(s)
Sep 10 18:11:04 ip-172-17-8-239 nodeup[1371]: I0910 18:11:04.965107    1371 executor.go:111] Tasks: 73 done / 81 total; 1 can run
Sep 10 18:11:04 ip-172-17-8-239 nodeup[1371]: I0910 18:11:04.965157    1371 executor.go:186] Executing task "BootstrapClientTask/BootstrapClient": BootstrapClientTask
Sep 10 18:11:05 ip-172-17-8-239 nodeup[1371]: W0910 18:11:05.268361    1371 executor.go:139] error running task "BootstrapClientTask/BootstrapClient" (9m19s remaining to succeed): bootstrap returned status code 403: failed to verify token: received status code 403 from STS: <ErrorResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
Sep 10 18:11:05 ip-172-17-8-239 nodeup[1371]: I0910 18:11:05.268387    1371 executor.go:155] No progress made, sleeping before retrying 1 task(s)
...
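For reference, a minimal way to pull these nodeup logs yourself on an affected worker node (assuming you can SSH to it, e.g. through a bastion, since the nodes are in private subnets) is:

# Check the failing unit and tail its journal on the affected node
sudo systemctl status kops-configuration.service
sudo journalctl -u kops-configuration.service --no-pager | tail -n 50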

6. What did you expect to happen? New nodes to join the cluster.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2019-11-19T15:28:59Z"
  generation: 23
  name: name.cluster.com
spec:
  additionalPolicies:
    master: '[{"Effect":"Allow","Action":["sts:AssumeRole"],"Resource":["arn:aws:iam::000000000000:role/elasticsearch_logs"]}]'
    node: '[{"Effect":"Allow","Action":["sts:AssumeRole"],"Resource":["arn:aws:iam::000000000000:role/elasticsearch_logs"]}]'
  additionalSans:
  - name.cluster.com.othername.com
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://state-store/name.cluster.com
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 10.119.1.0/24
  kubernetesVersion: 1.21.4
  masterInternalName: api.internal.name.cluster.com
  masterPublicName: api.name.cluster.com
  networkCIDR: 172.16.0.0/22
  networkID: vpc-00000000000000000
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 10.119.1.0/24
  subnets:
  - cidr: 172.16.0.0/27
    egress: nat-00000000000000000
    name: us-west-2a
    type: Private
    zone: us-west-2a
  - cidr: 172.16.0.32/27
    egress: nat-11111111111111111
    name: us-west-2b
    type: Private
    zone: us-west-2b
  - cidr: 172.16.0.64/27
    egress: nat-22222222222222222
    name: us-west-2c
    type: Private
    zone: us-west-2c
  - cidr: 172.16.0.96/28
    name: utility-us-west-2a
    type: Utility
    zone: us-west-2a
  - cidr: 172.16.0.112/28
    name: utility-us-west-2b
    type: Utility
    zone: us-west-2b
  - cidr: 172.16.0.128/28
    name: utility-us-west-2c
    type: Utility
    zone: us-west-2c
  topology:
    dns:
      type: Private
    masters: private
    nodes: private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-11-19T15:28:59Z"
  generation: 17
  labels:
    kops.k8s.io/cluster: name.cluster.com
  name: master-us-west-2c
spec:
  additionalSecurityGroups:
  - sg-00000000000000000
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210907
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2c
  role: Master
  rootVolumeEncryption: true
  rootVolumeOptimization: true
  rootVolumeSize: 32
  subnets:
  - us-west-2c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-11-19T15:29:00Z"
  generation: 21
  labels:
    kops.k8s.io/cluster: name.cluster.com
  name: nodes
spec:
  additionalSecurityGroups:
  - sg-00000000000000000
  - sg-11111111111111111
  detailedInstanceMonitoring: true
  externalLoadBalancers:
  - targetGroupArn: arn:aws:elasticloadbalancing:us-west-2:000000000000:targetgroup/tg1/0000000000000000
  - targetGroupArn: arn:aws:elasticloadbalancing:us-west-2:000000000000:targetgroup/tg2/1111111111111111
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210907
  machineType: r5.2xlarge
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  rootVolumeEncryption: true
  rootVolumeOptimization: true
  rootVolumeSize: 100
  subnets:
  - us-west-2a
  - us-west-2b
  - us-west-2c
  suspendProcesses:
  - AZRebalance

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here. n/a

9. Anything else do we need to know? After manually updating the nodes' "Launch template" to use the previous existing version (created with kops 1.20.2 / k8s 1.20.9) instead of "Latest", the newly created nodes (running k8s 1.20.9) are still able to join the cluster.
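For reference, the AWS CLI equivalent of that manual change is roughly the sketch below; the ASG name, launch template name, and version number are placeholders you would need to replace with your own:

# Pin the nodes ASG to a specific (previous) launch template version instead of $Latest
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name nodes.name.cluster.com \
  --launch-template LaunchTemplateName=nodes.name.cluster.com,Version=3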

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 30 (9 by maintainers)

Most upvoted comments

For whoever faces this issue and is stuck unable to operate a production cluster: downgrading kops to 1.20.2 (kops itself, not only k8s) fixes the issue. After downloading kops 1.20.2, you can use this command:

kops update cluster --allow-kops-downgrade --yes

then recreate the worker nodes so they pick up the new launch template.
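For completeness, a minimal sketch of the whole downgrade sequence; the release URL follows the usual kops release layout, and the cluster name/state store are taken from the manifest above, so adjust them to your environment:

# Download the kops 1.20.2 binary (Linux amd64 release asset)
curl -Lo kops https://github.com/kubernetes/kops/releases/download/v1.20.2/kops-linux-amd64
chmod +x kops
# Re-render the cloud resources (launch templates included) with the older kops version
./kops update cluster --name name.cluster.com --state s3://state-store --allow-kops-downgrade --yes
# Roll the worker nodes so they come up from the downgraded launch template
./kops rolling-update cluster --name name.cluster.com --state s3://state-store --instance-group nodes --yes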

And for others: please DO NOT use kops 1.21 if you don't want to run into this issue like I did.

This may happen if you roll the nodes before the masters (or if CAS spins up a new node before all masters have rolled).

The reason for this is the switch to use regional STS endpoints. See https://github.com/kubernetes/kops/pull/12043 and associated issue.
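If you want to check the regional STS path from an affected node, a quick sketch (assuming the AWS CLI is installed on the node; the endpoint below is the us-west-2 one, matching the manifest above) is:

# Confirm the instance role can authenticate against the regional STS endpoint
aws sts get-caller-identity --endpoint-url https://sts.us-west-2.amazonaws.com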

@caiconkhicon I didn’t experience any issues updating from kops 1.19 (k8s 1.19) to kops 1.20 (k8s 1.20), and it was during that update that I also moved from docker to containerd and from gp2 to gp3 (maybe you can try this gradual update instead of going straight from 1.19 to 1.21).

The only issues I ran into are the ones mentioned in this issue, when updating kops 1.20 (k8s 1.20) to kops 1.21 (k8s 1.21):

  1. nodes running k8s 1.21 unable to join the cluster after the master was rolled out for the first time
  2. flannel daemonset definition not updated: maybe not a “real” issue when updating from 1.20 to 1.21, because the selector is the only thing that changed here (unless some bootstrap processes fail to complete because the kubectl update triggered by the protokube command keeps failing). I still believe this should be sorted out somehow (even if it’s just by documenting that the flannel daemonset must be manually deleted), as it will certainly block future flannel daemonset updates for clusters created with kops <=1.20: the daemonset definition can’t be “patched” due to the immutable selector field having changed.

NOTE 1: @olemarkus, another thing I tried was to reboot (instead of terminating) the updated master node right after the first rollout, but that didn’t do the trick: newly updated nodes were still unable to join the cluster. Terminating the master instance seemed to be the only way to trigger whatever made it work again. Just guessing here, but maybe there’s some etcd configuration/entry that only gets refreshed when an updated master is “refreshed” for the second time?

NOTE 2: While I was experiencing the issues I also created a cluster from scratch using kops 1.22 with the exact same specs, and everything started just fine, so it seems to be related to the cluster update only.

Hi @olemarkus, I actually always watch all pods running in the cluster (kubectl get pod -Aw) after kops update cluster --yes, for at least a couple of minutes, and wait for any pods to be recreated (usually dns and flannel) before starting the rollout.

As a side note, and maybe not related to this specific issue, I think it’s worth mentioning that there was a change in kops 1.21 which broke the kube-flannel-ds daemonset update, because the immutable selector field was modified: role.kubernetes.io/networking: "1" is no longer present (commit b44065c, file upup/models/cloudup/resources/addons/networking.flannel/k8s-1.12.yaml.template).

The fix/workaround is actually pretty simple: just delete the flannel daemonset itself (kubectl --namespace=kube-system delete daemonsets.apps kube-flannel-ds) and it will get recreated in less than a minute by protokube.
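Put together, a sketch of that check-and-recreate sequence looks like this:

# Inspect the selector currently stored on the daemonset (the old one for clusters created with kops <=1.20)
kubectl --namespace=kube-system get daemonset kube-flannel-ds -o jsonpath='{.spec.selector.matchLabels}'
# Delete it; protokube should recreate it with the new selector within a minute or so
kubectl --namespace=kube-system delete daemonsets.apps kube-flannel-ds
# Watch for the recreated daemonset to become ready
kubectl --namespace=kube-system get daemonset kube-flannel-ds -w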

@caiconkhicon, if @olemarkus’ command doesn’t do it, just try terminating the master instance(s) directly from the AWS EC2 console or the AWS CLI - the ASG will bring up a new master node, and after the new master joins the cluster, updated nodes should be able to join the cluster too.
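A sketch of doing that with the AWS CLI; the tag filters follow the usual kops tagging convention and the instance ID is a placeholder, so double-check both against your own cluster:

# Find the running master instance(s) for this cluster
aws ec2 describe-instances \
  --filters "Name=tag:KubernetesCluster,Values=name.cluster.com" \
            "Name=tag:k8s.io/role/master,Values=1" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" --output text
# Terminate it; the master ASG will bring up a replacement
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0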

Some further updates after updating more than 15 Kubernetes clusters: this happened to me in every k8s cluster that had a single master node and used flannel-vxlan networking. Multi-master clusters running flannel-vxlan and single-master clusters not using flannel (using kubenet instead) were not affected.

@salavessa: I am facing the exact same issue. Can you please specify the exact commands you used? In my case I already rolled all the master nodes, but I still face the issue.

Hi @olemarkus

The master node (the cluster has only one) was actually updated, up, and running (for many hours) before the new nodes were created, so I can only imagine that there was some issue during master startup which failed to refresh/update whatever prevented the nodes from joining due to the reported error.

I’d like to thank you for the tip, because I am now able to get the new nodes to join the cluster, but I had to force a rolling update of the existing master node.

Thanks!

UPDATE: Just wanted to add that the exact same issue happened again on a different cluster, and forcing a rolling update of the master node also resolved it there, so I’ll be adding that extra step to this kops update for every cluster.
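For anyone else who needs to force that master rolling update even when kops reports no changes, a minimal sketch (cluster name and state store are placeholders taken from the manifest above) is:

# Force a replacement of the single master, even if kops considers it up to date
kops rolling-update cluster --name name.cluster.com --state s3://state-store \
  --instance-group master-us-west-2c --force --yes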