kops: Unable to update cluster from 1.20 -> 1.21, error running task "BootstrapClientTask/BootstrapClient" failed to verify token (received status code 403 from STS)
/kind bug
1. What kops version are you running? The command `kops version` will display this information.
1.21.1
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
1.21.4
3. What cloud provider are you using? AWS
4. What commands did you run? What is the simplest way to reproduce this issue? Update existing cluster kops 1.20.2->1.21.1 and k8s 1.20.9->1.21.4.
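For reference, a rough sketch of the upgrade flow, with the cluster name and state store taken from the manifest below; the exact flags used may have differed:
export KOPS_STATE_STORE=s3://state-store
kops upgrade cluster name.cluster.com --yes          # bump kubernetesVersion in the cluster spec
kops update cluster name.cluster.com --yes           # render new launch templates and other cloud changes
kops rolling-update cluster name.cluster.com --yes   # replace instances with the new configuration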
5. What happened after the commands executed?
The worker nodes never join the cluster (the control plane started and runs with no issues).
The kops-configuration.service on every worker node eventually fails, unable to complete the task below:
Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: I0910 18:10:54.522051 1371 service.go:360] Enabling service "systemd-timesyncd"
Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: W0910 18:10:54.689371 1371 executor.go:139] error running task "BootstrapClientTask/BootstrapClient" (9m30s remaining to succeed): bootstrap returned status code 403: failed to verify token: received status code 403 from STS: <ErrorResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: I0910 18:10:54.689404 1371 executor.go:111] Tasks: 73 done / 81 total; 1 can run
Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: I0910 18:10:54.689426 1371 executor.go:186] Executing task "BootstrapClientTask/BootstrapClient": BootstrapClientTask
Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: W0910 18:10:54.963891 1371 executor.go:139] error running task "BootstrapClientTask/BootstrapClient" (9m29s remaining to succeed): bootstrap returned status code 403: failed to verify token: received status code 403 from STS: <ErrorResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
Sep 10 18:10:54 ip-172-17-8-239 nodeup[1371]: I0910 18:10:54.963917 1371 executor.go:155] No progress made, sleeping before retrying 1 task(s)
Sep 10 18:11:04 ip-172-17-8-239 nodeup[1371]: I0910 18:11:04.965107 1371 executor.go:111] Tasks: 73 done / 81 total; 1 can run
Sep 10 18:11:04 ip-172-17-8-239 nodeup[1371]: I0910 18:11:04.965157 1371 executor.go:186] Executing task "BootstrapClientTask/BootstrapClient": BootstrapClientTask
Sep 10 18:11:05 ip-172-17-8-239 nodeup[1371]: W0910 18:11:05.268361 1371 executor.go:139] error running task "BootstrapClientTask/BootstrapClient" (9m19s remaining to succeed): bootstrap returned status code 403: failed to verify token: received status code 403 from STS: <ErrorResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
Sep 10 18:11:05 ip-172-17-8-239 nodeup[1371]: I0910 18:11:05.268387 1371 executor.go:155] No progress made, sleeping before retrying 1 task(s)
...
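(The output above can be inspected on an affected node via the systemd journal; the exact command used to capture it is an assumption.)
journalctl -u kops-configuration.service --no-pager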
6. What did you expect to happen? New nodes to join the cluster.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2019-11-19T15:28:59Z"
  generation: 23
  name: name.cluster.com
spec:
  additionalPolicies:
    master: '[{"Effect":"Allow","Action":["sts:AssumeRole"],"Resource":["arn:aws:iam::000000000000:role/elasticsearch_logs"]}]'
    node: '[{"Effect":"Allow","Action":["sts:AssumeRole"],"Resource":["arn:aws:iam::000000000000:role/elasticsearch_logs"]}]'
  additionalSans:
  - name.cluster.com.othername.com
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://state-store/name.cluster.com
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 10.119.1.0/24
  kubernetesVersion: 1.21.4
  masterInternalName: api.internal.name.cluster.com
  masterPublicName: api.name.cluster.com
  networkCIDR: 172.16.0.0/22
  networkID: vpc-00000000000000000
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 10.119.1.0/24
  subnets:
  - cidr: 172.16.0.0/27
    egress: nat-00000000000000000
    name: us-west-2a
    type: Private
    zone: us-west-2a
  - cidr: 172.16.0.32/27
    egress: nat-11111111111111111
    name: us-west-2b
    type: Private
    zone: us-west-2b
  - cidr: 172.16.0.64/27
    egress: nat-22222222222222222
    name: us-west-2c
    type: Private
    zone: us-west-2c
  - cidr: 172.16.0.96/28
    name: utility-us-west-2a
    type: Utility
    zone: us-west-2a
  - cidr: 172.16.0.112/28
    name: utility-us-west-2b
    type: Utility
    zone: us-west-2b
  - cidr: 172.16.0.128/28
    name: utility-us-west-2c
    type: Utility
    zone: us-west-2c
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-11-19T15:28:59Z"
  generation: 17
  labels:
    kops.k8s.io/cluster: name.cluster.com
  name: master-us-west-2c
spec:
  additionalSecurityGroups:
  - sg-00000000000000000
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210907
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2c
  role: Master
  rootVolumeEncryption: true
  rootVolumeOptimization: true
  rootVolumeSize: 32
  subnets:
  - us-west-2c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-11-19T15:29:00Z"
  generation: 21
  labels:
    kops.k8s.io/cluster: name.cluster.com
  name: nodes
spec:
  additionalSecurityGroups:
  - sg-00000000000000000
  - sg-11111111111111111
  detailedInstanceMonitoring: true
  externalLoadBalancers:
  - targetGroupArn: arn:aws:elasticloadbalancing:us-west-2:000000000000:targetgroup/tg1/0000000000000000
  - targetGroupArn: arn:aws:elasticloadbalancing:us-west-2:000000000000:targetgroup/tg2/1111111111111111
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210907
  machineType: r5.2xlarge
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  rootVolumeEncryption: true
  rootVolumeOptimization: true
  rootVolumeSize: 100
  subnets:
  - us-west-2a
  - us-west-2b
  - us-west-2c
  suspendProcesses:
  - AZRebalance
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
n/a
9. Anything else we need to know? After manually updating the nodes’ “Launch template” to use the previously existing version (created with kops 1.20.2 / k8s 1.20.9) instead of “Latest”, the newly created nodes (running k8s 1.20.9) are still able to join the cluster.
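A hypothetical AWS CLI equivalent of that manual console change, pinning the nodes ASG to a fixed launch template version instead of “Latest” (the ASG/template name and version number are placeholders):
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name nodes.name.cluster.com \
  --launch-template "LaunchTemplateName=nodes.name.cluster.com,Version=7"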
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 30 (9 by maintainers)
For whoever faces this issue and is in the disastrous position of not being able to operate a production cluster: downgrading kops to 1.20.2 (kops itself, not only k8s) fixes the issue. You can use this command (after downloading kops 1.20.2), then recreate the new worker nodes using the new launch template.
And for others: please DO NOT use kops 1.21 if you don’t want to face this issue like me.
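A minimal sketch of that downgrade path, assuming a Linux workstation; the binary URL follows the standard kops release layout, and the exact invocation is an assumption rather than the commenter’s literal command:
curl -Lo kops-1.20.2 https://github.com/kubernetes/kops/releases/download/v1.20.2/kops-linux-amd64
chmod +x kops-1.20.2
./kops-1.20.2 update cluster name.cluster.com --yes   # re-render the launch templates with kops 1.20.2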
This may happen if you roll the nodes before the masters (or if the cluster autoscaler spins up a new node before all masters have rolled).
The reason for this is the switch to regional STS endpoints. See https://github.com/kubernetes/kops/pull/12043 and the associated issue.
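If you want to check whether the regional STS endpoint is the problem, one quick test from an affected node is to compare the legacy global endpoint with the regional one (assumes the AWS CLI is available on the node and the region is us-west-2):
aws sts get-caller-identity --endpoint-url https://sts.amazonaws.com             # legacy global endpoint
aws sts get-caller-identity --endpoint-url https://sts.us-west-2.amazonaws.com   # regional endpoint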
@caiconkhicon I didn’t experience any issues updating from kops 1.19 (k8s 1.19) to kops 1.20 (k8s 1.20), and it was during that update that I also moved from `docker` to `containerd` and from `gp2` to `gp3` (maybe you can try this gradual update instead of going from 1.19 to 1.21). The only issues I got are the ones mentioned in this issue when updating kops 1.20 (k8s 1.20) to kops 1.21 (k8s 1.21): the `kubectl update` triggered by `protokube` keeps failing. I still believe this should be sorted somehow (even if it’s just by documenting that the flannel daemonset must be manually deleted) as it will certainly block future flannel daemonset updates for clusters created with kops <= 1.20, since the daemonset definition can’t be “patched” due to the immutable selector field update.
NOTE 1: @olemarkus, another thing I tried was to reboot (instead of terminating) the updated master node (right after the first rollout), but this didn’t seem to do the trick, as new updated nodes were still not able to join the cluster. Terminating the master instance seemed to be the only way to trigger whatever made it work again and, just guessing here: maybe there’s some etcd configuration/entry that only gets refreshed when an updated master gets “refreshed” for the second time?
NOTE 2: While I was experiencing the issues I also created a cluster from scratch using kops 1.22 with the exact same specs and everything started just fine, so it seems to be related to the cluster update only.
Hi @olemarkus, I actually always watch all pods running in the cluster (`kubectl get pod -Aw`) after the `kops update cluster --yes`, at least for a couple of minutes, and wait for any pods to be recreated (usually dns and flannel) before starting the rollout.
As a side note, and maybe not related to this specific issue, I think it’s worth mentioning that there was a change in kops 1.21 which broke the `kube-flannel-ds` daemonset update, as the `selector` immutable field got updated: `role.kubernetes.io/networking: "1"` is no longer present - commit b44065c, file `upup/models/cloudup/resources/addons/networking.flannel/k8s-1.12.yaml.template`. The fix/workaround is actually pretty simple, as one just needs to delete the flannel daemonset itself (`kubectl --namespace=kube-system delete daemonsets.apps kube-flannel-ds`) and it will get recreated in less than a minute by `protokube`.
@caiconkhicon, if @olemarkus’ command doesn’t do it, just try terminating the master instance(s) directly from the AWS EC2 console or the AWS CLI - the ASG will bring in a new master node, and after the new master joins the cluster, updated nodes should be able to join the cluster too.
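For the AWS CLI route, a minimal sketch (the instance id is a placeholder):
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0   # the ASG brings up a replacement master
kubectl get nodes -w                                             # wait for the new master, then the workers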
Some further updates after upgrading more than 15 Kubernetes clusters: this happened to me in every cluster that had a single master node and used flannel-vxlan networking. Multi-master clusters running flannel-vxlan and single-master clusters not using flannel (using kubenet) were not affected.
@salavessa: I am facing the exact same issue. Can you please specify the exact commands that you used? In my case, I also rolled all the master nodes already, but I still face the issue.
Hi @olemarkus
The master node (the cluster has only one) was actually updated, up and running (for many hours) before the new nodes were created, so I can only imagine that something during master startup failed to refresh/update whatever prevents nodes from joining with the reported error.
I’d like to thank you for the tip: I am now able to have the new nodes join the cluster, but I had to force a rolling-update of the existing master node.
Thanks!
**UPDATE:** Just wanted to add that the exact same issue happened again on a different cluster, and forcing a rolling-update of the master node also resolved it, so I’ll be adding that extra step to this kops update for every cluster.
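For reference, forcing the master instance group to roll even when kops reports no changes might look like this (instance group name taken from the manifest above):
kops rolling-update cluster name.cluster.com --instance-group master-us-west-2c --force --yes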