kops: Unable to upgrade existing cluster from kops 1.22 to 1.23, master node name mismatch
/kind bug
1. What kops version are you running? The command kops version will display
this information.
Version 1.23.0 (git-a067cd7742a497a5c512762b9880664d865289f1)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:38:05Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T20:59:07Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using? AWS
4. What commands did you run? What is the simplest way to reproduce this issue? kops rolling-update cluster --instance-group-roles master --yes
5. What happened after the commands executed? A new master node was started, but it was not able to join the cluster.
6. What did you expect to happen? The new master would successfully join the cluster.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
It is too big to provide and contains sensitive info, but I can provide the needed parts upon request.
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else we need to know? The issue appeared while I was upgrading kops on an existing cluster from 1.22.4 to 1.23.0. The new manifests were successfully pushed to S3, but when I started rolling the masters, the new master was not able to join the cluster. I connected to that new master, looked at the kubelet logs, and saw the following messages, which I believe show why the new master was not starting properly:
Mar 11 11:26:34 ip-10-209-111-17 kubelet[7346]: I0311 11:26:34.989673 7346 csi_plugin.go:1024] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-10-209-111-17.eu-north-1.compute.internal" is forbidden: User "system:node:ip-10-209-111-17.domain.net" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.009924 7346 kubelet_node_status.go:362] "Setting node annotation to enable volume controller attach/detach"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.009965 7346 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="c5.4xlarge"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.009988 7346 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="c5.4xlarge"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.010008 7346 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="eu-north-1a"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.010024 7346 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="eu-north-1a"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.010045 7346 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="eu-north-1"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.010062 7346 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="eu-north-1"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.011200 7346 kubelet_node_status.go:554] "Recording event message for node" node="ip-10-209-111-17.eu-north-1.compute.internal" event="NodeHasSufficientMemory"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.011224 7346 kubelet_node_status.go:554] "Recording event message for node" node="ip-10-209-111-17.eu-north-1.compute.internal" event="NodeHasNoDiskPressure"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.011236 7346 kubelet_node_status.go:554] "Recording event message for node" node="ip-10-209-111-17.eu-north-1.compute.internal" event="NodeHasSufficientPID"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.011278 7346 kubelet_node_status.go:71] "Attempting to register node" node="ip-10-209-111-17.eu-north-1.compute.internal"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: E0311 11:26:35.020149 7346 kubelet_node_status.go:93] "Unable to register node with API server" err="nodes \"ip-10-209-111-17.eu-north-1.compute.internal\" is forbidden: node \"ip-10-209-111-17.domain.net\" is not allowed to modify node \"ip-10-209-111-17.eu-north-1.compute.internal\"" node="ip-10-209-111-17.eu-north-1.compute.internal"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: E0311 11:26:35.052287 7346 kubelet.go:2291] "Error getting node" err="node \"ip-10-209-111-17.eu-north-1.compute.internal\" not found"
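As a side note for anyone debugging a similar mismatch: the Node authorizer takes the node's identity from the system:node:<name> common name in the kubelet's client certificate, so comparing that CN against the node name the kubelet tries to register shows the conflict directly. A minimal sketch, assuming the default kubelet certificate-rotation path (which may differ on kops-provisioned masters):

# Identity the kubelet authenticates as (CN of its client certificate);
# the path assumes the default cert-rotation layout and may differ.
sudo openssl x509 -noout -subject -in /var/lib/kubelet/pki/kubelet-client-current.pem

# Hostname override actually passed to the running kubelet.
ps -ww -C kubelet -o args= | tr ' ' '\n' | grep -- --hostname-override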
I checked the changelog again, and I think this change could be the root cause of the issue: "Use AWS metadata to retrieve local-hostname in nodeup". On kops 1.22.4 I see the following in the kubelet log:
Mar 10 13:19:56 ip-10-209-103-218 kubelet[7232]: I0310 13:19:56.877147 7232 flags.go:59] FLAG: --hostname-override="ip-10-209-103-218.eu-north-1.compute.internal"
and on kops 1.23.0 it is:
Mar 11 11:22:57 ip-10-209-111-17 kubelet[7346]: I0311 11:22:57.523542 7346 flags.go:59] FLAG: --hostname-override="ip-10-209-111-17.domain.net"
This change was introduced in kops 1.23.0-beta.1, and I can confirm that kops 1.23.0-alpha.2 doesn’t have the issue.
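For reference, the divergence between the two names can be seen directly from the instance: the metadata local-hostname (which the change above makes nodeup use) follows the VPC DHCP options set's domain-name, while the EC2 API PrivateDnsName keeps the ip-*.<region>.compute.internal form that the existing nodes are registered under. A rough sketch to compare them, run on the affected master, assuming the AWS CLI is available there (the IMDSv2 token step is only needed when IMDSv1 is disabled):

# IMDSv2 token (skip if IMDSv1 is enabled on the instance).
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token -H "X-aws-ec2-metadata-token-ttl-seconds: 300")

# What nodeup 1.23 picks up; with a custom DHCP domain-name this returns ip-*.domain.net.
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/local-hostname

# The EC2 private DNS name, i.e. the ip-*.eu-north-1.compute.internal form.
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 describe-instances --instance-ids "$INSTANCE_ID" --query 'Reservations[0].Instances[0].PrivateDnsName' --output text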
Could you please fix the upgrade issue for existing clusters one way or another? If you need any additional info, I would be happy to provide it.
I’m sorry, but we have already been using it for a long time (and the non-kops part of the VPC relies on it), and it was working fine before kops 1.23.
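(For context, the custom domain here comes from the VPC's DHCP options set; an illustrative way to check what is configured there, with a placeholder VPC ID and assuming the AWS CLI:)

# Placeholder VPC ID; replace with the cluster's VPC.
VPC_ID=vpc-0123456789abcdef0
DOPT_ID=$(aws ec2 describe-vpcs --vpc-ids "$VPC_ID" --query 'Vpcs[0].DhcpOptionsId' --output text)
# The domain-name entry here is what the instances' local-hostname is derived from.
aws ec2 describe-dhcp-options --dhcp-options-ids "$DOPT_ID" --query 'DhcpOptions[0].DhcpConfigurations'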
Then the pull request above should be rolled back until 1.24, right? Right now this change breaks upgrades of existing clusters, and it is not even listed in the breaking changes section.