kops: Unable to upgrade existing cluster from kops 1.22 to 1.23, master node name mismatch
/kind bug
1. What kops version are you running? The command kops version will display
this information.
Version 1.23.0 (git-a067cd7742a497a5c512762b9880664d865289f1)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:38:05Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T20:59:07Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using? AWS
4. What commands did you run? What is the simplest way to reproduce this issue? kops rolling-update cluster --instance-group-roles master --yes
5. What happened after the commands executed? A new master node was started, but it was not able to join the cluster.
6. What did you expect to happen? The new master would successfully join the cluster.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
It is too big to provide and contains sensitive info, but I can provide the needed parts upon request.
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else we need to know? The issue appeared while I was upgrading kops on an existing cluster from 1.22.4 to 1.23.0. The new manifests were successfully pushed to S3, but when I started rolling the masters, the new master was not able to join the cluster. I connected to that new master, looked at the kubelet logs, and saw the following messages, which I believe show why the new master was not starting properly:
Mar 11 11:26:34 ip-10-209-111-17 kubelet[7346]: I0311 11:26:34.989673 7346 csi_plugin.go:1024] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-10-209-111-17.eu-north-1.compute.internal" is forbidden: User "system:node:ip-10-209-111-17.domain.net" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.009924 7346 kubelet_node_status.go:362] "Setting node annotation to enable volume controller attach/detach"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.009965 7346 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="c5.4xlarge"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.009988 7346 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="c5.4xlarge"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.010008 7346 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="eu-north-1a"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.010024 7346 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="eu-north-1a"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.010045 7346 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="eu-north-1"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.010062 7346 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="eu-north-1"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.011200 7346 kubelet_node_status.go:554] "Recording event message for node" node="ip-10-209-111-17.eu-north-1.compute.internal" event="NodeHasSufficientMemory"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.011224 7346 kubelet_node_status.go:554] "Recording event message for node" node="ip-10-209-111-17.eu-north-1.compute.internal" event="NodeHasNoDiskPressure"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.011236 7346 kubelet_node_status.go:554] "Recording event message for node" node="ip-10-209-111-17.eu-north-1.compute.internal" event="NodeHasSufficientPID"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: I0311 11:26:35.011278 7346 kubelet_node_status.go:71] "Attempting to register node" node="ip-10-209-111-17.eu-north-1.compute.internal"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: E0311 11:26:35.020149 7346 kubelet_node_status.go:93] "Unable to register node with API server" err="nodes \"ip-10-209-111-17.eu-north-1.compute.internal\" is forbidden: node \"ip-10-209-111-17.domain.net\" is not allowed to modify node \"ip-10-209-111-17.eu-north-1.compute.internal\"" node="ip-10-209-111-17.eu-north-1.compute.internal"
Mar 11 11:26:35 ip-10-209-111-17 kubelet[7346]: E0311 11:26:35.052287 7346 kubelet.go:2291] "Error getting node" err="node \"ip-10-209-111-17.eu-north-1.compute.internal\" not found"
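As a side note for anyone debugging a similar mismatch: the Node authorizer takes the node's identity from the system:node:<name> common name in the kubelet's client certificate, so comparing that CN against the node name the kubelet tries to register shows the conflict directly. A minimal sketch, assuming the default kubelet certificate-rotation path (which may differ on kops-provisioned masters):

# Identity the kubelet authenticates as (CN of its client certificate);
# the path assumes the default cert-rotation layout and may differ.
sudo openssl x509 -noout -subject -in /var/lib/kubelet/pki/kubelet-client-current.pem

# Hostname override actually passed to the running kubelet.
ps -ww -C kubelet -o args= | tr ' ' '\n' | grep -- --hostname-override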
I checked the changelog again, and I think this change could be the root cause of the issue: "Use AWS metadata to retrieve local-hostname in nodeup". On kops 1.22.4 I see the following in the kubelet log:
Mar 10 13:19:56 ip-10-209-103-218 kubelet[7232]: I0310 13:19:56.877147 7232 flags.go:59] FLAG: --hostname-override="ip-10-209-103-218.eu-north-1.compute.internal"
and on kops 1.23.0 it is:
Mar 11 11:22:57 ip-10-209-111-17 kubelet[7346]: I0311 11:22:57.523542 7346 flags.go:59] FLAG: --hostname-override="ip-10-209-111-17.domain.net"
This change was introduced in kops 1.23.0-beta.1, and I can confirm that kops 1.23.0-alpha.2 doesn’t have the issue.
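For reference, the divergence between the two names can be seen directly from the instance: the metadata local-hostname (which the change above makes nodeup use) follows the VPC DHCP options set's domain-name, while the EC2 API PrivateDnsName keeps the ip-*.<region>.compute.internal form that the existing nodes are registered under. A rough sketch to compare them, run on the affected master, assuming the AWS CLI is available there (the IMDSv2 token step is only needed when IMDSv1 is disabled):

# IMDSv2 token (skip if IMDSv1 is enabled on the instance).
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token -H "X-aws-ec2-metadata-token-ttl-seconds: 300")

# What nodeup 1.23 picks up; with a custom DHCP domain-name this returns ip-*.domain.net.
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/local-hostname

# The EC2 private DNS name, i.e. the ip-*.eu-north-1.compute.internal form.
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 describe-instances --instance-ids "$INSTANCE_ID" --query 'Reservations[0].Instances[0].PrivateDnsName' --output text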
Could you please fix the upgrade issue for existing clusters one way or another? If you need any additional info, I would be happy to provide it.
I’m sorry, but we have already been using it for a long time (and the non-kops part of the VPC relies on it), and it was working fine before kops 1.23.
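(For context, the custom domain here comes from the VPC's DHCP options set; an illustrative way to check what is configured there, with a placeholder VPC ID and assuming the AWS CLI:)

# Placeholder VPC ID; replace with the cluster's VPC.
VPC_ID=vpc-0123456789abcdef0
DOPT_ID=$(aws ec2 describe-vpcs --vpc-ids "$VPC_ID" --query 'Vpcs[0].DhcpOptionsId' --output text)
# The domain-name entry here is what the instances' local-hostname is derived from.
aws ec2 describe-dhcp-options --dhcp-options-ids "$DOPT_ID" --query 'DhcpOptions[0].DhcpConfigurations'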
Then the pull request above should be rolled back until 1.24, right? Right now this change breaks upgrades of existing clusters, and it is not even listed in the breaking changes section.