rancher: node upgrade failed with 'node not found' in v2.6.4

Rancher Server Setup

  • Rancher version: v2.6.4
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): helm/v1.18.18 provisioned by rke
  • Proxy/Cert Details: N/A

Information about the Cluster

  • Kubernetes version: v1.20.11
  • Cluster Type (Local/Downstream):
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Custom Downstream running on EC2

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • If custom, define the set of permissions: Admin

Describe the bug

We tried to operate the cluster via Terraform, and the operation got stuck with the error below.

[controlPlane] Failed to upgrade Control Plane: [[[controlplane] Error getting node ip-xxxx-us-east-2.compute.internal: "ip-xxx-us-east-2.compute.internal" not found]]

The node actually exists. It turns out that its ‘kubernetes.io/hostname’ label holds the short hostname, while Rancher checks for the node using the full hostname. Prior to v2.6.4, adding the parameter ‘--node-name "$(hostname -f)"’ to the node registration command worked fine. In Rancher v2.6.4 it no longer works and returns the error above, so we have to patch the node label manually to unblock the operation. We found a commit that removed the ‘hostname-override’ parameter in rke 1.3.8: https://github.com/rancher/rke/pull/2803
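The mismatch can be illustrated with a minimal sketch. It assumes the existence check matches the requested name against the ‘kubernetes.io/hostname’ label, which is our reading of the behaviour and not confirmed. The function name, the dictionary layout, and the node names are hypothetical and are not taken from the rke code.

```python
HOSTNAME_LABEL = "kubernetes.io/hostname"


def find_node_by_hostname_label(nodes, requested_name):
    """Return the first node whose hostname label equals requested_name (case-insensitive)."""
    wanted = requested_name.lower()
    for node in nodes:
        label = node.get("labels", {}).get(HOSTNAME_LABEL, "")
        if label.lower() == wanted:
            return node
    return None


# Hypothetical node: registered under its full name, but labelled with the short one.
nodes = [
    {
        "name": "ip-10-0-0-1.us-east-2.compute.internal",
        "labels": {HOSTNAME_LABEL: "ip-10-0-0-1"},
    }
]

# Looking up the full hostname misses, although the node exists.
print(find_node_by_hostname_label(nodes, "ip-10-0-0-1.us-east-2.compute.internal"))

# Looking up the short hostname finds it.
print(find_node_by_hostname_label(nodes, "ip-10-0-0-1") is not None)
```

Under this assumption, the lookup only succeeds once the label and the requested name agree, which is why patching the label unblocks the operation.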

To Reproduce

  • Install Rancher v2.6.4
  • Provision a downstream cluster and add ‘--node-name’ to the node registration command to override the short hostname

Result

The ‘kubernetes.io/hostname’ label of the registered node is the short hostname instead of the full hostname, and Rancher may report the error ‘node not found’.
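To spot affected nodes and prepare the manual label patch mentioned in the description, something like the following sketch could be used. It assumes the Node object's name is the full hostname and that the JSON comes from `kubectl get nodes -o json`. The helper name and the sample data are hypothetical. It only prints the `kubectl label` commands and does not run them.

```python
import json

HOSTNAME_LABEL = "kubernetes.io/hostname"


def relabel_commands(node_list_json):
    """Build one `kubectl label` command per node whose hostname label differs from its name."""
    commands = []
    for item in json.loads(node_list_json).get("items", []):
        meta = item.get("metadata", {})
        name = meta.get("name", "")
        label = meta.get("labels", {}).get(HOSTNAME_LABEL, "")
        if name and label.lower() != name.lower():
            commands.append(
                f"kubectl label node {name} {HOSTNAME_LABEL}={name} --overwrite"
            )
    return commands


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample = json.dumps(
    {
        "items": [
            {
                "metadata": {
                    "name": "ip-10-0-0-1.us-east-2.compute.internal",
                    "labels": {HOSTNAME_LABEL: "ip-10-0-0-1"},
                }
            },
            {
                "metadata": {
                    "name": "ip-10-0-0-2.us-east-2.compute.internal",
                    "labels": {HOSTNAME_LABEL: "ip-10-0-0-2.us-east-2.compute.internal"},
                }
            },
        ]
    }
)

for command in relabel_commands(sample):
    print(command)
```

Printing the commands instead of executing them keeps the sketch read-only, so the list can be reviewed before anything is changed on the cluster.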

Expected Result

Rancher should handle the inconsistency between these values gracefully. If Rancher generates ‘node.status.hostnameOverride’ from the “RequestedHostName”, it shouldn’t remove the hostname the user specified when generating the node plan. Otherwise, Rancher should skip this check during validation. I’m not sure what the story behind this design is; I think it is related to the code below, but feel free to correct me:

  • nodecreation: https://github.com/rancher/rancher/blob/v2.6.4/pkg/controllers/management/node/controller.go#L124
  • nodeplan: https://github.com/rancher/rke/blob/v1.3.8/cluster/plan.go#L477
  • validation: https://github.com/rancher/rke/blob/v1.3.8/k8s/node.go#L55

Screenshots

Additional context

SURE-4375 SURE-4912

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 17 (14 by maintainers)

Most upvoted comments

Setting the milestone to v2.6.6 to increase the visibility and add it to our team’s queue; not a commitment to deliver the fix in 2.6.6.

Available to test with RKE v1.3.18-rc3 and Rancher v2.6-head once https://drone-publish.rancher.io/rancher/rancher/8841 is green.