harvester: [BUG] Unable to join new v1.2.0 node to v1.2.0 cluster

Describe the bug Provision fresh install v1.2.0 and join to existing v1.2.0 cluster

To Reproduce Steps to reproduce the behavior:

  1. Upgrade existing cluster to v1.2.0
  2. Install from ISO a new v1.2.0 node and configure to join cluster.

Expected behavior Joins cluster without error.

Support bundle It does not make it past rancherd bootstrap, no bundle to provide

Environment

  • Harvester ISO version: v1.2.0
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): Dell Optiplex 3080 i5-12500t nvme 2TB let’s encrypt wildcard cert without IPss in SAN

Additional context It spams the logs with:

Sep 10 14:33:28 hv03 systemd[1]: Starting Rancher Bootstrap...
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Loading config file [/usr/share/rancher/rancherd/config.yaml.d/50-defaults.yaml]"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Loading config file [/usr/share/rancher/rancherd/config.yaml.d/91-harvester-bootstrap-repo.yaml]"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Loading config file [/etc/rancher/rancherd/config.yaml]"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Bootstrapping Rancher (v2.7.5/v1.25.9+rke2r1)"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Writing plan file to /var/lib/rancher/rancherd/plan/plan.json"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Applying plan with checksum "
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20230910-143328-applied.plan/_0"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Running command: /usr/bin/env [sh /var/lib/rancher/rancherd/install.sh]"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="[stdout]: [INFO]  Using default agent configuration directory /etc/rancher/agent"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="[stdout]: [INFO]  Using default agent var directory /var/lib/rancher/agent"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="[stderr]: [WARN]  /usr/local is read-only or a mount point; installing to /opt/rancher-system-agent"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stdout]: [INFO]  Determined CA is necessary to connect to Rancher"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stdout]: [INFO]  Successfully downloaded CA certificate"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stdout]: [INFO]  Value from https://10.21.10.10/cacerts is an x509 certificate"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stderr]: curl: (60) SSL: no alternative certificate subject name matches target host name '10.21.10.10'"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stderr]: More details here: https://curl.se/docs/sslcerts.html"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stderr]: "

Checking install.sh I see:

hv03:/oem # head -n10 /var/lib/rancher/rancherd/install.sh
#!/usr/bin/env sh

CATTLE_AGENT_BINARY_BASE_URL="https://10.21.10.10/assets"
CATTLE_SERVER=https://10.21.10.10
CATTLE_CA_CHECKSUM="798db3fbaa5fe211159fa1cddb8e1e2cdbca12e3d355a491e6bf8134d2f14272"

This is the IP of the cluster VIP - harvester.slack.house. The IP was never entered during installation. Only ‘harvester.slack.house’ for vip.

No proxy was configured, or remote harvester config.

I do not understand why rancherd is placing these two variables at the top of install.sh, or why they are converting DNS to an IP 😦

Checking a v1.1.2 file only the first variable is set, and using the same value from the installer (dns).

I have not found a work around.

About this issue

  • Original URL
  • State: closed
  • Created 10 months ago
  • Reactions: 1
  • Comments: 17 (9 by maintainers)

Most upvoted comments

After syncing with @starbops - the work-around looks great with v1.2.0.

One thing noticed is that rancherd.service did succeed in coming up on join, and the node was able to join the cluster as an additional host:

Oct 25 03:22:02 thinkcentre-test systemd[1]: Starting Rancher Bootstrap...
Oct 25 03:22:02 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:02Z" level=info msg="Loading config file [/usr/share/rancher/rancherd/config.yaml.d/50-defaults.yaml]"
Oct 25 03:22:02 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:02Z" level=info msg="Loading config file [/usr/share/rancher/rancherd/config.yaml.d/91-harvester-bootstrap-repo.yaml]"
Oct 25 03:22:02 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:02Z" level=info msg="Loading config file [/etc/rancher/rancherd/config.yaml]"
Oct 25 03:22:02 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:02Z" level=info msg="Bootstrapping Rancher (v2.7.5/v1.25.9+rke2r1)"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="Writing plan file to /var/lib/rancher/rancherd/plan/plan.json"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="Applying plan with checksum "
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20231025-032203-applied.plan/_0"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="Running command: /usr/bin/env [sh /var/lib/rancher/rancherd/install.sh]"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stdout]: [INFO]  Using default agent configuration directory /etc/rancher/agent"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stdout]: [INFO]  Using default agent var directory /var/lib/rancher/agent"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stderr]: [WARN]  /usr/local is read-only or a mount point; installing to /opt/rancher-system-agent"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stdout]: [INFO]  Determined CA is necessary to connect to Rancher"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stdout]: [INFO]  Successfully downloaded CA certificate"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stdout]: [INFO]  Value from https://192.168.9.239/cacerts is an x509 certificate"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stdout]: [INFO]  Successfully tested Rancher connection"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stdout]: [INFO]  Downloading rancher-system-agent binary from https://192.168.9.239/assets/rancher-system-agent-amd64"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stdout]: [INFO]  Successfully downloaded the rancher-system-agent binary."
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stdout]: [INFO]  Downloading rancher-system-agent-uninstall.sh script from https://192.168.9.239/assets/system-agent-uninstall.sh"
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stdout]: [INFO]  Successfully downloaded the rancher-system-agent-uninstall.sh script."
Oct 25 03:22:03 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:03Z" level=info msg="[stdout]: [INFO]  Generating Cattle ID"
Oct 25 03:22:04 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:04Z" level=info msg="[stdout]: [INFO]  Successfully downloaded Rancher connection information"
Oct 25 03:22:04 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:04Z" level=info msg="[stdout]: [INFO]  systemd: Creating service file"
Oct 25 03:22:04 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:04Z" level=info msg="[stdout]: [INFO]  Creating environment file /etc/systemd/system/rancher-system-agent.env"
Oct 25 03:22:04 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:04Z" level=info msg="[stdout]: [INFO]  Enabling rancher-system-agent.service"
Oct 25 03:22:04 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:04Z" level=info msg="[stderr]: Created symlink /etc/systemd/system/multi-user.target.wants/rancher-system-agent.service → /etc/systemd/system/rancher-system-agent.s>
Oct 25 03:22:04 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:04Z" level=info msg="[stdout]: [INFO]  Starting/restarting rancher-system-agent.service"
Oct 25 03:22:04 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:04Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20231025-032203-applied.plan/_1"
Oct 25 03:22:04 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:04Z" level=info msg="Running command: /usr/bin/rancherd [probe]"
Oct 25 03:22:04 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:04Z" level=info msg="[stderr]: time=\"2023-10-25T03:22:04Z\" level=info msg=\"Running probes defined in /var/lib/rancher/rancherd/plan/plan.json\""
Oct 25 03:22:06 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:06Z" level=info msg="[stderr]: time=\"2023-10-25T03:22:06Z\" level=info msg=\"Probe [kubelet] is unhealthy\""
Oct 25 03:22:10 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:10Z" level=info msg="[stderr]: time=\"2023-10-25T03:22:10Z\" level=info msg=\"Probe [kubelet] is healthy\""
Oct 25 03:22:10 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:10Z" level=info msg="[stderr]: time=\"2023-10-25T03:22:10Z\" level=info msg=\"All probes are healthy\""
Oct 25 03:22:10 thinkcentre-test rancherd[2022]: time="2023-10-25T03:22:10Z" level=info msg="Successfully Bootstrapped Rancher (v2.7.5/v1.25.9+rke2r1)"
Oct 25 03:22:10 thinkcentre-test systemd[1]: rancherd.service: Deactivated successfully.
Oct 25 03:22:10 thinkcentre-test systemd[1]: Finished Rancher Bootstrap.

But still rancher-system-agent.service did see the similar behavior where the workaround would need to be implemented to succeed:

Oct 25 03:32:51 thinkcentre-test systemd[1]: rancher-system-agent.service: Main process exited, code=exited, status=1/FAILURE
Oct 25 03:32:51 thinkcentre-test systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
Oct 25 03:32:56 thinkcentre-test systemd[1]: rancher-system-agent.service: Scheduled restart job, restart counter is at 116.
Oct 25 03:32:56 thinkcentre-test systemd[1]: Stopped Rancher System Agent.
Oct 25 03:32:56 thinkcentre-test systemd[1]: Started Rancher System Agent.
Oct 25 03:32:56 thinkcentre-test rancher-system-agent[21595]: time="2023-10-25T03:32:56Z" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Oct 25 03:32:56 thinkcentre-test rancher-system-agent[21595]: time="2023-10-25T03:32:56Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Oct 25 03:32:56 thinkcentre-test rancher-system-agent[21595]: time="2023-10-25T03:32:56Z" level=info msg="Starting remote watch of plans"
Oct 25 03:32:56 thinkcentre-test rancher-system-agent[21595]: time="2023-10-25T03:32:56Z" level=info msg="Initial connection to Kubernetes cluster failed with error Get \"https://192.168.9.239/version\": x509: certificate signed by unknown authority, removing CA data and trying again"
Oct 25 03:32:56 thinkcentre-test rancher-system-agent[21595]: time="2023-10-25T03:32:56Z" level=fatal msg="error while connecting to Kubernetes cluster with nullified CA data: Get \"https://192.168.9.239/version\": x509: certificate signed by unknown authority"
Oct 25 03:32:56 thinkcentre-test systemd[1]: rancher-system-agent.service: Main process exited, code=exited, status=1/FAILURE
Oct 25 03:32:56 thinkcentre-test systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
Oct 25 03:33:01 thinkcentre-test systemd[1]: rancher-system-agent.service: Scheduled restart job, restart counter is at 117.
Oct 25 03:33:01 thinkcentre-test systemd[1]: Stopped Rancher System Agent.
Oct 25 03:33:01 thinkcentre-test systemd[1]: Started Rancher System Agent.
Oct 25 03:33:01 thinkcentre-test rancher-system-agent[21674]: time="2023-10-25T03:33:01Z" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Oct 25 03:33:01 thinkcentre-test rancher-system-agent[21674]: time="2023-10-25T03:33:01Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Oct 25 03:33:01 thinkcentre-test rancher-system-agent[21674]: time="2023-10-25T03:33:01Z" level=info msg="Starting remote watch of plans"
Oct 25 03:33:01 thinkcentre-test rancher-system-agent[21674]: time="2023-10-25T03:33:01Z" level=info msg="Initial connection to Kubernetes cluster failed with error Get \"https://192.168.9.239/version\": x509: certificate signed by unknown authority, removing CA data and trying again"
Oct 25 03:33:01 thinkcentre-test rancher-system-agent[21674]: time="2023-10-25T03:33:01Z" level=fatal msg="error while connecting to Kubernetes cluster with nullified CA data: Get \"https://192.168.9.239/version\": x509: certificate signed by unknown authority"

Attaching the logs from both systemd services. Still need to verify with v1.1.2 - so will be doing that now 😄 👍

rancher-system-agent.service.log rancherd.service.log

Testing with v1.1.2, this looks good as well 😄 - I’ll go ahead and close this out 😄 👍 - thanks @starbops !

Test Plan

Basically, the following steps are distilled from harvester/docs#462. The issue happens on v1.1.2, too. So, ideally, we’d like to verify the documented workarounds work on v1.2 and v1.1 branches respectively.

  1. Prepare a v1.2.0/v1.1.2 Harvester cluster (single node is good enough)
  2. Prepare a set of private key, public certificate, and CA, and set up the ssl-certificates setting with them (for the commands to generate those files, you can refer to the step 2)
  3. Join a new node with the same version, finish the installation
  4. After the node rebooted, get access to the node via console or SSH and run the following command to add the CA to the node
# prepare the CA as embedded-rancher-ca.pem on the nodes
$ sudo cp embedded-rancher-ca.pem /etc/pki/trust/anchors/
$ sudo update-ca-certificates
  1. The node should join the cluster successfully

I followed the following steps to replicate the issue:

  • create a private CA and sign a cert for harvester.$VIP.sslip.io
  • update existing harvester v1.2.0 node to use the new certificate.
  • join a new node v1.2.0 to existing cluster. I used the harvester.$VIP.sslip.io address for VIP.
  • rancherd tries to use the server-url from embedded rancher to join the node, which is currently pointing to the VIP, and results in the SAN error.
Sep 18 04:52:58 upgrade-path-2 rancherd[2569]: time="2023-09-18T04:52:58Z" level=info msg="[stderr]: establish a secure connection to it. To learn more about this situation and"
Sep 18 04:52:58 upgrade-path-2 rancherd[2569]: time="2023-09-18T04:52:58Z" level=info msg="[stderr]: how to fix it, please visit the web page mentioned above."
Sep 18 04:52:58 upgrade-path-2 rancherd[2569]: time="2023-09-18T04:52:58Z" level=info msg="[stderr]: [ERROR]  000 received while testing Rancher connection. Sleeping for 5 seconds and trying again"
Sep 18 04:53:03 upgrade-path-2 rancherd[2569]: time="2023-09-18T04:53:03Z" level=info msg="[stderr]: curl: (60) SSL: no alternative certificate subject name matches target host name '192.168.122.237'"
Sep 18 04:53:03 upgrade-path-2 rancherd[2569]: time="2023-09-18T04:53:03Z" level=info msg="[stderr]: More details here: https://curl.se/docs/sslcerts.html"
Sep 18 04:53:03 upgrade-path-2 rancherd[2569]: time="2023-09-18T04:53:03Z" level=info msg="[stderr]: "
  • To work around this, update the setting for server-url, as follows kubectl edit setting.managenent server-url to the DNS record associated with the new certificate.
  • Restart rancherd on the joining node systemctl restart rancherd
  • Now rancher will use the join script with correct DNS record, for which a certificate already exists on ingress.
upgrade-path-2:~ # journalctl -fu rancherd
Sep 18 04:55:07 upgrade-path-2 rancherd[3163]: time="2023-09-18T04:55:07Z" level=info msg="[stdout]: [INFO]  Starting/restarting rancher-system-agent.service"
Sep 18 04:55:08 upgrade-path-2 rancherd[3163]: time="2023-09-18T04:55:08Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20230918-045504-applied.plan/_1"
Sep 18 04:55:08 upgrade-path-2 rancherd[3163]: time="2023-09-18T04:55:08Z" level=info msg="Running command: /usr/bin/rancherd [probe]"
Sep 18 04:55:08 upgrade-path-2 rancherd[3163]: time="2023-09-18T04:55:08Z" level=info msg="[stderr]: time=\"2023-09-18T04:55:08Z\" level=info msg=\"Running probes defined in /var/lib/rancher/rancherd/plan/plan.json\""
Sep 18 04:55:09 upgrade-path-2 rancherd[3163]: time="2023-09-18T04:55:09Z" level=info msg="[stderr]: time=\"2023-09-18T04:55:09Z\" level=info msg=\"Probe [kubelet] is unhealthy\""
Sep 18 04:55:15 upgrade-path-2 rancherd[3163]: time="2023-09-18T04:55:15Z" level=info msg="[stderr]: time=\"2023-09-18T04:55:15Z\" level=info msg=\"Probe [kubelet] is healthy\""
Sep 18 04:55:15 upgrade-path-2 rancherd[3163]: time="2023-09-18T04:55:15Z" level=info msg="[stderr]: time=\"2023-09-18T04:55:15Z\" level=info msg=\"All probes are healthy\""
Sep 18 04:55:15 upgrade-path-2 rancherd[3163]: time="2023-09-18T04:55:15Z" level=info msg="Successfully Bootstrapped Rancher (v2.7.5/v1.25.9+rke2r1)"
Sep 18 04:55:15 upgrade-path-2 systemd[1]: rancherd.service: Deactivated successfully.
Sep 18 04:55:15 upgrade-path-2 systemd[1]: Finished Rancher Bootstrap.