harvester: [BUG] Unable to join new v1.2.0 node to v1.2.0 cluster
Describe the bug Provisioning a fresh v1.2.0 install and joining it to an existing v1.2.0 cluster fails during rancherd bootstrap.
To Reproduce Steps to reproduce the behavior:
- Upgrade existing cluster to v1.2.0
- Install a new v1.2.0 node from the ISO and configure it to join the cluster.
Expected behavior The node joins the cluster without error.
Support bundle The node does not make it past rancherd bootstrap, so there is no bundle to provide.
Environment
- Harvester ISO version: v1.2.0
- Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): Dell Optiplex 3080, i5-12500t, 2TB NVMe; Let’s Encrypt wildcard cert without IPs in the SAN
Additional context It spams the logs with:
Sep 10 14:33:28 hv03 systemd[1]: Starting Rancher Bootstrap...
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Loading config file [/usr/share/rancher/rancherd/config.yaml.d/50-defaults.yaml]"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Loading config file [/usr/share/rancher/rancherd/config.yaml.d/91-harvester-bootstrap-repo.yaml]"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Loading config file [/etc/rancher/rancherd/config.yaml]"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Bootstrapping Rancher (v2.7.5/v1.25.9+rke2r1)"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Writing plan file to /var/lib/rancher/rancherd/plan/plan.json"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Applying plan with checksum "
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20230910-143328-applied.plan/_0"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="Running command: /usr/bin/env [sh /var/lib/rancher/rancherd/install.sh]"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="[stdout]: [INFO] Using default agent configuration directory /etc/rancher/agent"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="[stdout]: [INFO] Using default agent var directory /var/lib/rancher/agent"
Sep 10 14:33:28 hv03 rancherd[2050]: time="2023-09-10T14:33:28Z" level=info msg="[stderr]: [WARN] /usr/local is read-only or a mount point; installing to /opt/rancher-system-agent"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stdout]: [INFO] Determined CA is necessary to connect to Rancher"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stdout]: [INFO] Successfully downloaded CA certificate"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stdout]: [INFO] Value from https://10.21.10.10/cacerts is an x509 certificate"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stderr]: curl: (60) SSL: no alternative certificate subject name matches target host name '10.21.10.10'"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stderr]: More details here: https://curl.se/docs/sslcerts.html"
Sep 10 14:33:29 hv03 rancherd[2050]: time="2023-09-10T14:33:29Z" level=info msg="[stderr]: "
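For reference, the SAN mismatch can be confirmed from the joining node by inspecting the certificate the VIP actually serves. A quick check (a sketch; 10.21.10.10 and harvester.slack.house are the VIP IP and hostname from this report):
# Sketch: print the Subject Alternative Names of the certificate served on the VIP
echo | openssl s_client -connect 10.21.10.10:443 -servername harvester.slack.house 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
# a wildcard DNS entry with no IP entries here is what makes curl fail with error (60)
# when install.sh contacts the server by IP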
Checking install.sh I see:
hv03:/oem # head -n10 /var/lib/rancher/rancherd/install.sh
#!/usr/bin/env sh
CATTLE_AGENT_BINARY_BASE_URL="https://10.21.10.10/assets"
CATTLE_SERVER=https://10.21.10.10
CATTLE_CA_CHECKSUM="798db3fbaa5fe211159fa1cddb8e1e2cdbca12e3d355a491e6bf8134d2f14272"
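As a quick sanity check (a sketch, not a documented procedure): CATTLE_CA_CHECKSUM appears to be the sha256 of whatever https://<CATTLE_SERVER>/cacerts returns, so comparing the two shows the script really was generated against the VIP IP:
# Sketch: hash the CA bundle served at the VIP and compare it with CATTLE_CA_CHECKSUM above
curl -sk https://10.21.10.10/cacerts | sha256sum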
This is the IP of the cluster VIP (harvester.slack.house). The IP was never entered during installation; only ‘harvester.slack.house’ was given for the VIP.
No proxy was configured, and no remote Harvester config was used.
I do not understand why rancherd is placing these variables at the top of install.sh, or why it converts the DNS name to an IP 😦
Checking the same file on a v1.1.2 node, only the first variable is set, and it uses the same value entered in the installer (the DNS name).
I have not found a workaround.
About this issue
- State: closed
- Created 10 months ago
- Reactions: 1
- Comments: 17 (9 by maintainers)
After syncing with @starbops, the workaround looks great with v1.2.0.
One thing noticed is that rancherd.service did succeed in coming up on join, and the node was able to join the cluster as an additional host. But rancher-system-agent.service did see the similar behavior, where the workaround would need to be implemented to succeed. Attaching the logs from both systemd services. Still need to verify with v1.1.2 - so will be doing that now 😄 👍
rancher-system-agent.service.log rancherd.service.log
Testing with v1.1.2, this looks good as well 😄 - I’ll go ahead and close this out 😄 👍 - thanks @starbops !
Test Plan
Basically, the following steps are distilled from harvester/docs#462. The issue happens on v1.1.2, too. So, ideally, we’d like to verify the documented workarounds work on v1.2 and v1.1 branches respectively.
- Update the ssl-certificates setting with the generated certificate files (for the commands to generate those files, you can refer to step 2).
I followed the following steps to replicate the issue:
- Use the harvester.$VIP.sslip.io address for the VIP.
- Joining the node uses the server-url from the embedded Rancher, which currently points to the VIP and results in the SAN error.
- Change the server-url (kubectl edit setting.management server-url) to the DNS record associated with the new certificate (see the sketch below).
- systemctl restart rancherd on the joining node.
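A minimal sketch of that server-url change, assuming kubectl access to the existing cluster; harvester.example.com below is a placeholder for whatever DNS name the new certificate actually covers:
# Sketch: point the embedded Rancher's server-url at the DNS name the certificate covers
kubectl patch settings.management.cattle.io server-url --type merge -p '{"value":"https://harvester.example.com"}'
# (kubectl edit settings.management.cattle.io server-url works interactively as well)
# then, on the joining node, restart the bootstrap service so the join is retried
systemctl restart rancherd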