harvester: [BUG] Upgrade to Harvester 1.2.0 fails in fleet-agent due to customer provided SSL certificate without IP SAN
Describe the bug
The Harvester upgrade gets stuck while upgrading the Helm-chart-based deployments (e.g. rancher-monitoring-crd), and the fleet-agent pod in cattle-fleet-local-system logs an x509 certificate error (quoted further down):
- Upgrade pod log (looping):
Current version: 102.0.0+up40.1.2, Current state: WaitApplied, Current generation: 23
Sleep for 5 seconds to retry
- related resources
~ # kubectl get managedchart -A
NAMESPACE     NAME                                      AGE
fleet-local   harvester                                 439d
fleet-local   harvester-crd                             439d
fleet-local   local-managed-system-upgrade-controller   439d
fleet-local   rancher-logging                           318d
fleet-local   rancher-logging-crd                       318d
fleet-local   rancher-monitoring                        439d
fleet-local   rancher-monitoring-crd                    439d
# kubectl get bundle -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             1/1
fleet-local   local-managed-system-agent                    1/1
fleet-local   mcc-harvester                                 1/1
fleet-local   mcc-harvester-crd                             1/1
fleet-local   mcc-local-managed-system-upgrade-controller   1/1
fleet-local   mcc-rancher-logging                           0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-logging-crd                       0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-monitoring                        0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-monitoring-crd                    0/1                       WaitApplied(1) [Cluster fleet-local/local]
- The reason is that the fleet-agent-* pod logs the following error, so it cannot sync the related managedcharts/bundles:
time="2023-09-12T10:17:20Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.0.34/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.0.34 because it doesn't contain any IP SANs"
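The error above follows from how TLS hostname verification treats IP addresses: an IP host is matched only against IP SANs, never against DNS SANs. A minimal sketch of that rule in POSIX shell is given here. The function names `is_ipv4` and `san_covers` are hypothetical helpers, and the SAN list is assumed to be in the text format printed by `openssl x509 -noout -ext subjectAltName`. Wildcards, IPv6 and case-insensitive DNS matching are deliberately left out.

```shell
#!/bin/sh
# Sketch only: exact, case-sensitive matching; no wildcards; IPv4 literals only.

# Succeeds when $1 looks like a dotted-quad IPv4 literal.
is_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# san_covers HOST SAN_LIST
# SAN_LIST example: "DNS:harvester.example.com, IP Address:192.168.0.34"
# Succeeds (exit 0) when HOST appears as a SAN entry of the matching type.
san_covers() {
    host=$1
    san_list=$2
    if is_ipv4 "$host"; then
        want="IP Address:$host"   # IP hosts are checked against IP SANs only
    else
        want="DNS:$host"          # names are checked against DNS SANs only
    fi
    # Split on commas, trim surrounding spaces, then compare whole entries.
    printf '%s\n' "$san_list" | tr ',' '\n' | sed 's/^ *//; s/ *$//' | grep -Fxq -- "$want"
}
```

With the certificate from this report, `san_covers 192.168.0.34 "DNS:harvester.example.com"` fails, while `san_covers harvester.example.com "DNS:harvester.example.com"` succeeds, which matches the observed behaviour when the agent connects by IP.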
To Reproduce
Deploy Harvester 1.1.2 and add a customer-provided SSL certificate with a single DNS SAN that points to the Harvester VIP. Then upgrade to 1.2.0.
Expected behavior
The upgrade completes successfully.
Environment
- Harvester ISO version: 1.1.2 -> 1.2.0
- Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): not relevant
Additional context
The root cause seems to be that fleet-agent -> Rancher communication uses the URL from the settings.management.cattle.io server-url setting, which holds https://<ip> instead of https://<fqdn>.
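To check whether a cluster is affected, it may help to classify the host in server-url as an IP or a DNS name. The sketch below is one way to do that in POSIX shell. `url_host` and `host_kind` are hypothetical helper names, only IPv4 literals are recognised, and the `kubectl` line in the comments assumes the Setting object exposes the URL in its `.value` field.

```shell
#!/bin/sh
# Sketch only: IPv4 literals only; bracketed IPv6 hosts are not handled.

# Print the host part of a URL: drop the scheme, the path, and an optional :port.
url_host() {
    printf '%s\n' "$1" | sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' -e 's|/.*$||' -e 's|:[0-9]*$||'
}

# Print "ip" for a dotted-quad IPv4 literal, "dns" for anything else.
host_kind() {
    if printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'; then
        echo ip
    else
        echo dns
    fi
}

# Usage (assumption: the setting's URL is stored in .value):
#   url=$(kubectl get settings.management.cattle.io server-url -o jsonpath='{.value}')
#   host_kind "$(url_host "$url")"
```

A result of `ip` combined with a certificate that carries only DNS SANs would reproduce the situation described in this issue.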
About this issue
- State: closed
- Created 10 months ago
- Reactions: 2
- Comments: 28 (13 by maintainers)
@w13915984028 we need a test plan in the comment before moving to ready-for-testing (either the full plan or a link). We also need a workaround doc in the known issues section: https://docs.harvesterhci.io/v1.2/upgrade/v1-1-2-to-v1-2-0
@w13915984028 The change in https://github.com/harvester/harvester/pull/4543 works, but we still need to address the fleet apiServerURL issue. Here is the fleet-controller after upgrade:
The fleet-agent functions well because it has the correct setting (the fleet-controller has not re-deployed it yet). We need to fix the fleet-controller part.
e.g. the apiServerURL is https://harvester.example.com; yeah, your FQDN
IMO we need an option during Harvester installation and configuration to specify an FQDN in addition to the VIP. The fleet-agent should then use the FQDN, and the SSL certificate should carry a SAN for that FQDN.