harvester: [BUG] Upgrade to Harvester 1.2.0 fails in fleet-agent due to customer provided SSL certificate without IP SAN

Describe the bug

The harvester upgrade gets stuck in upgrading the helmchart based deployments (i.e. rancher-monitoring-crd) and we see this error in the fleet-agent pod in cattle-fleet-local-system:

upgrade POD log, looping

Current version: 102.0.0+up40.1.2, Current state: WaitApplied, Current generation: 23
Sleep for 5 seconds to retry

related resources

~ # kubectl get managedchart -A
NAMESPACE   NAME                   AGE
fleet-local  harvester                 439d
fleet-local  harvester-crd               439d
fleet-local  local-managed-system-upgrade-controller  439d
fleet-local  rancher-logging              318d
fleet-local  rancher-logging-crd            318d
fleet-local  rancher-monitoring            439d
fleet-local  rancher-monitoring-crd          439d

 # kubectl get bundle -A
NAMESPACE   NAME                     BUNDLEDEPLOYMENTS-READY  STATUS
fleet-local  fleet-agent-local               1/1            
fleet-local  local-managed-system-agent          1/1            
fleet-local  mcc-harvester                 1/1            
fleet-local  mcc-harvester-crd               1/1            
fleet-local  mcc-local-managed-system-upgrade-controller  1/1            
fleet-local  mcc-rancher-logging              0/1            OutOfSync(1) [Cluster fleet-local/local]
fleet-local  mcc-rancher-logging-crd            0/1            OutOfSync(1) [Cluster fleet-local/local]
fleet-local  mcc-rancher-monitoring            0/1            OutOfSync(1) [Cluster fleet-local/local]
fleet-local  mcc-rancher-monitoring-crd          0/1            WaitApplied(1) [Cluster fleet-local/local]

The reason is, the fleet-agent-* POD has following error log, it can’t handle the sync of the related managedchart/bundles.

time="2023-09-12T10:17:20Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.0.34/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.0.34 because it doesn't contain any IP SANs"

To Reproduce Deploy harvester 1.1.2 and add a customer provided SSL certificate that has a single DNS SAN that points to the VIP of harvester. Then upgrade to 1.2.0

Expected behavior Upgrade works.

Environment

Harvester ISO version: 1.1.2 -> 1.2.0
Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): not relevant

Additional context Root cause seems to be that fleet-agent -> rancher communication happens over the setting settings.management.cattle.io -> server-url which has https://<ip> instead of https://<fqdn>

About this issue

Original URL
State: closed
Created 10 months ago
Reactions: 2
Comments: 28 (13 by maintainers)

Most upvoted comments

@w13915984028 we need a test plan in the comment before moving to ready-for-testing (either the full plan or link). And also a workaround doc in the known issue section: https://docs.harvesterhci.io/v1.2/upgrade/v1-1-2-to-v1-2-0

bk201 on Oct 11, 2023

@w13915984028 The change in https://github.com/harvester/harvester/pull/4543 works, but we still need to address the fleet apiServerURL issue, here is the fleet-controller after upgrade:

time="2023-09-20T10:19:24Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
time="2023-09-20T10:19:24Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
time="2023-09-20T10:19:24Z" level=error msg="error syncing 'fleet-local/local': handler import-cluster: missing apiServerURL in fleet config for cluster auto registration, requeuing"
time="2023-09-20T10:19:24Z" level=error msg="error syncing 'fleet-local/local': handler import-cluster: missing apiServerURL in fleet config for cluster auto registration, requeuing"
time="2023-09-20T10:19:26Z" level=error msg="error syncing 'fleet-local/local': handler import-cluster: missing apiServerURL in fleet config for cluster auto registration, requeuing"

The fleet-agent functions well because it has the correct setting (the fleet-controller doesn’t re-deploy it yet). We need fix the fleet-controller part.

bk201 on Sep 20, 2023

e.g. the apiServerURL is https://harvester.example.com

echo "https://harvester.example.com" | base64
aHR0cHM6Ly9oYXJ2ZXN0ZXIuZXhhbXBsZS5jb20K

w13915984028 on Sep 12, 2023

yeah, your FQDN

w13915984028 on Sep 12, 2023

IMO we need an option during harvester installation and configuration that allows to specify the FQDN in addition to the VIP for harvester and the FQDN should be used by the fleet-agent and the SSL certificate should have the SAN for the FQDN.

Martin-Weiss on Sep 12, 2023