harvester: [BUG] Upgrade to Harvester 1.2.0 fails in fleet-agent due to customer provided SSL certificate without IP SAN

Describe the bug

The harvester upgrade gets stuck in upgrading the helmchart based deployments (i.e. rancher-monitoring-crd) and we see this error in the fleet-agent pod in cattle-fleet-local-system:

  1. upgrade POD log, looping
Current version: 102.0.0+up40.1.2, Current state: WaitApplied, Current generation: 23
Sleep for 5 seconds to retry
  1. related resources
~ # kubectl get managedchart -A
NAMESPACE   NAME                   AGE
fleet-local  harvester                 439d
fleet-local  harvester-crd               439d
fleet-local  local-managed-system-upgrade-controller  439d
fleet-local  rancher-logging              318d
fleet-local  rancher-logging-crd            318d
fleet-local  rancher-monitoring            439d
fleet-local  rancher-monitoring-crd          439d

 # kubectl get bundle -A
NAMESPACE   NAME                     BUNDLEDEPLOYMENTS-READY  STATUS
fleet-local  fleet-agent-local               1/1            
fleet-local  local-managed-system-agent          1/1            
fleet-local  mcc-harvester                 1/1            
fleet-local  mcc-harvester-crd               1/1            
fleet-local  mcc-local-managed-system-upgrade-controller  1/1            
fleet-local  mcc-rancher-logging              0/1            OutOfSync(1) [Cluster fleet-local/local]
fleet-local  mcc-rancher-logging-crd            0/1            OutOfSync(1) [Cluster fleet-local/local]
fleet-local  mcc-rancher-monitoring            0/1            OutOfSync(1) [Cluster fleet-local/local]
fleet-local  mcc-rancher-monitoring-crd          0/1            WaitApplied(1) [Cluster fleet-local/local]
  1. The reason is, the fleet-agent-* POD has following error log, it can’t handle the sync of the related managedchart/bundles.

time="2023-09-12T10:17:20Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.0.34/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.0.34 because it doesn't contain any IP SANs"

To Reproduce Deploy harvester 1.1.2 and add a customer provided SSL certificate that has a single DNS SAN that points to the VIP of harvester. Then upgrade to 1.2.0

Expected behavior Upgrade works.

Environment

  • Harvester ISO version: 1.1.2 -> 1.2.0
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): not relevant

Additional context Root cause seems to be that fleet-agent -> rancher communication happens over the setting settings.management.cattle.io -> server-url which has https://<ip> instead of https://<fqdn>

About this issue

  • Original URL
  • State: closed
  • Created 10 months ago
  • Reactions: 2
  • Comments: 28 (13 by maintainers)

Most upvoted comments

@w13915984028 we need a test plan in the comment before moving to ready-for-testing (either the full plan or link). And also a workaround doc in the known issue section: https://docs.harvesterhci.io/v1.2/upgrade/v1-1-2-to-v1-2-0

@w13915984028 The change in https://github.com/harvester/harvester/pull/4543 works, but we still need to address the fleet apiServerURL issue, here is the fleet-controller after upgrade:

time="2023-09-20T10:19:24Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
time="2023-09-20T10:19:24Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
time="2023-09-20T10:19:24Z" level=error msg="error syncing 'fleet-local/local': handler import-cluster: missing apiServerURL in fleet config for cluster auto registration, requeuing"
time="2023-09-20T10:19:24Z" level=error msg="error syncing 'fleet-local/local': handler import-cluster: missing apiServerURL in fleet config for cluster auto registration, requeuing"
time="2023-09-20T10:19:26Z" level=error msg="error syncing 'fleet-local/local': handler import-cluster: missing apiServerURL in fleet config for cluster auto registration, requeuing"

The fleet-agent functions well because it has the correct setting (the fleet-controller doesn’t re-deploy it yet). We need fix the fleet-controller part.

e.g. the apiServerURL is https://harvester.example.com

echo "https://harvester.example.com" | base64
aHR0cHM6Ly9oYXJ2ZXN0ZXIuZXhhbXBsZS5jb20K

yeah, your FQDN

IMO we need an option during harvester installation and configuration that allows to specify the FQDN in addition to the VIP for harvester and the FQDN should be used by the fleet-agent and the SSL certificate should have the SAN for the FQDN.