rancher: [BUG] Rancher managed RKE2 clusters stuck in "Waiting for probes: kube-controller-manager, kube-scheduler"

As the issue occurs on nearly every cluster we own, I'll try to sum up version information for all of them:

Rancher Server Setup

  • Rancher version: 2.6.9 and 2.7.1
  • Installation option (Docker install/Helm Chart):
    • Helm chart on RKE2 cluster version 1.24
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version:
  • Cluster Type (Local/Downstream):
    • Downstream RKE2 Custom (versions 1.24.7+rke2r1, 1.24.8+rke2r1, 1.24.9+rke2r2)

User Information

  • What is the role of the user logged in?
    • Admin

Describe the bug On all our clusters, upgrades are frozen, with one or more control plane nodes stuck in the state Waiting for probes: kube-controller-manager, kube-scheduler.

To Reproduce Not sure how to reproduce it, but all our clusters have hundreds of days of uptime, and they did not have any problem prior to Rancher 2.6.9 / Kubernetes 1.24.7. We have a mix of clusters on Rancher 2.6.9 and 2.7.1, with Kubernetes versions from 1.24.7 to 1.24.9. The problem appears after we edit the cluster to upgrade to a newer version.

Result

The upgrade is stuck, waiting on one or more control plane nodes stuck in the state Waiting for probes: kube-controller-manager, kube-scheduler.

Expected Result

We expect upgrades to work without being stuck.

Screenshots

Ex cluster 1 (one control plane node impacted): image

Ex cluster 2 (all control plane nodes impacted): image

Additional context We tried to upgrade a cluster that was stuck going from 1.24.7 to 1.24.9 by upgrading it to 1.24.11; nothing happened. When we restarted rancher-system-agent, the node was effectively on 1.24.11, but still stuck in the state Waiting for probes: kube-controller-manager, kube-scheduler. (SURE-6264)
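For reference, restarting the agent on an affected node (a sketch, assuming the standard systemd-managed install used by Rancher custom clusters) looks like this:

# Restart the Rancher system agent so it re-applies the current plan on this node
systemctl restart rancher-system-agent.service

# Follow its logs to see whether the plan (and the probes) make progress
journalctl -u rancher-system-agent.service -f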

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 6
  • Comments: 23 (4 by maintainers)

Most upvoted comments

Solution found! Thanks for the hint, Chris Kim and brandond, on the Rancher Community Slack!

There is no official solution right now, but here is the manual fix that worked for me. You can run these commands on every node that is stuck in the Waiting for probes: kube-controller-manager, kube-scheduler state (i.e. control plane nodes).

Here is a shell command to check whether the probes are okay or not:

(
# Probe kube-controller-manager's healthz endpoint, trusting its serving cert as the CA (the same cert Rancher's probe uses)
curl -sf --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt \
  https://127.0.0.1:10257/healthz >/dev/null 2>&1 \
  && echo "[OK] Kube Controller probe" \
  || echo "[FAIL] Kube Controller probe";

# Same check for kube-scheduler on its own port and cert
curl -sf --cacert /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
  https://127.0.0.1:10259/healthz >/dev/null 2>&1 \
  && echo "[OK] Scheduler probe" \
  || echo "[FAIL] Scheduler probe";
)
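As noted further down in the thread, the root cause is these self-signed certificates expiring (hence the later recommendation to rotate certificates at least once a year). A quick way to confirm that on a node, assuming the default RKE2 data directory, is to print their validity dates:

# Show notBefore/notAfter for each probe certificate; an expired notAfter matches the failing probe above
for component in kube-controller-manager kube-scheduler; do
  echo "== $component =="
  openssl x509 -noout -dates -in "/var/lib/rancher/rke2/server/tls/$component/$component.crt"
done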

And below are the commands I used to force certificate rotation for the failed probes:

echo "Rotating kube-controller-manager certificate"
rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
crictl rm -f $(crictl ps -q --name kube-controller-manager)

echo "Rotating kube-scheduler certificate"
rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}
crictl rm -f $(crictl ps -q --name kube-scheduler)

Enjoy!

If you get this error:

Command 'crictl' not found, ...

You just have to set a few environment variables to make crictl available and working:

export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
export CONTAINERD_ADDRESS=unix:///run/k3s/containerd/containerd.sock
export PATH=$PATH:/var/lib/rancher/rke2/bin
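With those variables set, the crictl commands from the rotation snippet should work as-is; a quick sanity check (assuming you are root on the node) is simply:

# Confirm crictl can reach containerd and see the control plane containers
crictl ps --name kube-controller-manager
crictl ps --name kube-scheduler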

It’s wonderful that there is still no fix!

@AlexisDucastel I really appreciate your description and the steps followed when exposing this issue. It saved me a lot of time.

@AlexisDucastel - thanks for the bash steps above! 😄

That’s fixed the issue on all of our clusters 🥳

These certs would not be generated across a wide variety of Rancher versions and cluster versions.

All RKE2 custom clusters, installed by Rancher.

This was the fix:

timeout 1 openssl s_client -connect 127.0.0.1:10257 -showcerts 2>&1 | grep -A 19 -m 1 'BEGIN CERTIFICATE' | sudo tee /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt

timeout 1 openssl s_client -connect 127.0.0.1:10259 -showcerts 2>&1 | grep -A 19 -m 1 'BEGIN CERTIFICATE' | sudo tee /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt
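Note that the grep -A 19 above assumes the PEM block is exactly 20 lines long. A variant of the same idea that copies the whole served certificate regardless of its length (just a sketch using standard openssl piping, not an official fix) would be:

# Fetch the certificate currently served on each port and overwrite the file used by the probe
echo | openssl s_client -connect 127.0.0.1:10257 2>/dev/null | openssl x509 \
  | sudo tee /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt
echo | openssl s_client -connect 127.0.0.1:10259 2>/dev/null | openssl x509 \
  | sudo tee /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt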

@Rodri9o, I’m sorry to hear about your dissatisfaction with the support received.

However, please note that this issue is closed as fixed in 2.7.5. As per the code changes in this PR, the certificates that may result in a cluster getting stuck on upgrade are now also rotated when using the “Rotate Certificates” feature. It’s highly recommended to rotate certificates periodically (at least once a year) to ensure they do not expire.

The comment referenced is a workaround that can be applied if you’re running Rancher versions prior to 2.7.5 and/or got into the situation described in this issue.

Could you please clarify what your expectations are for the fix? Please note that it’s not impossible that you are running into some other issue that manifests similarly.

@emoxam, you reported quite recently that this issue is not fixed. Please feel free to chime in with more information.

@Josh-Diamond, with https://github.com/rancher/rancher/issues/41613 determined as the root cause for the reopen and now in “To Test”, moving this one to “To Test” as well. cc: @Sahota1225

I can confirm that @AlexisDucastel's steps (cf. Step 1, then Step 2) solve the issue. Can someone from @rancher/all please look into this, or at least give official guidance?

Note: this only affects RKE2 custom clusters managed by Rancher; for clusters deployed manually with RKE2 and then imported into Rancher, upgrades work without any issue.