rancher: [BUG] `apply-system-agent-upgrader-windows` pod is in an error state in Windows RKE2 cluster
Rancher Server Setup
- Rancher version: 2.6.7-rc9
- Installation option (Docker install/Helm Chart): RKE
Information about the Cluster
- Kubernetes version: v1.24.2+rke2r1
- Cluster Type (Local/Downstream): Downstream Windows cluster - 1 etcd/control/worker node each (Linux), 1 Windows 2022 worker node
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) admin
Describe the bug
[BUG] `apply-system-agent-upgrader-windows` pod is in an error state in Windows RKE2 cluster with rancher/wins:v0.4.5 image
To Reproduce
- Deploy a Windows RKE2 cluster
- rancher/system-upgrade-controller:v0.8.1 deployment fails to deploy
- system-upgrade-controller-df77f4dfd-75dtl is deployed successfully on the control plane Linux node
- Notice in the System project page that the `apply-system-agent-upgrader-windows` pod, which is deployed on the Windows node, is in an error state
Note:
- This issue is seen when monitoring is deployed on the cluster.
- Also seen when you upgrade RKE2 version from 1.23 to 1.24
(SURE-5462)
About this issue
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 16 (9 by maintainers)
Hey guys, do you have an update on this one?
I am able to reproduce this on Rancher 2.6.8 with a new custom RKE2 cluster (v1.24.4-rke2r1-85bc7d7bec85), rancher/system-upgrade-controller:v0.9.1, and rancher/wins:v0.4.7.
@marekvesely-direct This error message is unrelated to the issue reported in this GitHub issue, and can instead be triggered by a misconfigured tls-ca Secret in the cattle-system Namespace of the Rancher local cluster.
Where the Rancher TLS certificate is signed by a private CA, or requires intermediate certificate(s) to establish a chain of trust to a publicly trusted root CA, the tls-ca Secret should contain the root CA, followed by any intermediate certificates (https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/resources/add-tls-secrets#using-a-private-ca-signed-certificate).
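As a rough sanity check on the bundle described above, the following Python sketch counts the PEM certificate blocks in a CA file. It only does text matching; it does not parse the certificates or verify the chain order. The function name `split_pem_bundle` is hypothetical and not part of any Rancher tooling.

```python
import re
from typing import List

# Matches one PEM-encoded certificate block, including its BEGIN/END markers.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(bundle: str) -> List[str]:
    """Return each certificate block found in a PEM bundle, in file order."""
    return PEM_CERT.findall(bundle)


if __name__ == "__main__":
    # Illustrative placeholder data, not real certificates.
    sample = (
        "-----BEGIN CERTIFICATE-----\nAAA\n-----END CERTIFICATE-----\n"
        "-----BEGIN CERTIFICATE-----\nBBB\n-----END CERTIFICATE-----\n"
    )
    print(len(split_pem_bundle(sample)), "certificate block(s) found")
```

A bundle that holds only a root CA and its intermediates would normally yield a handful of blocks. A much larger count may suggest that an organization-wide CA list was used instead.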
These cacerts are retrieved from Rancher by the cluster-agent in a downstream cluster, which populates the ca.crt field of the stv-aggregation Secret in the cattle-system Namespace with their contents.
The stv-aggregation Secret is then populated as environment variables in the upgrade container of the `apply-system-agent-upgrader-*` Job Pods via an envFrom reference.
As a result, if the tls-ca Secret in the Rancher local cluster is misconfigured, for example with a complete list of an organization's CA certificates, the resulting environment variable in the `apply-system-agent-upgrader-*` Job can be too large, leading to an "exec /opt/rancher-system-agent-suc/run.sh: argument list too long" error when attempting to run the Pod.
This state can be resolved by recreating the tls-ca Secret with only the Rancher root CA (and any intermediate certificates) and performing a rollout restart of the Rancher deployment (https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/resources/update-rancher-certificate#2-createupdate-the-ca-certificate-secret-object).
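To illustrate why an oversized CA bundle can produce the "argument list too long" error, here is a minimal Python sketch that estimates whether a single environment entry would exceed the per-string limit Linux applies at exec time. It assumes the usual Linux value of 32 pages with 4 KiB pages (131072 bytes); other page sizes change the number. The variable name `ca.crt` is also an assumption, taken from the Secret field mentioned above, and the helper names are hypothetical.

```python
# Assumption: Linux rejects any single argv/envp string longer than 32 pages
# (MAX_ARG_STRLEN), i.e. 131072 bytes on systems with 4 KiB pages.
MAX_ARG_STRLEN = 32 * 4096


def env_entry_size(name: str, value: str) -> int:
    """Bytes used by one 'NAME=value' entry, including the trailing NUL."""
    return len(f"{name}={value}".encode("utf-8")) + 1


def exceeds_exec_limit(name: str, value: str, limit: int = MAX_ARG_STRLEN) -> bool:
    """True if this single entry alone would be too long to pass to exec."""
    return env_entry_size(name, value) > limit


if __name__ == "__main__":
    # Placeholder payloads standing in for small and very large CA bundles.
    small_bundle = "x" * 4_000
    huge_bundle = "x" * 200_000
    print(exceeds_exec_limit("ca.crt", small_bundle))
    print(exceeds_exec_limit("ca.crt", huge_bundle))
```

This only checks the per-string limit. The kernel also bounds the combined size of all arguments and environment entries, so a value under this threshold is not guaranteed to be safe.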
Intended Test Setup
Test Approach
cc @markusewalker
Awaiting PR https://github.com/rancher/rancher/pull/38663 to be merged. Once it is merged, QA will be able to come back and validate this issue with the changes @rosskirkpat has made.