rancher: [BUG] `apply-system-agent-upgrader-windows` pod is in an error state in Windows RKE2 cluster

Rancher Server Setup

  • Rancher version: 2.6.7-rc9
  • Installation option (Docker install/Helm Chart): RKE

Information about the Cluster

  • Kubernetes version: v1.24.2+rke2r1
  • Cluster Type (Local/Downstream): Downstream Windows cluster - 1 etcd, 1 control plane, and 1 worker node (each Linux), plus 1 Windows Server 2022 worker node

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) admin

Describe the bug [BUG] apply-system-agent-upgrader-windows pod is in an error state in Windows RKE2 cluster with rancher/wins:v0.4.5 image

To Reproduce

  • Deploy a Windows RKE2 cluster
  • The rancher/system-upgrade-controller:v0.8.1 deployment fails to deploy
  • The system-upgrade-controller-df77f4dfd-75dtl pod is deployed successfully on the Linux control plane node
  • On the System project page, notice that the apply-system-agent-upgrader-windows pod, which is deployed on the Windows node, is in an error state (a kubectl sketch for inspecting this follows the list)
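
For anyone checking this from the command line instead of the UI, a minimal kubectl sketch, assuming the upgrade plan and its job pods live in the cattle-system namespace and `<pod-name>` is a placeholder for the failing apply-system-agent-upgrader-windows pod:

    # List the system-upgrade-controller plans and their apply-* job pods
    kubectl -n cattle-system get plans.upgrade.cattle.io
    kubectl -n cattle-system get pods | grep apply-system-agent-upgrader

    # Inspect the failing Windows upgrader pod (<pod-name> is a placeholder)
    kubectl -n cattle-system describe pod <pod-name>
    kubectl -n cattle-system logs <pod-name>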

Note:

  • This issue is seen when monitoring is deployed on the cluster.
  • Also seen when upgrading the RKE2 version from 1.23 to 1.24

(SURE-5462)


Most upvoted comments

Hey guys, do you have an update on this one?

I am able to reproduce this on Rancher 2.6.8:

  • New custom RKE2 cluster v1.24.4-rke2r1-85bc7d7bec85
  • rancher/system-upgrade-controller:v0.9.1
  • rancher/wins:v0.4.7

Hello, same issue for me. I have Rocky Linux 8 servers, and the apply-system-agent-upgrader-* pods fail with the message:

    exec /opt/rancher-system-agent-suc/run.sh: argument list too long

I didn't find any solution for how to fix this… 😦 Rancher version 2.7.2, the latest K8s version installed, and the Canal network. No additional configuration of the Linux servers, as they don't need anything…

@marekvesely-direct This error message is unrelated to the issue reported in this GitHub issue, and can instead be triggered by a misconfigured tls-ca Secret in the cattle-system Namespace of the Rancher local cluster.

Where the Rancher TLS certificate is signed by a private CA, or requires intermediate certificate(s) to establish a chain of trust to a publicly trusted root CA, the tls-ca Secret should contain the root CA, followed by any intermediate certificates (https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/resources/add-tls-secrets#using-a-private-ca-signed-certificate).
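
As a rough sketch of what that documented layout looks like (file names here are placeholders, and the ordering follows the description above):

    # Bundle only the Rancher root CA plus any intermediates, root first
    cat root-ca.pem intermediate-ca.pem > cacerts.pem

    # Create the tls-ca Secret in the Rancher local cluster from that bundle
    kubectl -n cattle-system create secret generic tls-ca \
      --from-file=cacerts.pem=./cacerts.pem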

These cacerts are retrieved from Rancher by the cluster-agent in a downstream cluster, which populates the ca.crt field of the stv-aggregation Secret in the cattle-system Namespace with their contents.

The stv-aggregation Secret is then populated as environment variables in the upgrade container of the apply-system-agent-upgrader-* Job Pods via an envFrom reference:

    envFrom:
    - secretRef:
        name: stv-aggregation

As a result, if the tls-ca Secret in the Rancher local cluster is misconfigured, for example with a complete list of an organization's CA certificates, the resulting environment variable in the apply-system-agent-upgrader-* Job can be too large, leading to an exec /opt/rancher-system-agent-suc/run.sh: argument list too long error when attempting to run the Pod.
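
A quick way to confirm this condition on the downstream cluster (a diagnostic sketch, not part of the original report) is to check how large the ca.crt value in the stv-aggregation Secret actually is; a value approaching the Linux per-string argument/environment limit of roughly 128 KiB would explain the error:

    # Decode the ca.crt key of the stv-aggregation Secret and report its size in bytes
    kubectl -n cattle-system get secret stv-aggregation \
      -o jsonpath='{.data.ca\.crt}' | base64 -d | wc -c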

This state can be resolved by recreating the tls-ca Secret with only the Rancher root CA (and any intermediate certificates) and performing a rollout restart of the Rancher Deployment (https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/resources/update-rancher-certificate#2-createupdate-the-ca-certificate-secret-object).
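
For reference, a sketch of those remediation steps on the Rancher local cluster, following the linked documentation (file names are placeholders):

    # Replace the tls-ca Secret with a bundle containing only the Rancher root CA
    # (and any intermediates)
    kubectl -n cattle-system delete secret tls-ca
    kubectl -n cattle-system create secret generic tls-ca \
      --from-file=cacerts.pem=./cacerts.pem

    # Restart the Rancher Deployment so downstream agents pick up the corrected cacerts
    kubectl -n cattle-system rollout restart deployment/rancher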

Intended Test Setup

  • Rancher version(s): v2.6-head
  • Browser type & version: Brave & Most Current
  • Rancher Host: EC2 (Jenkins Job)
  • Rancher Setup:
      ◦ RANCHER_HA_CERT_OPTION: byo-valid
      ◦ RKE_VERSION: 1.3.13-rc4
      ◦ AWS_VOLUME_SIZE: 20
      ◦ AWS_AMI: ami-066f14e43aebc7472
      ◦ AWS_SECURITY_GROUPS: sg-0e753fd5550206e55 (open-all)
      ◦ AWS_INSTANCE_TYPE: t3a.xlarge
      ◦ AWS_VPC: vpc-bfccf4d7
      ◦ AWS_SUBNET: subnet-ee8cac86
      ◦ AWS_REGION: us-east-2
  • Deployment Type: Helm
  • User Permission: Admin and Standard User

Test Approach

  • Verify that when a Windows node is deployed on an RKE2 custom cluster, the rancher/system-upgrade-controller:v0.8.1 deployment is able to deploy successfully
  • Verify that on the System project page, the apply-system-agent-upgrader-windows pod is not in an error state
  • Verify that the Windows node goes “Active” in Rancher
  • Verify that multiple Windows nodes can be deployed on a cluster without error
  • Verify that upgrading from RKE2 1.23 to 1.24 does not reproduce this error
  • Verify that deploying monitoring on the cluster does not reproduce this issue

cc @markusewalker

Awaiting PR https://github.com/rancher/rancher/pull/38663 to be merged. Once it is merged, QA will be able to come back and validate this issue with the changes @rosskirkpat has made.