rancher: [BUG] High memory usage on v2.7.5

Rancher Server Setup

  • Rancher version: v2.7.5
  • Installation option (Docker install/Helm Chart): Helm Chart
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): v1.23.17-eks-a5565ad
  • Proxy/Cert Details:

Describe the bug

After upgrading Rancher from v2.6.13 to v2.7.5 we didn’t face any problems right after the upgrade, but within 7 days we had to switch our main node group from t3.large (8 GB of memory per node) to t3.xlarge (16 GB of memory per node), otherwise we were not able to keep healthy rancher pods (they kept crashing with a mix of OOMKilled, Evicted and ContainerStatusUnknown statuses).
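
For reference, the crash reasons were visible with the usual commands (the pod name below is just one example from our deployment):

$ kubectl get pods -n cattle-system
$ kubectl describe pod rancher-c875bc68b-wkrjl -n cattle-system
$ kubectl get events -n cattle-system --sort-by=.lastTimestamp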

Current situation with healthy pods:

$ kubectl top pod -n cattle-system
NAME                                   CPU(cores)   MEMORY(bytes)
eks-config-operator-57f94d69dd-gsdf8   2m           176Mi
rancher-c875bc68b-sftq2                179m         5505Mi
rancher-c875bc68b-wkrjl                659m         8383Mi
rancher-c875bc68b-xspj4                232m         6442Mi
rancher-webhook-648db6b695-j7ptw       44m          1287Mi

Rancher manages ~20 clusters, most of them running v1.27.3-eks-a5565ad and a single one still on an old v1.21.14-eks-a5565ad.
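
For completeness, node-level usage on the node group can be checked the same way:

$ kubectl top node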

To Reproduce

Result

Expected Result

Screenshots

Additional context

About this issue

  • State: closed
  • Created a year ago
  • Comments: 19 (11 by maintainers)

Most upvoted comments

Hey @gionn thanks!

I think it is fair to close this issue, as the main reported symptoms are addressed. We are still researching ways to embed a more effective auto-cleanup, which might include something functionally similar to your script, in one of the upcoming versions.

We will reserve the right to poke you in future when we have something to test - if you are OK with that!

And of course, this issue can be re-opened (or a new one can be opened) if the symptoms come back.

Have a great day!

Give it some time to let the cleanup kick in

At this point I could try testing 0.7.1-rc.1 to see if it also helps with cleaning up these resources, can you confirm @manno?

Yes. 0.7.1-rc.1 contains a mechanism which does the equivalent of the cleanup script, directly integrated in Fleet.

I also need confirmation that I just need to manually run a helm upgrade with --reuse-values for both the fleet-crd and fleet charts
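
Something along these lines is what I have in mind (the chart repo URL is the one from Fleet's standalone install docs; whether the RC is published there, and the exact namespace in a Rancher-managed install, are assumptions on my part):

$ helm repo add fleet https://rancher.github.io/fleet-helm-charts/
$ helm repo update
$ helm -n cattle-fleet-system upgrade fleet-crd fleet/fleet-crd --version 0.7.1-rc.1 --reuse-values
$ helm -n cattle-fleet-system upgrade fleet fleet/fleet --version 0.7.1-rc.1 --reuse-values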

I will be posting test instructions later this morning, thanks for your openness to try out the RC!

The TokenRequest API warnings seem to depend on a beta feature activated on EKS clusters: https://github.com/kubernetes/kubernetes/pull/117591

This will be the default in 1.28, and I expect Rancher to be updated by that time (versions later than 1.26 are not officially supported: https://www.suse.com/suse-rancher/support-matrix/all-supported-versions/rancher-v2-7-5/)

I think that is a red herring (annoying but not harmful).

A high number of fleet secrets sounds like an instance of https://github.com/rancher/fleet/issues/1651; can you please check whether the cleanup script posted there helps?

https://github.com/rancher/fleet/issues/1651#issuecomment-1640322635
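
As a quick sanity check, independent of the script itself, counting secrets per namespace should show whether they are piling up (a generic one-liner, not the cleanup script from that issue):

$ kubectl get secrets -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn | head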