kubernetes: kube-controller-manager v1.26 high cpu usage
What happened?
We upgraded the cluster from v1.24.6 to v1.26.4 (via v1.25.9), and since then kube-controller-manager has been consuming all available CPU.
I also see a massive increase in the rate of the workqueue_retries_total{name="cronjob"} metric, from 2-3 per second to 20-30k per second.
I dumped a pprof profile from kube-controller-manager, and it also shows heavy CronJob-related load.
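The numbers above are per-second rates of a cumulative counter. As a rough sketch of that arithmetic, the rate is the counter increase divided by the elapsed time, so going from 2-3/s to 20-30k/s is roughly a 10,000x jump. The counterSample type and perSecondRate function are hypothetical, and the sample values are made up to match the reported magnitudes. They are not taken from the affected cluster.

```go
package main

import "fmt"

// counterSample is a hypothetical type for one scrape of a cumulative
// counter such as workqueue_retries_total{name="cronjob"}.
type counterSample struct {
	unixSeconds float64 // scrape timestamp in seconds
	value       float64 // cumulative counter value at that timestamp
}

// perSecondRate returns the average per-second increase between two samples.
// If the counter went down (e.g. the process restarted), the later value is
// treated as the whole increase, similar in spirit to how Prometheus handles
// counter resets. It returns 0 if the samples are not in increasing time order.
func perSecondRate(earlier, later counterSample) float64 {
	elapsed := later.unixSeconds - earlier.unixSeconds
	if elapsed <= 0 {
		return 0
	}
	increase := later.value - earlier.value
	if increase < 0 {
		// Counter reset: assume it restarted from zero between the samples.
		increase = later.value
	}
	return increase / elapsed
}

func main() {
	// Illustrative values only.
	before := perSecondRate(counterSample{0, 1000}, counterSample{60, 1150})  // 150 retries in 60 s
	after := perSecondRate(counterSample{0, 5000}, counterSample{60, 1505000}) // 1,500,000 retries in 60 s
	// prints: before: 2.5/s, after: 25000/s, ratio: 10000x
	fmt.Printf("before: %.1f/s, after: %.0f/s, ratio: %.0fx\n", before, after, after/before)
}
```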
What did you expect to happen?
kube-controller-manager CPU usage to stay the same as before the upgrade.
How can we reproduce it (as minimally and precisely as possible)?
I don't know. We have 7 similar clusters on the same version, and the issue is present in only one of them.
Anything else we need to know?
The affected cluster has ~170 CronJobs and ~300 Jobs.
Deleting some CronJobs and old failed Jobs had no visible effect.
Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.4", GitCommit:"f89670c3aa4059d6999cb42e23ccb4f0b9a03979", GitTreeState:"clean", BuildDate:"2023-04-12T12:05:35Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
none, self-hosted baremetal
OS version
# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ uname -a
Linux kube-[REDACTED] 5.15.0-53-generic #59~20.04.1-Ubuntu SMP Thu Oct 20 15:10:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Install tools
ansible
Container runtime (CRI) and version (if applicable)
containerd 1.6.8-1
Related plugins (CNI, CSI, …) and versions (if applicable)
flannel v0.14.0
About this issue
- State: closed
- Created a year ago
- Comments: 25 (17 by maintainers)
Commits related to this issue
- trace: #118706 - bug-case reproduced — committed to sxllwx/kubernetes by sxllwx 9 months ago
- bugfix: make test for #118706 — committed to sxllwx/kubernetes by sxllwx 8 months ago
1.28 pick: https://github.com/kubernetes/kubernetes/pull/121536 (picks only #121327)
1.27 pick: https://github.com/kubernetes/kubernetes/pull/121537 (picks only #121327)
1.26 pick: https://github.com/kubernetes/kubernetes/pull/121540 (picks #110838 and #121327)
@sxllwx thanks for your effort on the reproducer.
I’ll make sure to prioritize this next week on Monday to figure out possible fixes.
Don't set TZ in the schedule string (that will be forbidden in future releases); set it in the CronJob spec via the spec.timeZone field instead.
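To act on that advice across many CronJobs, one option is a small audit that flags schedules carrying a time-zone prefix. The sketch below is only an illustration: splitScheduleTZ is a hypothetical helper, and it assumes the zone appears as a leading TZ= or CRON_TZ= token. It splits such a schedule into the zone name (a candidate for spec.timeZone) and the remaining cron expression (a candidate for spec.schedule), and checks that the zone name is known to Go's time-zone database.

```go
package main

import (
	"fmt"
	"strings"
	"time"
	_ "time/tzdata" // embed the tz database so LoadLocation works without system zoneinfo
)

// splitScheduleTZ is a hypothetical audit helper. If the schedule starts with
// a "TZ=<zone>" or "CRON_TZ=<zone>" token, it returns the zone name, the
// remaining cron expression, and true; otherwise it returns the schedule
// unchanged and false.
func splitScheduleTZ(schedule string) (zone, rest string, found bool) {
	fields := strings.Fields(schedule)
	if len(fields) == 0 {
		return "", schedule, false
	}
	for _, prefix := range []string{"CRON_TZ=", "TZ="} {
		if strings.HasPrefix(fields[0], prefix) {
			return strings.TrimPrefix(fields[0], prefix), strings.Join(fields[1:], " "), true
		}
	}
	return "", schedule, false
}

func main() {
	// Example schedules; in practice these would come from the cluster's CronJobs.
	schedules := []string{"TZ=Europe/Berlin 0 3 * * *", "*/5 * * * *"}
	for _, s := range schedules {
		zone, rest, found := splitScheduleTZ(s)
		if !found {
			fmt.Printf("%q: no TZ prefix\n", s)
			continue
		}
		// Reject zone names the tz database does not know about.
		if _, err := time.LoadLocation(zone); err != nil {
			fmt.Printf("%q: unknown time zone %q: %v\n", s, zone, err)
			continue
		}
		fmt.Printf("%q: set spec.timeZone=%q and spec.schedule=%q\n", s, zone, rest)
	}
}
```

Splitting on whitespace with strings.Fields keeps the check independent of how many spaces separate the prefix from the cron fields.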