kube-state-metrics: Repeated OOM'ing (perhaps due to a large number of namespaces)
/kind bug
What happened:
I’m running kube-state-metrics as part of kube-prometheus, but it is repeatedly being OOMKilled.
I suspect this is because of the large number of namespaces we have. Some bits of information:
$ kubectl get ns | wc -l
238
$ kubectl get nodes | wc -l
47
$ kubectl get pods --all-namespaces | wc -l
4008
$ kubectl get secrets --all-namespaces | wc -l
8313
The resource requests and limits are: { "cpu": "188m", "memory": "5290Mi" }. (Unfortunately, I’m having trouble capturing resource utilization right before the OOM.)
What you expected to happen:
Not OOM
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.7", GitCommit:"dd5e1a2978fd0b97d9b78e1564398aeea7e7fe92", GitTreeState:"clean", BuildDate:"2018-04-19T00:05:56Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.4-gke.2", GitCommit:"eb2e43842aaa21d6f0bb65d6adf5a84bbdc62eaf", GitTreeState:"clean", BuildDate:"2018-06-15T21:48:39Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
- Kube-state-metrics image version: "quay.io/coreos/kube-state-metrics:v1.3.1"
About this issue
- State: closed
- Created 6 years ago
- Comments: 17 (7 by maintainers)
For anyone else who lands here investigating a similar issue: a large aggregate number of any/all resources tracked by this exporter can cause it to use a fair bit of memory. The simplest way to check is to query Prometheus for counts of the kube_* metrics; if those have fallen out of your retention window, you can also bump up the memory limit on the exporter and then query it directly (see the examples after this comment). In my case I learned that Helm doesn’t necessarily clean up old release revisions, and I had 3600+ ConfigMaps cluttering up the cluster.
Once you get your house in order you can restart the exporter to check that its memory usage is within reason, and then bump the limit back down.
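For reference, here is roughly how that counting could look. These are only sketches: the PromQL metric names are kube-state-metrics defaults, the port-forward assumes the exporter is exposed as the kube-state-metrics service on port 8080 in the monitoring namespace, and the last command assumes Helm 2, which stores release revisions as ConfigMaps labeled OWNER=TILLER in kube-system.

Per-resource series counts, run against Prometheus:
count(kube_pod_info)
count(kube_secret_info)
count(kube_configmap_info)
count by (__name__) ({__name__=~"kube_.*"})

Or query the exporter directly after raising its memory limit:
$ kubectl -n monitoring port-forward svc/kube-state-metrics 8080:8080
$ curl -s localhost:8080/metrics | grep -c '^kube_configmap_info'

Counting leftover Helm 2 release revisions:
$ kubectl -n kube-system get configmaps -l OWNER=TILLER | wc -l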
Could you try removing the addon-resizer and removing all resource limits and requests entirely? I have a feeling that the resource recommendations we currently ship are off; they come from scalability tests run about a year ago.
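A minimal sketch of what that could look like with kubectl, assuming the deployment lives in the monitoring namespace and the addon-resizer is the second container in the pod spec (with kube-prometheus you would normally make the same change in the generated manifests instead):

Drop the addon-resizer sidecar (container index is an assumption, check your spec first):
$ kubectl -n monitoring patch deployment kube-state-metrics --type=json -p='[{"op": "remove", "path": "/spec/template/spec/containers/1"}]'

Clear requests and limits on the kube-state-metrics container itself:
$ kubectl -n monitoring patch deployment kube-state-metrics --type=json -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources"}]'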