coroot: High memory usage and OOM loop

Hi,

we are experiencing high memory usage and an OOM kill loop while running Coroot in our GKE cluster. The Coroot container tries to allocate up to 14GiB of memory before it is killed. Here is the complete log of the container before it gets killed:

I0126 13:44:42.589518 1 main.go:45] version: 0.12.1, url-base-path: /, read-only: false
I0126 13:44:42.589639 1 db.go:38] using postgres database
I0126 13:44:43.715195 1 cache.go:130] cache loaded from disk in 1.106531315s
I0126 13:44:43.715499 1 compaction.go:81] compaction worker started
I0126 13:44:43.716125 1 main.go:142] listening on :8080
I0126 13:44:44.716784 1 updater.go:54] worker iteration for krxa44eq
I0126 13:44:53.716464 1 compaction.go:92] compaction iteration started

Here is the graph of memory usage: [memory usage screenshot]

We set an 8GiB memory limit on the Coroot container.
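For reference, the limit is applied through the container's resources, roughly like this (a minimal sketch; the image tag, request value, and labels are illustrative and not our exact manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: coroot
  namespace: coroot
spec:
  replicas: 1
  selector:
    matchLabels:
      app: coroot
  template:
    metadata:
      labels:
        app: coroot
    spec:
      containers:
        - name: coroot
          image: ghcr.io/coroot/coroot:0.12.1   # tag shown for illustration
          resources:
            requests:
              memory: "2Gi"   # request value is illustrative
            limits:
              memory: "8Gi"   # the container is OOM-killed once it exceeds this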

Before we set the memory limit, the container allocated up to 24GiB of memory.

We tried both SQLite and PostgreSQL, and there was no difference in behavior.

Our GKE cluster version is v1.24.5-gke.600.

We have 22 Nodes, 154 Deployments, 25 DaemonSets, and 12 StatefulSets, which together run 857 Pods.

About this issue

  • State: open
  • Created a year ago
  • Comments: 25 (13 by maintainers)

Most upvoted comments

Hi @YoranSys, please update to version 0.14.9. We expect a noticeable reduction in memory consumption and improved UI responsiveness.

Coroot v0.22+ should work much better on large clusters. Thank you, @wenhuwang, for the assistance.

Related releases:

  • https://github.com/coroot/coroot/releases/tag/v0.22.0
  • https://github.com/coroot/coroot/releases/tag/v0.22.1

@wenhuwang, please upgrade Coroot using the latest Helm chart:

  • the new version of node-agent (1.14.2) is expected to report significantly fewer container_net_tcp_* metrics
  • the dedicated Prometheus job for the node-agent should prevent the generation of “new” metrics due to agent rollouts

We expect much lower CPU and memory consumption within an hour or two after the upgrade. Since Coroot queries metrics from Prometheus over a 1-hour time window, these changes take effect once the “old” metrics fall out of that window. Alternatively, you can delete the historical data from Prometheus after upgrading the chart, if that is acceptable in your case.
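For anyone configuring the dedicated node-agent job manually instead of through the chart, a scrape config along these lines is one way to do it (a sketch only; the job name, the pod label selector, and the instance relabeling are assumptions rather than the exact configuration shipped with the chart):

scrape_configs:
  - job_name: coroot-node-agent        # dedicated job for the node-agent
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only node-agent pods; the label and value below are assumptions,
      # adjust them to match the labels on your coroot-node-agent DaemonSet.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: coroot-node-agent
      # Use the node name as the instance label so that agent pod restarts
      # during rollouts do not create new time series (assumed intent).
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: instance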

@apetruhin Hi, I installed Coroot version 0.21.0. My cluster has 44 nodes and 6610 pods, and Coroot used 15 CPU cores and 50GB of memory. [screenshot]

More importantly, the Coroot UI shows no data at all. [screenshot]

The Coroot-related pods are all healthy, and there are no error logs in the Coroot or Prometheus pods. The Prometheus configuration is also correct, so I’m not sure what is causing the “An error has been occurred while querying Prometheus” message.

We’re continuing to work on reducing memory consumption. Please update to version 0.14.7 to get the Postgres tab fixed.

Hi @YoranSys, please try version 0.14.6.

Thank you. We have more optimizations coming soon.