coroot: High memory usage and OOM loop

Hi,

we are experiencing high memory usage and an OOM kill loop while running Coroot in our GKE cluster. The Coroot container tries to allocate up to 14GiB of memory before it is killed. Here is the complete log of the container before it gets killed:

I0126 13:44:42.589518 1 main.go:45] version: 0.12.1, url-base-path: /, read-only: false
I0126 13:44:42.589639 1 db.go:38] using postgres database
I0126 13:44:43.715195 1 cache.go:130] cache loaded from disk in 1.106531315s
I0126 13:44:43.715499 1 compaction.go:81] compaction worker started
I0126 13:44:43.716125 1 main.go:142] listening on :8080
I0126 13:44:44.716784 1 updater.go:54] worker iteration for krxa44eq
I0126 13:44:53.716464 1 compaction.go:92] compaction iteration started

Here is the graph of memory usage: [memory usage screenshot]

We set an 8GiB memory limit on the Coroot container.
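For reference, the limit is applied through the container's resources, roughly like this (a minimal sketch; the image tag, request value, and labels are illustrative and not our exact manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: coroot
  namespace: coroot
spec:
  replicas: 1
  selector:
    matchLabels:
      app: coroot
  template:
    metadata:
      labels:
        app: coroot
    spec:
      containers:
        - name: coroot
          image: ghcr.io/coroot/coroot:0.12.1   # tag shown for illustration
          resources:
            requests:
              memory: "2Gi"   # request value is illustrative
            limits:
              memory: "8Gi"   # the container is OOM-killed once it exceeds this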

Before we set the memory limit, the container allocated up to 24GiB of memory.

We tried both SQLite and PostgreSQL, and there was no difference in behavior.

Our GKE cluster version is v1.24.5-gke.600.

We have 22 Nodes, 154 Deployments, 25 DaemonSets, and 12 StatefulSets, which together run 857 Pods.

About this issue

  • State: open
  • Created a year ago
  • Comments: 25 (13 by maintainers)

Most upvoted comments

Hi @YoranSys, please update to version 0.14.9. We expect a noticeable reduction in memory consumption and improved UI responsiveness.

Coroot v0.22+ should work much better on large clusters. Thank you, @wenhuwang, for the assistance.

Related releases:

  • https://github.com/coroot/coroot/releases/tag/v0.22.0
  • https://github.com/coroot/coroot/releases/tag/v0.22.1

@wenhuwang, please upgrade Coroot using the latest Helm chart:

  • the new version of node-agent (1.14.2) is expected to report significantly fewer container_net_tcp_* metrics
  • the dedicated Prometheus job for the node-agent should prevent the generation of “new” metrics due to agent rollouts

We expect much lower CPU and memory consumption within an hour or two after the upgrade. Since Coroot queries metrics from Prometheus over a 1-hour time window, these changes take effect once the “old” metrics fall out of that window. Alternatively, you can delete the historical data from Prometheus after upgrading the chart, if that is acceptable in your case.
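For anyone configuring the dedicated node-agent job manually instead of through the chart, a scrape config along these lines is one way to do it (a sketch only; the job name, the pod label selector, and the instance relabeling are assumptions rather than the exact configuration shipped with the chart):

scrape_configs:
  - job_name: coroot-node-agent        # dedicated job for the node-agent
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only node-agent pods; the label and value below are assumptions,
      # adjust them to match the labels on your coroot-node-agent DaemonSet.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: coroot-node-agent
      # Use the node name as the instance label so that agent pod restarts
      # during rollouts do not create new time series (assumed intent).
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: instance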

@apetruhin Hi, I installed Coroot version 0.21.0. My cluster has 44 nodes and 6610 pods, and Coroot used 15 CPU cores and 50GB of memory. [screenshot]

More importantly, the Coroot UI shows no data at all. [screenshot]

The Coroot-related pods are all healthy, and there are no error logs in the Coroot or Prometheus pods. The Prometheus configuration is also correct, so I’m not sure what is causing the “An error has been occurred while querying Prometheus” message.

We’re continuing to work on reducing memory consumption. Please update to version 0.14.7 to get the Postgres tab fixed.

Hi @YoranSys, please try version 0.14.6.

Thank you. We have more optimizations coming soon.