grafana-operator: [Bug] v5 memory usage

Describe the bug We’ve started to run v5 rc2 in parallel with v4 in one of our production clusters, and during the weekend it was OOMKilled. There is currently only one dashboard and one datasource, and I see nothing strange in the --previous log.

Both the memory limit and request are set to 256Mi, and it seems to take roughly two days to OOM. At the same time, the grafana pod’s memory usage increases at the same rate.

Anyone else seeing the same behaviour?

Version v5.0.0-rc2

Screenshots: [Screenshot 2023-05-08 at 07 57 54]

From kubectl describe:

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 05 May 2023 09:09:36 +0200
      Finished:     Sun, 07 May 2023 02:48:32 +0200

Runtime (please complete the following information):

  • K8s: OCP 4.11.33
  • OS: RHCOS

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 20 (7 by maintainers)

Most upvoted comments

I’ve found that there’s a goroutine leak in our implementation of the Grafana client: keep-alives are not disabled, so each API request leads to an extra HTTP connection. The number of connections grows into the thousands, which is why we see a higher memory footprint for both the operator and Grafana. I’ll open a PR to fix that.

[pprof profile image: profile002]
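
For context, here is a minimal Go sketch of the pattern described above (not the actual grafana-operator client code): building a fresh http.Client and Transport for every API call while keep-alives are enabled leaves each idle connection, together with its background goroutines, alive after the request finishes. The function names and the Grafana URL below are illustrative assumptions; disabling keep-alives, or reusing one long-lived shared client, is one way to close that kind of leak.

    // Minimal sketch (not grafana-operator's actual client) of the keep-alive
    // connection leak described above, plus one possible fix.
    package main

    import (
    	"fmt"
    	"io"
    	"net/http"
    	"time"
    )

    // leakyGet builds a fresh Transport on every call. With keep-alives enabled
    // (the default), each idle connection and its background goroutines survive
    // after the request, so repeated calls keep piling up connections.
    func leakyGet(url string) error {
    	client := &http.Client{Transport: &http.Transport{}, Timeout: 10 * time.Second}
    	resp, err := client.Get(url)
    	if err != nil {
    		return err
    	}
    	defer resp.Body.Close()
    	_, err = io.Copy(io.Discard, resp.Body)
    	return err
    }

    // fixedGet disables keep-alives so the connection is torn down once the body
    // is drained; reusing one long-lived shared client would also avoid the leak.
    func fixedGet(url string) error {
    	client := &http.Client{
    		Transport: &http.Transport{DisableKeepAlives: true},
    		Timeout:   10 * time.Second,
    	}
    	resp, err := client.Get(url)
    	if err != nil {
    		return err
    	}
    	defer resp.Body.Close()
    	_, err = io.Copy(io.Discard, resp.Body)
    	return err
    }

    func main() {
    	// "http://grafana:3000/api/health" is a placeholder endpoint.
    	if err := fixedGet("http://grafana:3000/api/health"); err != nil {
    		fmt.Println("request failed:", err)
    	}
    }

In a goroutine profile, this kind of leak typically shows up as many goroutines parked in net/http.(*persistConn).readLoop and writeLoop.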

@weisdd,

We have confirmed that the memory usage of quay.io/weisdd/grafana-operator:v5.0.0-goroutine-leak does not increase dramatically, thanks for the help.

@czchen great, thanks for the test! 😃

@smuda @czchen Due to the severity of the bug, I think we’ll have a new official release after my PR is merged. For now, it would be great if you could try this image build: quay.io/weisdd/grafana-operator:v5.0.0-goroutine-leak. It’s built for linux/arm64, linux/arm/v7, and linux/amd64.

Well, that ain’t good, thanks for reporting this @smuda. I haven’t had time to run the operator for longer periods of time, so I haven’t seen it myself, but it’s definitely something we need to look into.

Please share any findings you make around this.