risingwave: OOM for sysbench select random limits (Hummock read duration seems abnormal)

Describe the bug

https://buildkite.com/risingwave-test/sysbench/builds/554#018be963-e823-4393-aad6-38255d87bcdd/1121 (image: 20231119)

https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus: test-useast1-eks-a&from=1700429709000&to=1700430303000&var-namespace=sysbench-daily-test-20231119

Dashboard screenshots attached: SCR-20231120-is9, SCR-20231120-isi.

The read duration seems abnormal.

The previous tests were all good; e.g. the latest successful one used image 20231116: https://buildkite.com/risingwave-test/sysbench/builds/553#018bd9f0-9c26-4472-a129-e0cb237ecf39

Dashboard: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?from=1700169909000&orgId=1&to=1700171403000&var-datasource=P2453400D1763B4D9&var-namespace=sysbench-daily-test-20231116&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All

According to https://github.com/risingwavelabs/rw-commits-history#nightly-20231119, none of the PRs merged between 20231116 and 20231119 looks likely to have caused this.

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

IIUC, we have implemented a mechanism at the executor level to kill a batch query that risks running out of memory. Therefore, an OOM is unexpected, although the root cause of the OOM does not necessarily lie in this mechanism.
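
For reference, the idea behind that mechanism, as a minimal sketch only (the `QueryMemoryContext` type and method names below are illustrative, not RisingWave's actual API): executors reserve bytes against a per-query quota before buffering data, and the query is cancelled when a reservation would exceed the quota.

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

/// Hypothetical per-query memory quota. Executors reserve bytes before
/// buffering rows; if a reservation would exceed the limit, the query is
/// aborted instead of letting the whole process run into an OOM kill.
#[derive(Clone)]
struct QueryMemoryContext {
    used: Arc<AtomicUsize>,
    limit: usize,
}

struct MemoryLimitExceeded {
    used: usize,
    requested: usize,
    limit: usize,
}

impl QueryMemoryContext {
    fn new(limit: usize) -> Self {
        Self { used: Arc::new(AtomicUsize::new(0)), limit }
    }

    /// Try to account `bytes` against the quota; the caller should cancel
    /// the batch query when this returns an error.
    fn try_reserve(&self, bytes: usize) -> Result<(), MemoryLimitExceeded> {
        let mut cur = self.used.load(Ordering::Relaxed);
        loop {
            let new = cur + bytes;
            if new > self.limit {
                return Err(MemoryLimitExceeded { used: cur, requested: bytes, limit: self.limit });
            }
            match self.used.compare_exchange_weak(cur, new, Ordering::Relaxed, Ordering::Relaxed) {
                Ok(_) => return Ok(()),
                Err(actual) => cur = actual,
            }
        }
    }

    fn release(&self, bytes: usize) {
        self.used.fetch_sub(bytes, Ordering::Relaxed);
    }
}

fn main() {
    // A 64 MiB quota for one query: the second large reservation is rejected,
    // which is where the executor would return an error and cancel the query.
    let ctx = QueryMemoryContext::new(64 << 20);
    assert!(ctx.try_reserve(48 << 20).is_ok());
    match ctx.try_reserve(32 << 20) {
        Ok(()) => println!("reserved"),
        Err(e) => println!(
            "would cancel query: {} used + {} requested > {} limit (bytes)",
            e.used, e.requested, e.limit
        ),
    }
    ctx.release(48 << 20);
}
```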

I saw https://github.com/risingwavelabs/risingwave/pull/13132 was merged on 20231115; I wonder whether it may have had some impact.

About this issue

  • Original URL
  • State: closed
  • Created 7 months ago
  • Reactions: 1
  • Comments: 23 (21 by maintainers)

Most upvoted comments

It seems that v1.4.0 does not have this issue. https://buildkite.com/risingwave-test/sysbench/builds/569

nightly-20231121

The sysbench 32c64g pod affinity configuration (all pods) is as follows (screenshots attached).

CN still OOMs! However, we did not see the memory metric exceed the limit on Grafana. I guess the rapid increase in CN memory meant we could not collect useful metrics in time.


https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=P2453400D1763B4D9&var-namespace=jianwei-sysbench-20231121&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All&from=1700639051172&to=1700640236739

Lots of memory held by Hummock iteration (heap profile screenshot attached).

1700467603-2023-11-20-08-06-42.auto.heap.collapsed.zip

Seems to leak somewhere. This is critical and let’s prioritize it.

Might be related to #9732.
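
To make the “held by iteration” reading concrete: a heap profile attributes live allocations to the call stack that made them, so blocks pinned by still-alive iterators show up under the iteration path. A hypothetical sketch of how retained iterators look like a leak (the types below are illustrative stand-ins, not Hummock's actual code):

```rust
use std::sync::Arc;

// Illustrative stand-ins only: a cached data block and an iterator that pins
// the blocks it reads for as long as it stays alive.
struct Block(Vec<u8>);

struct BlockIter {
    pinned: Vec<Arc<Block>>,
}

fn open_iter() -> BlockIter {
    // Each iterator pins a few 64 KiB blocks while it is open
    // (sizes kept small here just to make the example cheap to run).
    let pinned = (0..4).map(|_| Arc::new(Block(vec![0u8; 64 << 10]))).collect();
    BlockIter { pinned }
}

fn main() {
    // If finished iterators are accidentally retained (kept in a collection,
    // captured by a task that never completes, cached, ...), the blocks they
    // pin are never freed, and the heap profile shows the memory as still
    // held by the iteration code path.
    let mut retained: Vec<BlockIter> = Vec::new();
    for _ in 0..100 {
        let it = open_iter();
        // ... read from `it` ...
        retained.push(it); // dropping `it` here instead would release the blocks
    }
    let bytes: usize = retained
        .iter()
        .flat_map(|it| it.pinned.iter())
        .map(|b| b.0.len())
        .sum();
    println!("still pinning {} KiB across {} iterators", bytes >> 10, retained.len());
}
```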

How am I supposed to view the flamegraph? Is this an SVG?

Drop the .collapsed file into https://www.speedscope.app/ and click the top-left “Left Heavy” button.
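
If you prefer the command line, here is a minimal sketch (plain std, assuming the unzipped file is in the usual folded-stack format, i.e. one `frame;frame;...;frame <weight>` per line) that aggregates and prints the heaviest stacks, roughly what the “Left Heavy” view highlights:

```rust
use std::collections::HashMap;
use std::env;
use std::fs;

/// Read a folded-stack (".collapsed") profile and print the heaviest stacks.
fn main() -> std::io::Result<()> {
    let path = env::args().nth(1).expect("usage: top_stacks <file.collapsed>");
    let text = fs::read_to_string(path)?;

    // Sum the weight (last whitespace-separated token) per full stack.
    let mut weights: HashMap<&str, u64> = HashMap::new();
    for line in text.lines() {
        if let Some((stack, count)) = line.rsplit_once(' ') {
            if let Ok(count) = count.trim().parse::<u64>() {
                *weights.entry(stack).or_insert(0) += count;
            }
        }
    }

    let mut ranked: Vec<_> = weights.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1));

    for (stack, weight) in ranked.into_iter().take(20) {
        // Show the leaf frame so the hot call site is easy to spot.
        let leaf = stack.rsplit(';').next().unwrap_or(stack);
        println!("{weight:>14}  {leaf}");
    }
    Ok(())
}
```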
