risingwave: Longevity test release-1.6 CN OOM

"!!! longevity Result!!!"
Today's date:2024-01-22
Result               FAIL                
Pipeline Message     Test v1.6.1-rc      
TestBed              kubebench/3264g-medium-3cn-all-affinity
RW Version           v1.6.1-rc           
Test Start time      2024-01-21 17:00:08 
Test End time        2024-01-22 05:02:24 
Namespace            usrlngvty-20240121-165055
Queries              nexmark_q0,nexmark_q1,nexmark_q2,nexmark_q3,nexmark_q4,nexmark_q5,nexmark_q7,nexmark_q8,nexmark_q9,nexmark_q10,nexmark_q12,nexmark_q14,nexmark_q15,nexmark_q16,nexmark_q17,nexmark_q18,nexmark_q19,nexmark_q20,nexmark_q21,nexmark_q22,nexmark_q101,nexmark_q102,nexmark_q103,nexmark_q104,nexmark_q105
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=usrlngvty-20240121-165055&from=1705856408000&to=1705899744000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=usrlngvty-20240121-165055&from=1705856408000&to=1705899744000
Memory Dumps to S3   https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&prefix=k8s/usrlngvty-20240121-165055/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/longevity-test/builds/954
Crash Container logs https://rw-qa-artifacts-public.s3.us-east-1.amazonaws.com/longevity/954_logs.txt
Report               https://rw-qa-artifacts-public.s3.us-east-1.amazonaws.com/longevity/954_report.txt


================================================================================
Restarted/Crashed Containers Details 
================================================================================
CONTAINER crashed/Restarted: benchmark-risingwave-compute-c-0 restart_count:1  phase:Running status:True
CONTAINER crashed/Restarted: benchmark-risingwave-compute-c-2 restart_count:1  phase:Running status:True

About this issue

  • State: closed
  • Created 5 months ago
  • Reactions: 2
  • Comments: 17 (16 by maintainers)

Most upvoted comments

It could be because we disabled memtable spill due to the inconsistency issue, which can make backpressure insensitive. Relevant commits: 0119 on release-1.6 https://github.com/risingwavelabs/risingwave/commit/06058f046de38e91078c1d54f880eccfc934f3de and 0112 on main https://github.com/risingwavelabs/risingwave/commit/89a8297ff9082737ccf01f853e2ea3458413de28

This may be due to an extra copy of the block during decoding after #13558. The fix, #14786, has just been merged. Let’s wait for today’s longevity test to see whether the situation improves.

Is it normal that only the Materialize executor takes a large chunk of memory? I cannot find other executors using a comparable amount of memory.

Ohh, indeed. The Materialize executor should not validate data here. Need to dive in.

No, we made a mistake. The ~1.5GB of memory was not taken by the Materialize executor itself, but by its child executors. Rust’s async is based on polling, so the child executors are polled from within Materialize and therefore appear under it in the stack tree.

In the graph below, the highlighted boxes contain the keyword “stream::executor::”, i.e. they are all stream executors.

image

benchmark-risingwave-compute-c-2_1705890487-2024-01-22-02-28-06.auto.heap.collapsed.zip
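
The difference between memory charged to a frame itself and memory charged to the frames beneath it can be checked directly on a collapsed heap profile. The following is a minimal sketch, assuming the usual collapsed-stack layout of one `root;...;leaf value` entry per line and treating the last frame as the allocation site. The sample stacks, frame names, and sizes are hypothetical and are not taken from the attached profile.

```python
from collections import defaultdict


def self_and_inclusive(collapsed_lines):
    """Return (self_bytes, inclusive_bytes) dicts keyed by frame name.

    Each input line is assumed to look like 'frame_a;frame_b;frame_c 123',
    with the root frame first and the allocating frame last.
    """
    self_bytes = defaultdict(int)
    inclusive_bytes = defaultdict(int)
    for line in collapsed_lines:
        line = line.strip()
        if not line:
            continue
        # The value is the last space-separated token; frames may contain spaces.
        stack, _, value = line.rpartition(" ")
        size = int(value)
        frames = stack.split(";")
        # Only the last frame is charged with the allocation itself.
        self_bytes[frames[-1]] += size
        # Every distinct frame on the stack is charged inclusively, once.
        for frame in set(frames):
            inclusive_bytes[frame] += size
    return dict(self_bytes), dict(inclusive_bytes)


# Hypothetical sample: child executors are polled from inside Materialize,
# so their allocations show up underneath it.
SAMPLE = [
    "tokio::poll;stream::executor::Materialize;stream::executor::HashJoin 900",
    "tokio::poll;stream::executor::Materialize;stream::executor::HashAgg 500",
    "tokio::poll;stream::executor::Materialize 100",
]

self_b, incl_b = self_and_inclusive(SAMPLE)
print(incl_b["stream::executor::Materialize"])  # 1500
print(self_b["stream::executor::Materialize"])  # 100
```

In this sample, Materialize looks like it holds 1500 units when read inclusively, but only 100 of them are its own. This is the same kind of misreading as the one described above.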

The process was killed by the OOM killer. You can find the logs here.


memory: usage 13631492kB, limit 13631488kB, failcnt 3407

oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-89ddc6d34df8009759c8a68262451ea93278c07d6330a80b7c8504bc57be1196.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod455a7760_dc45_4b38_b624_977c170dd1d9.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod455a7760_dc45_4b38_b624_977c170dd1d9.slice/cri-containerd-89ddc6d34df8009759c8a68262451ea93278c07d6330a80b7c8504bc57be1196.scope,task=risingwave,pid=733811,uid=0

Memory cgroup out of memory: Killed process 733811 (risingwave) total-vm:44901588kB, anon-rss:13531468kB, file-rss:396436kB, shmem-rss:0kB, UID:0 pgtables:77204kB oom_score_adj:838

CONSTRAINT_MEMCG indicates that the process was killed for exceeding the memory limit of its cgroup.
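
To make the numbers in these kernel messages easier to read, here is a small sketch that pulls out the cgroup usage, the limit, and the killed process, and converts the limit to GiB. The regular expressions are written against the three lines quoted above and are an assumption about their format, not a general kernel log parser. The `summarize_oom` helper is hypothetical, and the `oom-kill` line is truncated in the sample for brevity.

```python
import re

# Patterns matched against the log lines quoted in this issue.
MEMCG_USAGE = re.compile(r"memory: usage (\d+)kB, limit (\d+)kB, failcnt (\d+)")
CONSTRAINT = re.compile(r"oom-kill:constraint=(\w+)")
KILLED = re.compile(
    r"Killed process (\d+) \(([^)]+)\) total-vm:(\d+)kB, "
    r"anon-rss:(\d+)kB, file-rss:(\d+)kB, shmem-rss:(\d+)kB"
)

KIB_PER_GIB = 1024 * 1024


def summarize_oom(log_text):
    """Extract the key figures from a memcg OOM-kill log excerpt."""
    usage = MEMCG_USAGE.search(log_text)
    constraint = CONSTRAINT.search(log_text)
    killed = KILLED.search(log_text)
    if not (usage and constraint and killed):
        raise ValueError("log excerpt does not contain the expected OOM lines")
    usage_kib, limit_kib = int(usage.group(1)), int(usage.group(2))
    return {
        "constraint": constraint.group(1),
        "pid": int(killed.group(1)),
        "process": killed.group(2),
        "limit_gib": limit_kib / KIB_PER_GIB,
        "over_limit_kib": usage_kib - limit_kib,
        "anon_rss_gib": int(killed.group(4)) / KIB_PER_GIB,
    }


# Lines copied from the report; the oom-kill line is cut after nodemask.
LOG = """\
memory: usage 13631492kB, limit 13631488kB, failcnt 3407
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null)
Memory cgroup out of memory: Killed process 733811 (risingwave) total-vm:44901588kB, anon-rss:13531468kB, file-rss:396436kB, shmem-rss:0kB, UID:0 pgtables:77204kB oom_score_adj:838
"""

summary = summarize_oom(LOG)
print(summary["constraint"])      # CONSTRAINT_MEMCG
print(summary["limit_gib"])       # 13.0
print(summary["over_limit_kib"])  # 4
```

The limit works out to exactly 13 GiB, and the reported usage is 4 kB above it at the time of the kill.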