risingwave: bug: deterministic recovery test failure in main cron
Describe the bug
Two issues found in https://buildkite.com/risingwavelabs/main-cron/builds/310#0185bd0f-6cd7-454d-b511-bc5cf7f3c074.
- The deterministic recovery test seems flaky. The test exited with status -1 (agent lost) in Buildkite:
agent.lost: An agent has been marked as lost. This happens when Buildkite stops receiving pings from the agent, see https://buildkite.com/docs/apis/webhooks/agent-events for more details.
@zwang28 guessed that the ping was lost because the recovery test was draining the CPU.
- cannot allocate memory:
fatal error: runtime: cannot allocate memory
runtime stack:
runtime.throw({0xd0c93c, 0x2030000})
/usr/local/go/src/runtime/panic.go:1198 +0x71
runtime.persistentalloc1(0x4000, 0xc000700800, 0x14e61a8)
/usr/local/go/src/runtime/malloc.go:1417 +0x24f
runtime.persistentalloc.func1()
/usr/local/go/src/runtime/malloc.go:1371 +0x2e
runtime.persistentalloc(0x14cafc8, 0xc000180000, 0x40)
/usr/local/go/src/runtime/malloc.go:1370 +0x6f
runtime.(*fixalloc).alloc(0x14e1808)
/usr/local/go/src/runtime/mfixalloc.go:80 +0x85
runtime.(*mheap).allocMSpanLocked(0x14cafc0)
/usr/local/go/src/runtime/mheap.go:1078 +0xa5
runtime.(*mheap).allocSpan(0x14cafc0, 0x4, 0x1, 0x0)
/usr/local/go/src/runtime/mheap.go:1192 +0x1b7
runtime.(*mheap).allocManual(0x0, 0x0, 0x0)
/usr/local/go/src/runtime/mheap.go:949 +0x1f
runtime.stackalloc(0x8000)
/usr/local/go/src/runtime/stack.go:409 +0x151
runtime.malg.func1()
/usr/local/go/src/runtime/proc.go:4224 +0x25
runtime.persistentalloc.func1()
To Reproduce
No response
Expected behavior
No response
Additional context
No response
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 3
- Comments: 20 (20 by maintainers)
Commits related to this issue
- ci: limit parallel to run 32 jobs each time Try to mitigate #7561 — committed to risingwavelabs/risingwave by xxchan a year ago
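As a rough, hypothetical illustration of the idea behind that commit (the real change may look different), capping how many seeded simulation runs execute concurrently could be done with GNU parallel's -j flag; the step label, seed count, and script name below are assumptions:

steps:
  - label: "deterministic-recovery-test"
    command: |
      # Run 64 seeded simulation runs, at most 32 at a time (sketch only;
      # run_recovery_test_with_seed.sh is a hypothetical wrapper script).
      seq 64 | parallel -j 32 './run_recovery_test_with_seed.sh {}'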
In the performance test and the longevity test we use on-demand instances, and there has been no agent lost so far, so that result should be credible.
To save money, we now use spot instances to run CI. When capacity is insufficient, a spot instance gets reclaimed by AWS, which leads to the agent being lost. We've added retries, but only up to 2 retries.
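For reference, a minimal sketch of such an automatic-retry rule in Buildkite pipeline YAML, assuming a step label and script path that are illustrative rather than taken from the actual pipeline:

steps:
  - label: "deterministic-recovery-test"
    command: "ci/scripts/deterministic-recovery-test.sh"   # illustrative path
    retry:
      automatic:
        # Buildkite reports exit status -1 when the job is interrupted,
        # e.g. because the agent was lost; retry it at most twice.
        - exit_status: -1
          limit: 2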
Maybe we can increase the agent timeout (assuming the agent was not killed):
https://buildkite.com/docs/agent/v3/configuration
#8374
That matches my guess:
In this case, memory is running out, so the system starts to frequently swap pages of the code segment in and out. That's why I suggested using a larger machine.
Is it possible to get the memory usage?
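One low-effort way to get that from CI (a sketch, with an illustrative script path) is to log memory in the background while the test runs, so the job log shows usage leading up to an OOM or a lost agent:

steps:
  - label: "deterministic-recovery-test"
    command: |
      # Sample memory every 10 seconds in the background (sketch only).
      (while true; do date; free -m; sleep 10; done) &
      ./ci/scripts/deterministic-recovery-test.sh   # illustrative path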
More tests failed today. https://buildkite.com/risingwavelabs/main-cron/builds/344#0186485b-9e8d-417b-bdba-f779cdaba886