risingwave: bug: deterministic recovery test failure in main cron

Describe the bug

Two issues found in https://buildkite.com/risingwavelabs/main-cron/builds/310#0185bd0f-6cd7-454d-b511-bc5cf7f3c074.

  1. The deterministic-recovery-test seems flaky. The test exited with status -1 (agent lost) in Buildkite: agent.lost: An agent has been marked as lost. This happens when Buildkite stops receiving pings from the agent; see https://buildkite.com/docs/apis/webhooks/agent-events for more details. @zwang28 guessed that the pings were lost because the recovery test was draining the CPU.
  2. cannot allocate memory:
fatal error: runtime: cannot allocate memory
runtime stack:
runtime.throw({0xd0c93c, 0x2030000})
	/usr/local/go/src/runtime/panic.go:1198 +0x71
runtime.persistentalloc1(0x4000, 0xc000700800, 0x14e61a8)
	/usr/local/go/src/runtime/malloc.go:1417 +0x24f
runtime.persistentalloc.func1()
	/usr/local/go/src/runtime/malloc.go:1371 +0x2e
runtime.persistentalloc(0x14cafc8, 0xc000180000, 0x40)
	/usr/local/go/src/runtime/malloc.go:1370 +0x6f
runtime.(*fixalloc).alloc(0x14e1808)
	/usr/local/go/src/runtime/mfixalloc.go:80 +0x85
runtime.(*mheap).allocMSpanLocked(0x14cafc0)
	/usr/local/go/src/runtime/mheap.go:1078 +0xa5
runtime.(*mheap).allocSpan(0x14cafc0, 0x4, 0x1, 0x0)
	/usr/local/go/src/runtime/mheap.go:1192 +0x1b7
runtime.(*mheap).allocManual(0x0, 0x0, 0x0)
	/usr/local/go/src/runtime/mheap.go:949 +0x1f
runtime.stackalloc(0x8000)
	/usr/local/go/src/runtime/stack.go:409 +0x151
runtime.malg.func1()
	/usr/local/go/src/runtime/proc.go:4224 +0x25
runtime.persistentalloc.func1()

To Reproduce

No response

Expected behavior

No response

Additional context

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 3
  • Comments: 20 (20 by maintainers)

Most upvoted comments

In the performance test and the longevity test we use on-demand instances, and there have been no agent-lost failures so far, so those results should be credible.

What is meant by “spot instance being actively recycled”?

To save money, we now use spot instances to run CI. If capacity is insufficient, the spot instance is reclaimed by AWS, which leads to the agent being lost. We’ve added retries, but only up to 2 retries.
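
For reference, such a retry rule is set per step in the pipeline YAML. A minimal sketch, assuming a hypothetical step label and script path (both placeholders, not the actual pipeline definition), that retries on the exit status -1 Buildkite reports when an agent is lost:

steps:
  - label: "deterministic-recovery-test"                       # hypothetical step label
    command: "ci/scripts/run-deterministic-recovery-test.sh"   # placeholder path
    retry:
      automatic:
        - exit_status: -1   # Buildkite reports -1 when the agent is lost
          limit: 2          # matches the "only up to 2 retries" above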

Maybe we can increase the agent timeout (assuming the agent was not killed). [Screenshot of the Buildkite configuration page attached, 2023-02-09.]

https://buildkite.com/docs/agent/v3/configuration
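
If what actually needs raising is the job-level timeout rather than the agent heartbeat (an assumption on my part), that is a per-step setting in the pipeline YAML; timeout_in_minutes is a standard Buildkite step attribute, and the values here are only placeholders:

steps:
  - label: "deterministic-recovery-test"                       # hypothetical step label
    command: "ci/scripts/run-deterministic-recovery-test.sh"   # placeholder path
    timeout_in_minutes: 60                                     # placeholder value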

The scaling test no longer hits “agent lost” (but it now times out instead).

#8374

The EBS reads look strange?

That matches my guess:

which reminds me that when I test RW manually, it sometimes runs out of memory and the EC2 instance becomes unresponsive.

In this case, memory is running out, so the system starts frequently swapping pages of the code segment in and out. That’s why I suggested using a larger machine.

Is it possible to get the memory usage?