risingwave: bug: nexmark recovery test in scaling test (deterministic simulation) fails
Describe the bug
One of many examples from main: https://buildkite.com/risingwavelabs/main/builds/3797#01877b12-330d-408c-8d5e-06616cdee163
One example from main-cron: https://buildkite.com/risingwavelabs/main-cron/builds/426
To Reproduce
No response
Expected behavior
No response
Additional context
No response
About this issue
- State: closed
- Created a year ago
- Comments: 15 (15 by maintainers)
New error found in this run: https://buildkite.com/risingwavelabs/main/builds/3902
The root cause could be: meta was killed and restarted after a downtime of 20s+. A new meta was elected, but the RPC service in the new leader had not finished initializing when it received the subscribe request. There is retry logic in subscribe, but the config max_heartbeat_interval_secs is 10s in the scaling test, which is the same as the lease expiry duration and is used as the retry upper bound. Since the possible kill downtime is set to 20s+, subscribe can reach the retry upper bound while the leader meta node is still not running. 🥵 Cc @shanicky
For this error, on second thought I suspect there may be a problem and the panic shouldn't be allowed. The register_new and activate functions should be called back to back, without a long gap between them in the simulation. cc @yezizp2012 Do you have any idea why this happened?
Now max_heartbeat_interval_secs has been adjusted to 15s in the scaling recovery test, and the restart delay is adjusted to 20s. Currently, the CI on the main branch looks good.
Madsim supports auto restarting nodes on panic. We used to enable it when error handling on CN was not completed, but it was soon disabled. Currently the test fails on any panic. 🥵
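To make the subscribe timing argument from earlier in this thread concrete, here is a minimal sketch in simulated time. It is not RisingWave code: the function name, the fixed 1s retry interval, and the assumption that retries start at the moment meta goes down are all hypothetical and chosen only for illustration.

```python
def subscribe_with_retry(service_ready_at: float,
                         retry_budget_secs: float,
                         retry_interval_secs: float = 1.0) -> bool:
    """Return True if some retry attempt lands after the service is ready.

    Hypothetical model: time is simulated, attempts happen at
    0, interval, 2*interval, ... and stop once the retry budget
    (the role played by max_heartbeat_interval_secs in the thread)
    is exceeded.
    """
    elapsed = 0.0
    while elapsed <= retry_budget_secs:
        if elapsed >= service_ready_at:
            # The new leader's RPC service is up; subscribe succeeds.
            return True
        elapsed += retry_interval_secs
    # Retry budget exhausted while the leader was still unavailable.
    return False


# Retry bound of 10s against a 20s+ outage: every attempt fails.
print(subscribe_with_retry(service_ready_at=22.0, retry_budget_secs=10.0))
# An outage shorter than the retry bound is survived.
print(subscribe_with_retry(service_ready_at=5.0, retry_budget_secs=10.0))
```

Under these assumptions, a retry bound shorter than the possible downtime cannot guarantee that subscribe ever reaches an initialized leader, which matches the failure described above.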
I see several unwrap() calls during compute node/frontend/compactor startup. They are kind of reasonable because it's cheap to restart an uninitialized node.
Example 1 is a madsim bug that was fixed last Friday.
For Example 2, it seems the error should be ignored rather than causing a panic.