risingwave: bug: nexmark recovery test in scaling test (deterministic simulation) fails

Describe the bug

One of many examples from main: https://buildkite.com/risingwavelabs/main/builds/3797#01877b12-330d-408c-8d5e-06616cdee163
One example from main-cron: https://buildkite.com/risingwavelabs/main-cron/builds/426

To Reproduce

No response

Expected behavior

No response

Additional context

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

New error found in this run: https://buildkite.com/risingwavelabs/main/builds/3902

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: internal error: service not found: /meta.NotificationService/Subscribe
 0: <risingwave_common::error::RwError as core::convert::From<risingwave_common::error::ErrorCode>>::from
    at /risingwave/src/common/src/error.rs:174:33
 1: <T as core::convert::Into<U>>::into
    at /rustc/28a29282f6dde2e4aba6e1e4cfea5c9430a00217/library/core/src/convert/mod.rs:727:9

The root cause could be: meta was killed and restarted after a downtime of 20s+; a new meta leader was elected, but the RPC service on the new leader hadn't had time to initialize when it received the subscribe request. There is retry logic in subscribe, but in the scaling test the config max_heartbeat_interval_secs is 10s, which is the same as the lease expiration duration and is used as the retry upper bound. Since the possible kill downtime is set to 20s+, subscribe can reach its retry upper bound while the leader meta node is still not running (see the sketch below). 🥵 Cc @shanicky
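A minimal sketch of this timing interaction (subscribe_with_retry and the constants are illustrative, not RisingWave's actual retry code): the retry budget is derived from the heartbeat interval, so a meta downtime longer than that budget exhausts the retries and the caller's unwrap() panics.

use std::time::{Duration, Instant};

// Illustrative constants mirroring the scaling-test config described above.
const MAX_HEARTBEAT_INTERVAL: Duration = Duration::from_secs(10); // retry upper bound
const META_DOWNTIME: Duration = Duration::from_secs(20); // possible kill downtime

// Stand-in for the subscribe RPC: it only succeeds once the new meta
// leader's RPC service is up, i.e. after the downtime has passed.
fn subscribe(elapsed: Duration) -> Result<(), &'static str> {
    if elapsed < META_DOWNTIME {
        Err("service not found: /meta.NotificationService/Subscribe")
    } else {
        Ok(())
    }
}

fn subscribe_with_retry() -> Result<(), &'static str> {
    let start = Instant::now();
    loop {
        match subscribe(start.elapsed()) {
            Ok(()) => return Ok(()),
            // Give up once the retry budget (the heartbeat interval) is
            // spent, even though the leader may still come up later.
            Err(e) if start.elapsed() >= MAX_HEARTBEAT_INTERVAL => return Err(e),
            Err(_) => std::thread::sleep(Duration::from_millis(500)),
        }
    }
}

fn main() {
    // With a 20s downtime and a 10s retry budget this always fails,
    // reproducing the panic in the log above.
    subscribe_with_retry().unwrap();
}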

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: 
GrpcStatus(Status { code: Internal, message: "Worker node does not exist!", metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Thu, 10 Mar 2022 04:31:16 +0000"} }, source: None })', 
src/storage/compactor/src/server.rs:82:49

For this error, on second thought I suspect there is a real problem and the panic shouldn't be possible: the register_new and activate functions should be called back to back, without a long gap between them in the simulation (see the sketch below). cc @yezizp2012 Do you have any idea why this happened?
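To make the suspected failure mode concrete, here is a hypothetical sketch (the MetaClient type and its methods are illustrative, not RisingWave's actual API): if meta expires the worker in the gap between registration and activation, activate fails and the unwrap() panics just like in the compactor log above.

// Illustrative stand-in for the meta client; not RisingWave's actual API.
struct MetaClient;

impl MetaClient {
    // Registers the worker and returns its id; meta starts tracking its
    // heartbeat (and thus its lease) from this point on.
    fn register_new(&self) -> Result<u32, String> {
        Ok(42)
    }

    // Fails if meta has meanwhile expired and removed the worker, e.g.
    // because activation arrived long after registration.
    fn activate(&self, _worker_id: u32) -> Result<(), String> {
        Err("Worker node does not exist!".to_string())
    }
}

fn main() {
    let client = MetaClient;
    let id = client.register_new().unwrap();
    // These two calls are expected to run back to back; a long pause here
    // is exactly the gap suspected above.
    client.activate(id).unwrap(); // panics like src/storage/compactor/src/server.rs:82
}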

Now max_heartbeat_interval_secs has been adjusted to 15s in the scaling recovery test, and the restart delay has been adjusted to 20s. Currently, CI on the main branch looks good.

BTW, may I ask how we handle panics in madsim? Does the test case fail immediately after a panic occurs?

Madsim supports auto-restarting nodes on panic. We used to enable it when error handling on the compute node was not yet complete, but it was soon disabled. Currently the test fails on any panic (see the sketch below). 🥵
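For reference, a minimal sketch of the two modes, assuming madsim's simulation node builder (restart_on_panic is the opt-in; the node name and body are illustrative):

use std::time::Duration;
use madsim::runtime::Handle;

#[madsim::test]
async fn panic_handling() {
    let handle = Handle::current();
    handle
        .create_node()
        .name("compute-node")
        // Opt-in auto restart: with this line the node's init closure is
        // re-run after a panic (the mode we used to enable); without it,
        // the panic propagates and the whole test case fails.
        .restart_on_panic()
        .init(|| async {
            panic!("simulated compute node crash");
        })
        .build();
    // Let the simulated node run (and crash) before the test ends.
    madsim::time::sleep(Duration::from_secs(1)).await;
}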

I see several unwrap() calls during compute node/frontend/compactor startup. They are kind of reasonable, because it's cheap to restart an uninitialized node (see the sketch below).
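A sketch of that rationale, with illustrative names: before a node finishes initialization, failing fast is fine because the supervisor can simply restart it; once it serves traffic, errors should be handled rather than unwrapped.

// Placeholder for a startup RPC, e.g. the initial connection to meta.
fn connect_to_meta() -> Result<(), String> {
    Err("meta not ready".to_string())
}

fn main() {
    // Startup phase: the node holds no state yet, so panicking here just
    // means the uninitialized node gets restarted, which is cheap.
    connect_to_meta().expect("failed to connect to meta during startup");

    // Serving phase would follow; from this point on, errors like a lost
    // meta connection should be retried or reported instead of unwrapped.
}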

Example 1 is a bug in madsim that was fixed last Friday. For example 2, it seems the error should be ignored instead of causing a panic.