openraft: Example raft-kv-memstore hangs after printing change-membership
Took the latest checkout of main (commit hash: 347aca11c913b814bba77cfee6f9635c03b353e3) and ran
raft-kv-memstore$ ./test-cluster.sh
But it hangs after reaching this point:
… Changing membership from [1] to 3 nodes cluster: [1, 2, 3]
— rpc(:21001/change-membership, [1, 2, 3])
I see that three processes are running; the first is using 100% CPU and the other two are almost idle. I could not access the Discord channel (for some reason it does not open), hence I am reporting it here. If required, I can share any specific logs. P.S.: The example raft-kv-rocksdb works fine.
About this issue
- State: closed
- Created 2 years ago
- Comments: 25 (11 by maintainers)
Commits related to this issue
- Fix: workaround cargo leaking SSL_CERT_FILE issue On Linux: command `cargo run` pollutes environment variables: It leaks `SSL_CERT_FILE` and `SSL_CERT_DIR` to the testing sub progress it runs. Which ... — committed to drmingdrmer/openraft by drmingdrmer 2 years ago
Nice catch. It does not happen in Artix, probably because it uses a more recent version of OpenSSL. I will check as soon as I can get to the computer.
This problem is caused by a cargo issue that affects the `openssl` crate.
The example `example/raft-key-value-memstore` uses `reqwest` to send an RPC to the raft service. `reqwest` depends on `native-tls`, and `native-tls` depends on `openssl`.
When `openssl::SslConnector::builder()` is called, it loads certificate files if the env variable `SSL_CERT_FILE` is set: https://github.com/sfackler/rust-openssl/blob/eaee383429d156bd91c4a188ba57cf1747c2e440/openssl/src/ssl/mod.rs#L889-L896
`strace` shows the time spent on loading certificates:
And on Linux (at least on Ubuntu), the command `cargo run` pollutes environment variables: it leaks `SSL_CERT_FILE` and `SSL_CERT_DIR` to the testing subprocess it runs.
Finally, every time a `reqwest::Client` is created, it spends several dozen milliseconds loading a bunch of certificates, which times out a raft RPC (50 ms).
I created a mini demo showing that on Linux, `cargo xxx` slows down `openssl`-based programs: https://github.com/drmingdrmer/test-native-tls-rs
On Linux:
On my M1 Mac:
Related issues:
Works under cargo test, but not when harness is executed directly? https://github.com/sfackler/rust-openssl/issues/575
Cargo appears to leak SSL_CERT_FILE and SSL_CERT_DIR to subprocesses https://github.com/rust-lang/cargo/issues/3676
Cargo binaries are not run in pristine environment https://github.com/rust-lang/cargo/issues/2888
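The leak described above can also be contained on the launching side. The following is a minimal sketch, not the fix that was actually committed: it assumes the node binaries are started from a Rust harness and removes the two variables from the child's environment with `std::process::Command::env_remove`. The helper name `scrubbed_command` and the binary path are hypothetical.

```rust
use std::env;
use std::process::Command;

/// The two variables that `cargo run` is reported to leak to subprocesses.
const LEAKED_VARS: [&str; 2] = ["SSL_CERT_FILE", "SSL_CERT_DIR"];

/// Build a `Command` whose child process will not inherit the leaked variables.
/// (Hypothetical helper, for illustration only.)
fn scrubbed_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    for var in LEAKED_VARS {
        // Marks the variable as explicitly removed for the child,
        // even if it is set in the current (cargo-spawned) process.
        cmd.env_remove(var);
    }
    cmd
}

fn main() {
    // Report whether this process itself inherited the variables.
    for var in LEAKED_VARS {
        println!("{} set in this process: {}", var, env::var_os(var).is_some());
    }

    // Hypothetical path to a node binary; the command is only built, not spawned.
    let cmd = scrubbed_command("./target/debug/raft-key-value");
    for (key, value) in cmd.get_envs() {
        // A value of `None` means the variable will be removed for the child.
        println!("child env override: {:?} -> {:?}", key, value);
    }
}
```

Running this once directly and once through `cargo run` should show whether the variables are present in each case, which is the same comparison the mini demo makes.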
In all of my tests, it happened whenever the heartbeat was configured to any value lower than 101 milliseconds. It also only happened on Pop!_OS when I tried. Unfortunately, I didn’t find the time to try other Ubuntu- or Debian-based distributions. It definitely wasn’t a problem on Artix.
Third test environment:
No issues. It is probably something with Ubuntu; I will try to run this last environment with a live distro or VM if I find the time.
There is definitely something wrong with the environment. I ran it in another environment with zero issues; these are my details:
Legend:
I will try on another computer, also with Artix Linux, kernel 6.0.x, and an AMD 3950X CPU, in a few minutes.
I am experiencing the same issue: the test just hangs after showing the membership-change message. There is 0 ms of delay, as I have no proxies or anything else configured on this computer. The `cluster_test.rs` test also hangs at the same point.
Like @vishwaspai, if I increase the `heartbeat_timeout` to some value higher than 100, it works (the test fails because the last read from node 3 returns an empty string instead of bar, but it works fine in the shell script). I have no clue what is going on, but the `100` value barrier feels quite suspicious to me; I will try to test in other environments.
Attached are the logs of the three nodes plus my Cargo.lock, in case they can be of any help.
raft-kv-memstore-logs-and-lock.tar.gz
I could not find much with additional debugging. With a heartbeat above `100ms`, things work fine. Below this, I see that the `raft-append` RPC times out. I tried some tweaking (number of workers, etc.) on actix_web, but I do not see much change in behavior.
For now, I passed `-w '\nConnect: %{time_connect}s\nXfer : %{time_starttransfer}s\nTotal : %{time_total}s\n'` to curl (instead of using `time`), and the following is the result.
Based on the results, I see that `add-learner` is taking time. The other curl requests are fast. At least, I don’t see this as a networking issue. Anyway, I’ll see if I can spend time on single-stepping.
Added all the information in the attached tar.
raft-kv-memstore-logs-ac48309.tar.gz
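The per-request timing done with curl above can be mirrored inside a Rust program. The following is a rough sketch, assuming a step is just a closure and using the 50 ms raft RPC timeout mentioned elsewhere in this thread as the budget; `timed` and `exceeds_rpc_timeout` are hypothetical helper names, and the sleeps only stand in for real work.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Raft RPC timeout reported in this thread (50 ms).
const RPC_TIMEOUT: Duration = Duration::from_millis(50);

/// Run a step and return its result together with the wall-clock time it took.
/// (Hypothetical helper, for illustration only.)
fn timed<T>(step: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = step();
    (out, start.elapsed())
}

/// True if a step took longer than the RPC timeout budget.
fn exceeds_rpc_timeout(elapsed: Duration) -> bool {
    elapsed > RPC_TIMEOUT
}

fn main() {
    // Stand-ins for a cheap step and an expensive one (e.g. client setup).
    let (_, fast) = timed(|| sleep(Duration::from_millis(5)));
    let (_, slow) = timed(|| sleep(Duration::from_millis(80)));

    println!("fast step: {:?}, over budget: {}", fast, exceeds_rpc_timeout(fast));
    println!("slow step: {:?}, over budget: {}", slow, exceeds_rpc_timeout(slow));
}
```

Wrapping client construction and the request separately in `timed` would show which of the two consumes the budget.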
I have not yet found what is going on from the log.
I updated the main branch to let the examples output a more detailed log. Could you rerun the test with the latest main branch, ac4830923bb828288f8a33538991a2964658821a? Then let’s see what’s going on.
And what are your OS and rust-toolchain?
And could you attach the `examples/raft-kv-memstore/Cargo.lock` that is built when running `test-cluster.sh`, so that I can check the dependency crate versions?
No. I’m running on an i7-12700H with 14 cores.
Attached logs
raft-kv-memstore-logs.tar.gz
The logs will be helpful. After running `./test-cluster.sh`, there should be 3 logs in the dir `examples/raft-kv-memstore`: `n1.log`, `n2.log`, `n3.log`.
Please attach these logs and let me look at what’s happening.
Thanks for letting me know about this issue :D