solana: rocksdb occasionally segfaults in tests

Problem

For tests which spin up a LocalCluster instance, Rocksdb is occasionally segfaulting.

Procedure to reproduce:

for i in $(seq 1000); do
  RUSTFLAGS='-g -C force-frame-pointers=yes -C target-cpu=native' \
  cargo +1.60.0 test --jobs 29 --package solana-cli \
    --test program --test stake --test transfer -- --nocapture
  echo "REP: $(date +%H:%M), ${i}"
done &>> ~/cargo-test-output.txt &

Then monitor with

tail -f ~/cargo-test-output.txt | grep '\(REP:\|SIGSEGV\)'

It will catch output like:

process didn't exit successfully: `~/solana/target/debug/deps/stake-c0e03b44927935e9 --nocapture` (sign
al: 11, SIGSEGV: invalid memory reference)

The frequency of segfaults might be machine dependent. I have not been able to reproduce locally but it does segfaults on gce machines. On master commit: 655b40a2b7ae43f5575c2141ceb0fdb322f22bf1 it will segfaults like 2-3 times out of 1000 iterations.

I am attaching some backtraces as well: bt-master-00.txt bt-master-01.txt bt-master-02.txt bt-master-full-00.txt bt-master-full-01.txt bt-master-full-02.txt

All point to the same spot:

#0  0x0000000000000000 in ?? ()
#1  0x000055a5023be0ec in rocksdb::UserComparatorWrapper::Compare (this=0x7efd1b1da4e0, a=..., b=...)
    at rocksdb/util/user_comparator_wrapper.h:33

Bisect Progress (Update in Place)

Below, GOOD means the issue did NOT reproduce whereas BAD means it did reproduce with the given commit. Note that this method isn’t completely full-proof as there is obviously some nondeterministic behavior and as such, there is potential for a false GOOD rating if the issue doesn’t occur in the 1,000 iterations we decided on.

commit   - result (verifiers where B = Behzad, J = Jeff, S = Steve, Y = Yueh Hsuan) - commit date
v1.9.18  - GOOD (B)
v1.10.24 - GOOD (B)
v1.10.25 - GOOD (S)
a829ddc922751b909a5922845b6cdd46e984487b - GOOD (J) - Sun May 1 13:11:45 2022 +0000
032a2b82157cc6792507ad096093bd50ab5f0065 - GOOD (Y) - Sun May 8 10:11:10 2022 -0700
0514e83fa2e7444bbaeb11cdc3567dc02bb320b8 - GOOD (Y) - Wed May 11 19:08:11 2022 -0500
fc793de2ddb9b97100e6f00c2f69a1a78f871de5 - GOOD (Y) - Thu May 12 14:48:29 2022 -0500
295df294de04e2daf0744820af0ffa4f39af7130 - GOOD (Y) - Thu May 12 20:28:50 2022 -0600
6c10515ccf86f4404ac0889da06c1b0c7746c17c - GOOD (Y) - Sun May 15 20:04:17 2022 +0800
0820065c98fc6166008bf32e84d4adfaf12584c8 - BAD (Y) - Tue May 17 21:02:43 2022 +0200 
dcce90555bb26493b3379801001bf59b46a3a3fe - GOOD? (Y) - Wed May 18 11:44:35 2022 -0700
4e9b1891544227aa6330efb3586b54fc5a18102e - GOOD? (J) - Fri May 20 16:05:42 2022 -0700
e02537671963cd5c8f2ad7d4b70df96776f8dfd5 - GOOD? (S) - Fri May 20 17:00:13 2022 -0700
d65aab4d275b498b43bf5f087f0b58202b383911 - BAD  (S) - Fri May 20 20:56:06 2022 -0600
7fb0ef1fa5b6815d67d74f4ef19ad4c91bb81584 - BAD  (S) - Fri May 20 21:20:47 2022 -0700
773b2f23f4b327b96e7c0a28a2d47ee516f3f048 - BAD  (Y) - Sun May 22 01:09:26 2022 +0800
0376ab41a9b714b4792740556bb77590c6576876 - BAD  (J,Y) - Tue May 24 10:39:33 2022 -0700

Proposed Solution

So far I have not seen a segfault on v1.9.28 or v1.10.24. So some git bisecting may help to find the culprit. The process is very slow though, have to run tests like 1000 times to get few segfaults.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 38 (38 by maintainers)

Most upvoted comments

I must have been doing something dumb; was able to reproduce, although I’m getting SIGILL instead of SIGSEGV.

 process didn't exit successfully: `/home/steven_solana_com/solana/target/debug/deps/stake-c0e03b44927935e9
--nocapture` (signal: 4, SIGILL: illegal instruction)

Looking at my dump, I got the top line of the backtrace that was unknown in the backtraces behzad supplied; rest of the backtrace looks to be the same:

#0  rocksdb::Customizable::~Customizable (this=0x5577f1cc5f98 <rocksdb::BytewiseComparator()::bytewise>) at rocksdb/include/rocksdb/customizable.h:58
#1  0x00005577ed6ad5b5 in rocksdb::UserComparatorWrapper::Compare (this=0x7f087c155f70, a=..., b=...) at rocksdb/util/user_comparator_wrapper.h:33
...

I don’t have any other ideas besides bisect at the moment and since it takes just a few minutes to start, I’ll bisect in parallel with @yhchiang-sol.

Also, I combined all of the test data-points into the problem description at the top. Let’s update the list there for the sake of having all results in one spot.