solana: Consistent segfaults on 1.10.34
I reproduced the issue three separate times.
First time:
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809076] blockstore_22[18150]: segfault at 561791650aa0 ip 0000561791650aa0 sp 00007ecb3f2373f8 error 15
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809080] blockstore_11[18089]: segfault at 561791650aa0 ip 0000561791650aa0 sp 00007ecb46c74768 error 15
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809090] blockstore_5[18055]: segfault at 561791650aa0 ip 0000561791650aa0 sp 00007ecb4ae95318 error 15
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809096] Code: 00 00 40 82 bd 8f 17 56 00 00 90 53 be 8f 17 56 00 00 80 15 be 8f 17 56 00 00 a0 6b be 8f 17 56 00 00 70 40 be 8f 17 56 00 00 <08> f0 60 f2 86 7f 00 00 60 00 b3 90 17 56 00 00 00 00 00 00 00 00
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809106] in solana-validator[5617915bd000+165000]
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809107] in solana-validator[5617915bd000+165000]
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809118] Code: 00 00 40 82 bd 8f 17 56 00 00 90 53 be 8f 17 56 00 00 80 15 be 8f 17 56 00 00 a0 6b be 8f 17 56 00 00 70 40 be 8f 17 56 00 00 <08> f0 60 f2 86 7f 00 00 60 00 b3 90 17 56 00 00 00 00 00 00 00 00
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809127] Code: 00 00 40 82 bd 8f 17 56 00 00 90 53 be 8f 17 56 00 00 80 15 be 8f 17 56 00 00 a0 6b be 8f 17 56 00 00 70 40 be 8f 17 56 00 00 <08> f0 60 f2 86 7f 00 00 60 00 b3 90 17 56 00 00 00 00 00 00 00 00
Second time:
Aug 7 13:05:06 NM-PROD-RPC1 kernel: [37423.188157] show_signal_msg: 20 callbacks suppressed
Aug 7 13:05:06 NM-PROD-RPC1 kernel: [37423.188159] sol-rpc-el[3075]: segfault at 55d85fbf7aa0 ip 000055d85fbf7aa0 sp 00007f20461e0768 error 15 in solana-validator[55d85fb64000+165000]
Aug 7 13:05:06 NM-PROD-RPC1 kernel: [37423.188166] Code: 00 00 40 f2 17 5e d8 55 00 00 90 c3 18 5e d8 55 00 00 80 85 18 5e d8 55 00 00 a0 db 18 5e d8 55 00 00 70 b0 18 5e d8 55 00 00 <08> a0 87 37 d8 7f 00 00 60 70 0d 5f d8 55 00 00 00 00 00 00 00 00
Third time:
Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.988592] rocksdb:low[49990]: segfault at 5602332c8aa0 ip 00005602332c8aa0 sp 00007f5d7d45d018 error 15 in solana-validator[560233235000+165000]
Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.988599] Code: 00 00 40 02 85 31 02 56 00 00 90 d3 85 31 02 56 00 00 80 95 85 31 02 56 00 00 a0 eb 85 31 02 56 00 00 70 c0 85 31 02 56 00 00 <08> 50 87 09 5e 7f 00 00 60 80 7a 32 02 56 00 00 00 00 00 00 00 00
Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.991253] solana-window-i[51047]: segfault at 5602332c8aa0 ip 00005602332c8aa0 sp 00007e9c50bf97d8 error 15 in solana-validator[560233235000+165000]
Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.991259] Code: 00 00 40 02 85 31 02 56 00 00 90 d3 85 31 02 56 00 00 80 95 85 31 02 56 00 00 a0 eb 85 31 02 56 00 00 70 c0 85 31 02 56 00 00 <08> 50 87 09 5e 7f 00 00 60 80 7a 32 02 56 00 00 00 00 00 00 00 00
This happens when the node receives a massive amount of RPC traffic for an extended period of time. The system halts COMPLETELY and has no chance to recover on its own; it requires a full power reset of the associated server chassis.
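For anyone triaging similar reports, here is a minimal Python sketch that parses a kernel `segfault at ...` line, decodes the page-fault error code, and computes the offset of the instruction pointer inside the reported mapping. The function names are hypothetical, and the sketch assumes x86-64 and that the kernel prints the error code in hexadecimal (so `error 15` is 0x15). The offset is relative to the mapping shown in brackets, which is not necessarily the load base of the binary.

```python
import re

# Matches the kernel's "segfault at ..." line. The trailing
# "in <binary>[<base>+<size>]" part is optional because interleaved
# log output can push it onto a separate line.
SEGFAULT_RE = re.compile(
    r"(?P<thread>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<binary>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)

# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
ERROR_BITS = [
    (0x01, "protection violation (page present)", "page not present"),
    (0x02, "write access", "non-write access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_error(code):
    """Return human-readable descriptions for a page-fault error code."""
    flags = []
    for mask, if_set, if_clear in ERROR_BITS:
        if code & mask:
            flags.append(if_set)
        elif if_clear:
            flags.append(if_clear)
    return flags


def parse_segfault(line):
    """Parse one kernel log line; return None if it is not a segfault line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    info = {
        "thread": m["thread"],
        "fault_addr": int(m["addr"], 16),
        "ip": ip,
        "error_flags": decode_error(int(m["err"], 16)),
        "offset": None,  # ip relative to the reported mapping, if present
    }
    if m["base"] is not None:
        info["offset"] = ip - int(m["base"], 16)
    return info


line = (
    "Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.988592] rocksdb:low[49990]: "
    "segfault at 5602332c8aa0 ip 00005602332c8aa0 sp 00007f5d7d45d018 "
    "error 15 in solana-validator[560233235000+165000]"
)
info = parse_segfault(line)
print(info["thread"], hex(info["offset"]), info["error_flags"])
```

Applied to the lines above that include the mapping, this gives the same offset (0x93aa0) in the second and third occurrences, with the fault address equal to the instruction pointer and the instruction-fetch bit set each time. That pattern would be consistent with a jump to a non-executable address, though a core dump is still needed to say where the bad pointer came from.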
About this issue
- State: closed
- Created 2 years ago
- Comments: 19 (19 by maintainers)
Commits related to this issue
- Fix a corner-case panic in get_entries_in_data_block() (#27195) #### Problem get_entries_in_data_block() panics when there's inconsistency between slot_meta and data_shred. However, as we don't ... — committed to solana-labs/solana by yhchiang-sol 2 years ago
- Fix a corner-case panic in get_entries_in_data_block() (#27195) #### Problem get_entries_in_data_block() panics when there's inconsistency between slot_meta and data_shred. However, as we don't lock... — committed to solana-labs/solana by yhchiang-sol 2 years ago
- Fix the inconsistency check in get_entries_in_data_block() (backport #27195) (#27231) Fix a corner-case panic in get_entries_in_data_block() (#27195) #### Problem get_entries_in_data_block() panics ... — committed to solana-labs/solana by mergify[bot] 2 years ago
- Fix the inconsistency check in get_entries_in_data_block() (backport #27195) (#27232) Fix a corner-case panic in get_entries_in_data_block() (#27195) #### Problem get_entries_in_data_block() panics ... — committed to solana-labs/solana by mergify[bot] 2 years ago
- Fix a corner-case panic in get_entries_in_data_block() (#27195) #### Problem get_entries_in_data_block() panics when there's inconsistency between slot_meta and data_shred. However, as we don't ... — committed to HaoranYi/solana by yhchiang-sol 2 years ago
- Refactor epoch reward 4 (#27261) * refactor: extract store_stake_accounts fn * refactor: extract store_vote_account fn * refactor: extract reward history update fn * remove avg point value from pa... — committed to solana-labs/solana by HaoranYi 2 years ago
I just enabled apport to capture the core dump. It is only a matter of time before it happens again, and then I can provide you with a core dump.
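Before waiting for the next crash, it can be worth confirming that core dumps will actually be captured. The following is a rough Python sketch, with a hypothetical helper name, that reports the core-size limit and the kernel's `core_pattern`. Note the assumption: it reads the limits of the process running the script, not of the validator itself; the validator's own limits would be listed in `/proc/<pid>/limits`.

```python
import resource
from pathlib import Path


def core_dump_settings(pattern_path="/proc/sys/kernel/core_pattern"):
    """Report the core-size limit of this process and the kernel core_pattern."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    try:
        pattern = Path(pattern_path).read_text().strip()
    except OSError:
        pattern = None  # not Linux, or /proc is not readable

    def fmt(limit):
        return "unlimited" if limit == resource.RLIM_INFINITY else limit

    return {
        "soft_limit": fmt(soft),
        "hard_limit": fmt(hard),
        "core_pattern": pattern,
        # A pattern starting with "|" means the kernel pipes the dump
        # to a user-space handler instead of writing a file directly.
        "piped_to_handler": bool(pattern) and pattern.startswith("|"),
    }


for key, value in core_dump_settings().items():
    print(f"{key}: {value}")
```

If `piped_to_handler` is true, the dump goes to whatever program the pattern names, so that program's configuration decides where the core file ends up.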
Re: #25941 that is super interesting. Definitely could see that as a potential issue. I will let you know as soon as I have a core dump to share! Thanks Steviez!
@codemonkey6969 - I had asked in Discord, but it might have gotten lost: do you have any core dumps available? We definitely appreciate you reporting this; however, the segfault log lines unfortunately don't give us much to go on. Being able to poke around a core dump might be more illuminating.
One interesting note is that I had previously been a little suspicious of blockstore/rocksdb because of https://github.com/solana-labs/solana/issues/25941. The threads in this report that are segfaulting all have a handle to blockstore. Moreover, one of the segfaults under your "Third time" section is in a thread spun up by rocksdb.
@yhchiang-sol - FYI for visibility, and maybe you can comment on whether there might be anything of value in the rocksdb logs to inspect.