solana: Bankhash mismatch running #34623 against mainnet
Problem
My node, running a recent commit from master (https://github.com/solana-labs/solana/commit/6a9f72910141df9a27cc3985d2f615bd33f94938) against mainnet, crashed with a bank hash mismatch at slot 243108000 (https://explorer.solana.com/block/243108000).
0: rust_begin_unwind
at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
1: core::panicking::panic_fmt
at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
2: hashbrown::map::HashMap<K,V,S,A>::retain
3: solana_core::replay_stage::ReplayStage::dump_then_repair_correct_slots
4: solana_core::replay_stage::ReplayStage::new::{{closure}}
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2024-01-21T03:16:11.202872991Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solReplayStage" one=1i message="panicked at core/src/replay_stage.rs:1414:25:
We have tried to repair duplicate slot: 243108000 more than 10 times and are unable to freeze a block with bankhash HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ, instead we have a block with bankhash Some(7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting" location="core/src/replay_stage.rs:1414:25" version="1.18.0 (src:00000000; feat:4046558620, client:SolanaLabs)"
[2024-01-21T03:16:11.580033138Z INFO solana_metrics::metrics] datapoint: cluster_slots_service-timing lowest_slot_elapsed=2350i process_cluster_slots_updates_elapsed=237083i
Have any of our canary testing nodes hit this error too?
Proposed Solution
Not sure yet, but I will investigate.
About this issue
- State: closed
- Created 5 months ago
- Comments: 19 (19 by maintainers)
So to confirm, you believe that a change from a PR that has not landed yet altered your account state and caused your node to diverge? And the fixes were pushed to your PR?
Hmm yeah, it looks like the only things that differ here are the epoch_accounts_hash (EAH) and the capitalization. The fact that your node did not diverge previously suggests that the account that caused the EAH to diverge did NOT appear as part of any recent bank hashes. So I don’t think replaying the slot will give us any useful information. Rather, I think we would have to examine each account to find the offending one.
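To illustrate what "examine each account" could look like, here is a minimal sketch that diffs two per-account hash maps, one from the diverged node and one from a known-good node. Everything in it is an assumption for illustration: the `Pubkey` and `AccountHash` aliases, the `Mismatch` enum, and `diff_accounts` are hypothetical names, not Solana APIs, and how the two maps get populated (for example, from snapshots taken at the same slot) is left out.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins: a 32-byte account address and a 32-byte per-account hash.
type Pubkey = [u8; 32];
type AccountHash = [u8; 32];

/// One way an account can disagree between the two nodes.
#[derive(Debug, PartialEq, Eq)]
enum Mismatch {
    /// Present on the diverged node only.
    OnlyInLocal(Pubkey),
    /// Present on the known-good node only.
    OnlyInReference(Pubkey),
    /// Present on both nodes, but with different hashes.
    Different {
        key: Pubkey,
        local: AccountHash,
        reference: AccountHash,
    },
}

/// Compare the two maps and return every account that does not match.
/// Accounts from `local` are reported first (in key order), followed by
/// accounts that exist only in `reference`.
fn diff_accounts(
    local: &BTreeMap<Pubkey, AccountHash>,
    reference: &BTreeMap<Pubkey, AccountHash>,
) -> Vec<Mismatch> {
    let mut out = Vec::new();
    for (key, local_hash) in local {
        match reference.get(key) {
            None => out.push(Mismatch::OnlyInLocal(*key)),
            Some(reference_hash) if reference_hash != local_hash => {
                out.push(Mismatch::Different {
                    key: *key,
                    local: *local_hash,
                    reference: *reference_hash,
                })
            }
            Some(_) => {} // hashes agree, nothing to report
        }
    }
    for key in reference.keys() {
        if !local.contains_key(key) {
            out.push(Mismatch::OnlyInReference(*key));
        }
    }
    out
}
```

An empty result would mean the two account sets agree, in which case the divergence would have to come from somewhere other than account contents.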
For my own understanding, I looked up where the EAH gets mixed into the bank hash. Here is the snippet: https://github.com/solana-labs/solana/blob/9db4e84e723f9b9e4c5c5ac627718301af982783/runtime/src/bank.rs#L6963-L7004
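As I read that snippet, the bank hash is first computed from the parent hash, the accounts delta hash, the signature count, and the last blockhash, and the EAH is hashed in on top only for the bank that is supposed to include it. The sketch below is a simplified model of that shape, not the real code: it uses `u64` values and the standard library's `DefaultHasher` as a stand-in for the SHA-256 based `hashv`, and the `bank_hash` function with its `Option` parameter is a hypothetical simplification.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in for a SHA-256 `hashv`: hashes the byte slices in order.
/// NOT cryptographic; used only to keep the sketch dependency-free.
fn hashv(parts: &[&[u8]]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for part in parts {
        hasher.write(part);
    }
    hasher.finish()
}

/// Simplified model of the bank hash. `epoch_accounts_hash` is `Some` only
/// for the bank that must include the EAH, and `None` for every other bank.
fn bank_hash(
    parent_hash: u64,
    accounts_delta_hash: u64,
    signature_count: u64,
    last_blockhash: u64,
    epoch_accounts_hash: Option<u64>,
) -> u64 {
    // Base hash: the inputs every bank contributes.
    let base = hashv(&[
        &parent_hash.to_le_bytes(),
        &accounts_delta_hash.to_le_bytes(),
        &signature_count.to_le_bytes(),
        &last_blockhash.to_le_bytes(),
    ]);
    // Mix the EAH in on top of the base hash, but only when it is present.
    match epoch_accounts_hash {
        Some(eah) => hashv(&[&base.to_le_bytes(), &eah.to_le_bytes()]),
        None => base,
    }
}
```

In this model, two nodes that agree on every base input produce the same hash for any bank without an EAH, and diverge only at the one bank where differing EAH values are mixed in. That is consistent with a node that ran without errors until the slot that includes the EAH.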