solana: Bankhash mismatch running #34623 against mainnet

Problem

My node, running a recent commit from master (https://github.com/solana-labs/solana/commit/6a9f72910141df9a27cc3985d2f615bd33f94938) against mainnet, crashed with a bank hash mismatch at slot 243108000 (https://explorer.solana.com/block/243108000).

   0: rust_begin_unwind
             at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: hashbrown::map::HashMap<K,V,S,A>::retain
   3: solana_core::replay_stage::ReplayStage::dump_then_repair_correct_slots
   4: solana_core::replay_stage::ReplayStage::new::{{closure}}
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2024-01-21T03:16:11.202872991Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solReplayStage" one=1i message="panicked at core/src/replay_stage.rs:1414:25:
    We have tried to repair duplicate slot: 243108000 more than 10 times and are unable to freeze a block with bankhash HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ, instead we have a block with bankhash Some(7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting" location="core/src/replay_stage.rs:1414:25" version="1.18.0 (src:00000000; feat:4046558620, client:SolanaLabs)"
[2024-01-21T03:16:11.580033138Z INFO  solana_metrics::metrics] datapoint: cluster_slots_service-timing lowest_slot_elapsed=2350i process_cluster_slots_updates_elapsed=237083i

Did any of our canary testing nodes hit this error as well?

Proposed Solution

Not sure yet, but I will investigate.

About this issue

  • State: closed
  • Created 5 months ago
  • Comments: 19 (19 by maintainers)

Most upvoted comments

And just making sure, you were running 6a9f729 with no other modifications?

I think this happened because the reward PDA account was created in the previous epoch, when I ran my node with partitioned rewards enabled (https://github.com/solana-labs/solana/pull/34809). That PR is incomplete.

I have pushed fixes for this just now.

So to confirm, you believe that an account from a PR that has not landed yet altered your account state and caused your node to diverge? And the fixes were pushed to your PR?

bad

[2024-01-21T03:16:02.013644297Z INFO  solana_runtime::bank] 
bank frozen: 243108000 hash: 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ 
accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500 last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs 
capitalization: 567558615827727708, epoch_accounts_hash: xwb6iXsG3vdHgUTAFkgUiACYPwYkW8F7465CNL9WVaC, 
stats: BankHashStats { num_updated_accounts: 1465, num_removed_accounts: 14, num_lamports_stored: 39484939731971, total_data_len: 10492554, num_executable_accounts: 1 }

good

[2024-01-21T03:16:02.022355358Z INFO  solana_runtime::bank] 
bank frozen: 243108000 hash: HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ 
accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500 last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs 
capitalization: 567558615826502748, epoch_accounts_hash: 824tUYuwAKFv2kKz5m2Xf8YHNYhYUhsqwohjmrvTp3Be, 
stats: BankHashStats { num_updated_accounts: 1465, num_removed_accounts: 14, num_lamports_stored: 39484939731971, total_data_len: 10492554, num_executable_accounts: 1 }

Hmm yeah, it looks like the only things that differ here are the epoch_accounts_hash and the capitalization. The fact that your node did not diverge previously suggests that the account that caused the EAH to diverge did NOT appear as part of any recent bank hash. So I don't think replaying the slot will give us any useful information. Rather, I think we would have to examine each account to find the offending one.
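To make the comparison of the two `bank frozen` entries above less error-prone than reading them by eye, here is a minimal sketch that parses the `key: value` pairs out of each entry and lists the fields that differ. The helper names `parse_bank_frozen` and `diff_fields` are hypothetical (they are not part of the Solana codebase), and the parser assumes the log layout shown in this issue; the trailing `stats:` struct is skipped since it is identical in both entries.

```rust
use std::collections::BTreeMap;

/// Parse the `key: value` pairs of a "bank frozen" log entry.
/// Hypothetical helper: assumes the layout pasted in this issue and
/// ignores everything from `stats:` onward.
fn parse_bank_frozen(log: &str) -> BTreeMap<String, String> {
    let head = log.split("stats:").next().unwrap_or("");
    // Rename the two-word key so every key is a single token.
    let head = head.replace("bank frozen:", "slot:");
    let mut fields = BTreeMap::new();
    let mut key: Option<String> = None;
    for tok in head.split_whitespace() {
        if let Some(k) = tok.strip_suffix(':') {
            key = Some(k.to_string());
        } else if let Some(k) = key.take() {
            fields.insert(k, tok.trim_end_matches(',').to_string());
        }
        // Tokens with no pending key (timestamp, log level) are ignored.
    }
    fields
}

/// Names of the fields in `a` whose value differs from, or is missing in, `b`.
fn diff_fields(a: &BTreeMap<String, String>, b: &BTreeMap<String, String>) -> Vec<String> {
    a.iter()
        .filter(|(k, v)| b.get(*k) != Some(*v))
        .map(|(k, _)| k.clone())
        .collect()
}

fn main() {
    // Values copied from the "bad" and "good" entries in this issue.
    let bad = parse_bank_frozen(
        "bank frozen: 243108000 hash: 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ
         accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500
         last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs
         capitalization: 567558615827727708,
         epoch_accounts_hash: xwb6iXsG3vdHgUTAFkgUiACYPwYkW8F7465CNL9WVaC,",
    );
    let good = parse_bank_frozen(
        "bank frozen: 243108000 hash: HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ
         accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500
         last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs
         capitalization: 567558615826502748,
         epoch_accounts_hash: 824tUYuwAKFv2kKz5m2Xf8YHNYhYUhsqwohjmrvTp3Be,",
    );
    for field in diff_fields(&bad, &good) {
        println!("{field}: bad={} good={}", bad[&field], good[&field]);
    }
}
```

On the two entries from this issue, the fields reported are `capitalization`, `epoch_accounts_hash` and `hash`; `accounts_delta`, `signature_count` and `last_blockhash` match.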

For my own understanding as well, I looked up the snippet where the EAH gets mixed into the bank hash: https://github.com/solana-labs/solana/blob/9db4e84e723f9b9e4c5c5ac627718301af982783/runtime/src/bank.rs#L6963-L7004
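As a rough illustration of why a differing EAH alone is enough to change the bank hash, here is a toy sketch of the mix-in step as I understand it from the linked snippet: a base hash is computed from the parent hash, accounts delta hash, signature count and last blockhash, and when the bank is one that includes the EAH, the result is hashed again together with the EAH. Everything here is an assumption for illustration: `toy_hashv` uses std's `DefaultHasher` as a stand-in for the real SHA-256 `hashv`, `toy_bank_hash` is a hypothetical name, the base58 strings are fed in as raw bytes without decoding, and the parent hash is a placeholder because it does not appear in the logs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for a multi-part hash. NOT SHA-256 and NOT the real `hashv`.
fn toy_hashv(parts: &[&[u8]]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for part in parts {
        part.hash(&mut hasher);
    }
    hasher.finish()
}

/// Toy model of the bank hash: base hash first, then an optional EAH mix-in.
fn toy_bank_hash(
    parent_hash: &[u8],
    accounts_delta: &[u8],
    signature_count: u64,
    last_blockhash: &[u8],
    epoch_accounts_hash: Option<&[u8]>,
) -> u64 {
    let sig = signature_count.to_le_bytes();
    let mut hash = toy_hashv(&[parent_hash, accounts_delta, &sig[..], last_blockhash]);
    if let Some(eah) = epoch_accounts_hash {
        // Only banks that include the EAH take this second hashing step.
        let prev = hash.to_le_bytes();
        hash = toy_hashv(&[&prev[..], eah]);
    }
    hash
}

fn main() {
    let parent = b"placeholder-parent-hash"; // hypothetical, not in the logs
    let delta = b"GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8";
    let blockhash = b"Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs";
    let bad_eah: &[u8] = b"xwb6iXsG3vdHgUTAFkgUiACYPwYkW8F7465CNL9WVaC";
    let good_eah: &[u8] = b"824tUYuwAKFv2kKz5m2Xf8YHNYhYUhsqwohjmrvTp3Be";

    let bad = toy_bank_hash(parent, delta, 500, blockhash, Some(bad_eah));
    let good = toy_bank_hash(parent, delta, 500, blockhash, Some(good_eah));
    let no_eah = toy_bank_hash(parent, delta, 500, blockhash, None);

    println!("hashes differ with different EAH: {}", bad != good);
    println!("EAH mix-in changes the base hash: {}", good != no_eah);
}
```

This matches what the logs show: identical `accounts_delta`, `signature_count` and `last_blockhash`, yet different final hashes once the EAH is mixed in. A bank that does not include the EAH would not expose the divergence, which is consistent with the node not diverging earlier.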