solana: validator version 1.16.16 crashed with "unable to freeze a block with bankhash"

Problem

I am running a primary/secondary setup on two physically separate validators. Both were running 1.16.16-jito with local vote modifications. I do not believe the modifications alter any aspect of accounts state, because they only change which slots are voted on in replay_stage.rs.

Both the primary and the secondary crashed simultaneously with the same error message, so two separate machines hit the same issue. Note that the secondary likely took the tower from the primary shortly before the primary crashed, which is common behavior when a secondary’s tower gets out of sync with the primary.

The log message on both validators was:

[2023-10-16T22:47:33.894214958Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solReplayStage" one=1i message="panicked at 'We have tried to repair duplicate slot: 224116920 more than 10 times and are unable to freeze a block with bankhash 38rFNeVT1FGHResoXsgTN2cyyjBiJd6ZFcXWCPnXzZn7, instead we have a block with bankhash Some(GVZoNMHn1o8CB76i3LLRHjTA1p8f4iyTa6hn9eopWJh1). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting', core/src/replay_stage.rs:1585:25" location="core/src/replay_stage.rs:1585:25" version="1.16.16 (src:00000000; feat:4033350765, client:JitoLabs)"
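For anyone comparing logs across several machines, the relevant fields can be pulled out of that panic line with a small script. This is only a sketch: the regular expression is written against the exact wording of the message above, and the field names (`wanted_bankhash`, `frozen_bankhash`) are my own labels for the two hashes, not names used by the validator.

```python
import re

# Pattern written against the wording of the panic message quoted above.
# The trailing hash is printed as an Option, so "None" is tolerated as well.
PANIC_RE = re.compile(
    r"duplicate slot: (?P<slot>\d+) more than (?P<attempts>\d+) times "
    r"and are unable to freeze a block with bankhash (?P<wanted>\w+), "
    r"instead we have a block with bankhash (?:Some\((?P<frozen>\w+)\)|None)"
)


def parse_bankhash_panic(line):
    """Return the slot and both bank hashes from a panic line, or None if no match."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    return {
        "slot": int(m.group("slot")),
        "attempts": int(m.group("attempts")),
        "wanted_bankhash": m.group("wanted"),   # hash the validator failed to freeze
        "frozen_bankhash": m.group("frozen"),   # hash it actually has (None if absent)
    }


sample = (
    "panicked at 'We have tried to repair duplicate slot: 224116920 more than 10 times "
    "and are unable to freeze a block with bankhash 38rFNeVT1FGHResoXsgTN2cyyjBiJd6ZFcXWCPnXzZn7, "
    "instead we have a block with bankhash Some(GVZoNMHn1o8CB76i3LLRHjTA1p8f4iyTa6hn9eopWJh1)."
)
info = parse_bankhash_panic(sample)
print(info)
```

Running the same function over the logs of both validators makes it easy to check that they report the same slot and the same pair of hashes.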

I have collected the following files from one of the validators:

snapshot: https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/snapshot-224107039-GTo6rcFciwNew8Bx9Hggv7n9oS5AZVeQGSXQMkaFfkfy.tar.zst

incremental snapshot: https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/incremental-snapshot-224107039-224116213-qg6B13UGAAMKDm6XqG6p6PchnZ7RAgjhbe759igiFce.tar.zst

ledger (copied 10,000 slots leading up to the crash): https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/ledger.tar.gz

validator logs leading up to the crash: https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/validator.log.gz
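To sanity-check that these files cover the crash, the slots encoded in the archive names can be compared with the slot from the panic message. The sketch below assumes the names follow the layout visible in the links above (`snapshot-<slot>-<hash>` and `incremental-snapshot-<base slot>-<slot>-<hash>`); the helper `archive_slots` is a hypothetical name introduced only for illustration.

```python
import re

# Name layouts inferred from the two snapshot links above (an assumption).
FULL_RE = re.compile(r"^snapshot-(\d+)-(\w+)\.tar\.\w+$")
INCR_RE = re.compile(r"^incremental-snapshot-(\d+)-(\d+)-(\w+)\.tar\.\w+$")


def archive_slots(url):
    """Extract the slot numbers encoded in a snapshot archive name."""
    name = url.rsplit("/", 1)[-1]
    m = INCR_RE.match(name)
    if m:
        return {"kind": "incremental", "base_slot": int(m.group(1)), "slot": int(m.group(2))}
    m = FULL_RE.match(name)
    if m:
        return {"kind": "full", "base_slot": None, "slot": int(m.group(1))}
    return None


CRASH_SLOT = 224116920  # slot named in the panic message

full = archive_slots(
    "snapshot-224107039-GTo6rcFciwNew8Bx9Hggv7n9oS5AZVeQGSXQMkaFfkfy.tar.zst"
)
incr = archive_slots(
    "incremental-snapshot-224107039-224116213-qg6B13UGAAMKDm6XqG6p6PchnZ7RAgjhbe759igiFce.tar.zst"
)

# The incremental snapshot must be built on top of the full snapshot it names.
assert incr["base_slot"] == full["slot"]

# Number of slots that have to be replayed from the snapshot to reach the crash.
gap = CRASH_SLOT - incr["slot"]
print(gap)
```

With the names above, the incremental snapshot sits 707 slots before the slot that triggered the panic, so replaying from it should reach the failing slot quickly.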

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 31 (31 by maintainers)

Most upvoted comments

That being said, I did just reproduce the crash with a stock 1.16.19 that is not running Jito. So I am now switching my primary and secondary to 1.17.5, which seems to be impervious to this issue.

Recall that v1.17.5 isn’t officially suggested for mainnet-beta (mnb), so you may encounter other incompatibilities (hopefully not). If you’re up for it, I did do the cherry-pick of that debug file to v1.16; that file would be pretty helpful for getting some insight into what the issue you’re facing might be.

I appreciate that, but thus far the only software that hasn’t crashed on me in this way is 1.17.5. So I’m sticking with that on my primary.

I’ll run your patch on 1.16.19 on my secondary.