solana: v1.10: 'solana-replay-stage' panicked at 'slot=XXX must exist in ProgressMap', core/src/progress_map.rs:576:32

The following panic was reported on v1.10.25 (src:d64f6808; feat:965221688) by a mainnet validator:

Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.036263075Z WARN  solana_core::replay_stage] purging slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.065577144Z WARN  solana_core::replay_stage] purging descendant: 138158378 with slot_id 508786, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.067850205Z WARN  solana_core::replay_stage] purging descendant: 138158377 with slot_id 508785, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.069361795Z WARN  solana_core::replay_stage] purging descendant: 138158379 with slot_id 508788, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.072372591Z WARN  solana_core::replay_stage] purging descendant: 138158376 with slot_id 508783, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.073289748Z WARN  solana_core::replay_stage] Notifying repair service to repair duplicate slot: 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: thread 'solana-replay-stage' panicked at 'slot=138158379 must exist in ProgressMap', core/src/progress_map.rs:576:32
Jun 19 14:55:13 mainnet solana-validator[2238461]: stack backtrace:
Jun 19 14:55:13 mainnet solana-validator[2238461]:    0: rust_begin_unwind
Jun 19 14:55:13 mainnet solana-validator[2238461]:              at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
Jun 19 14:55:13 mainnet solana-validator[2238461]:    1: core::panicking::panic_fmt
Jun 19 14:55:13 mainnet solana-validator[2238461]:              at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
Jun 19 14:55:13 mainnet solana-validator[2238461]:    2: solana_core::progress_map::ProgressMap::get_leader_propagation_slot_must_exist
Jun 19 14:55:13 mainnet solana-validator[2238461]:    3: solana_core::replay_stage::ReplayStage::retransmit_latest_unpropagated_leader_slot
Jun 19 14:55:13 mainnet solana-validator[2238461]:    4: solana_core::replay_stage::ReplayStage::new::{{closure}}
Jun 19 14:55:13 mainnet solana-validator[2238461]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.099064364Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-replay-stage" one=1i message="panicked at 'slot=138158379 must exist in ProgressMap', core/src/progress_map.rs:576:32" location="core/src/progress_map.rs:576:32" version="\"1.10.25 (src:d64f6808; feat:965221688)\""
Jun 19 14:55:26 mainnet systemd[1]: solana.service: Main process exited, code=killed, status=11/SEGV

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (15 by maintainers)

Most upvoted comments

wao! mb hit congestion fees?

Technically, all of the actual debugging and reproduction was done on a testnet snapshot + ledger, and I can only definitively say that we hit the elevated value on testnet.

The initial issue report that was transcribed from Discord failed in the same way, but it is possible that something else caused the consensus deviation; not enough logs were provided to know with certainty.

A few updates:

  • I tried replaying from the full snapshot only (no incremental snapshot, thus replaying across the epoch boundary); this produces the correct hash for slot 138773289
  • I tried creating a full snapshot at the same slot that the incremental snapshot is at (138773288). This produces the correct hash for 138773288, but the incorrect hash for slot 138773289.

From the two above, it seems that some state isn’t being captured in the snapshot, or that the re-establishment of state from the snapshot isn’t working correctly. After brainstorming with @brooksprumo, I’m now going to try to directly compare banks: one from the original full snapshot with replay, and one from loading the snapshot directly. Bank implements == and the Debug trait, so I should be able to check equivalence and/or get a printout of what might be differing.
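As a rough sketch of that plan, the following self-contained Rust example compares two values via PartialEq and Debug, the same mechanism the Bank comparison would rely on; the ToyBank struct and compare_states helper here are hypothetical stand-ins, not the actual Bank APIs:

```rust
use std::fmt::Debug;

/// Compare two values that implement PartialEq + Debug, mirroring the planned
/// Bank comparison: `==` checks equivalence, `{:#?}` gives a field-by-field printout.
fn compare_states<T: PartialEq + Debug>(label: &str, replayed: &T, from_snapshot: &T) -> bool {
    if replayed == from_snapshot {
        println!("{label}: states are equivalent");
        true
    } else {
        println!("{label}: states differ");
        println!("replayed:      {replayed:#?}");
        println!("from snapshot: {from_snapshot:#?}");
        false
    }
}

fn main() {
    // Toy stand-in; the real comparison would be between two Banks for the same
    // slot, one produced by replay and one loaded directly from the snapshot.
    #[derive(PartialEq, Debug)]
    struct ToyBank {
        slot: u64,
        hash: &'static str,
    }

    let replayed = ToyBank { slot: 138773289, hash: "hash-from-replay" };
    let from_snapshot = ToyBank { slot: 138773289, hash: "hash-from-snapshot" };
    compare_states("toy bank", &replayed, &from_snapshot);
}
```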

Do you have a nice way to share huge files? I’ve uploaded it to a Hetzner storage box and added your ssh keys (from https://github.com/steviez.keys).

You should be able to pull the log with rsync --progress -e 'ssh -p23' --recursive u308574@u308574.your-storagebox.de:ledger/validator-log.log log.log

The full ledger is uploading right now, but that’ll take at least 2h, likely 5-10h. When it’s done you can pull with

rsync --progress -e 'ssh -p23' --recursive u308574@u308574.your-storagebox.de:ledger ledger

Feel free to add other ssh keys if you need to (unfortunately there is no file editor on that storage box, but you can push a new authorized_keys file: https://docs.hetzner.com/robot/storage-box/backup-space-ssh-keys)

What is the expected behavior if the slot is not in ProgressMap?

@godmodegalactus - In this scenario, we think the slot should absolutely be in the ProgressMap. The fact that it isn’t means our assumptions about state aren’t valid / some state is corrupted, hence the hard panic.
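For illustration, here is a minimal sketch of that hard-panic pattern, assuming a toy ForkProgress type and a hypothetical get_must_exist method; the real code path is ProgressMap::get_leader_propagation_slot_must_exist, as seen in the backtrace above:

```rust
use std::collections::HashMap;

type Slot = u64;

// Toy stand-in for the per-slot state the real ProgressMap tracks.
#[derive(Debug)]
struct ForkProgress {
    is_propagated: bool,
}

struct ProgressMap {
    map: HashMap<Slot, ForkProgress>,
}

impl ProgressMap {
    // Callers rely on the slot already being tracked; if it is missing, the
    // replay-stage state is inconsistent, so fail loudly instead of limping on.
    fn get_must_exist(&self, slot: Slot) -> &ForkProgress {
        self.map
            .get(&slot)
            .unwrap_or_else(|| panic!("slot={slot} must exist in ProgressMap"))
    }
}

fn main() {
    let progress = ProgressMap { map: HashMap::new() };
    // Panics with "slot=138158379 must exist in ProgressMap", matching the log above.
    let _ = progress.get_must_exist(138158379);
}
```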