solana: v1.10: 'solana-replay-stage' panicked at 'slot=XXX must exist in ProgressMap', core/src/progress_map.rs:576:32
The following panic was reported on v1.10.25 (src:d64f6808; feat:965221688) by a mainnet validator:
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.036263075Z WARN solana_core::replay_stage] purging slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.065577144Z WARN solana_core::replay_stage] purging descendant: 138158378 with slot_id 508786, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.067850205Z WARN solana_core::replay_stage] purging descendant: 138158377 with slot_id 508785, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.069361795Z WARN solana_core::replay_stage] purging descendant: 138158379 with slot_id 508788, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.072372591Z WARN solana_core::replay_stage] purging descendant: 138158376 with slot_id 508783, of slot 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.073289748Z WARN solana_core::replay_stage] Notifying repair service to repair duplicate slot: 138158376
Jun 19 14:55:13 mainnet solana-validator[2238461]: thread 'solana-replay-stage' panicked at 'slot=138158379 must exist in ProgressMap', core/src/progress_map.rs:576:32
Jun 19 14:55:13 mainnet solana-validator[2238461]: stack backtrace:
Jun 19 14:55:13 mainnet solana-validator[2238461]: 0: rust_begin_unwind
Jun 19 14:55:13 mainnet solana-validator[2238461]: at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
Jun 19 14:55:13 mainnet solana-validator[2238461]: 1: core::panicking::panic_fmt
Jun 19 14:55:13 mainnet solana-validator[2238461]: at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
Jun 19 14:55:13 mainnet solana-validator[2238461]: 2: solana_core::progress_map::ProgressMap::get_leader_propagation_slot_must_exist
Jun 19 14:55:13 mainnet solana-validator[2238461]: 3: solana_core::replay_stage::ReplayStage::retransmit_latest_unpropagated_leader_slot
Jun 19 14:55:13 mainnet solana-validator[2238461]: 4: solana_core::replay_stage::ReplayStage::new::{{closure}}
Jun 19 14:55:13 mainnet solana-validator[2238461]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Jun 19 14:55:13 mainnet solana-validator[2238461]: [2022-06-19T14:55:13.099064364Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solana-replay-stage" one=1i message="panicked at 'slot=138158379 must exist in ProgressMap', core/src/progress_map.rs:576:32" location="core/src/progress_map.rs:576:32" version="\"1.10.25 (src:d64f6808; feat:965221688)\""
Jun 19 14:55:26 mainnet systemd[1]: solana.service: Main process exited, code=killed, status=11/SEGV
About this issue
- State: closed
- Created 2 years ago
- Comments: 17 (15 by maintainers)
Technically, all of the actual debugging and reproduction was done on a testnet snapshot + ledger, and I can only definitively say that we hit the elevated value on testnet.
The initial issue report that was transcribed from Discord failed in the same way, but it is possible that something else caused the consensus deviation; not enough logs were provided to know with certainty.
A few updates:
Created a snapshot at slot `138773288` and replayed from it (i.e., `138773289` from `138773288`). This produces the correct hash for `138773288`, but the incorrect hash for slot `138773289`.

From the two above, it seems that there is some state that isn't being captured in the snapshot, or that the re-establishment of state from the snapshot isn't working correctly. After brainstorming with @brooksprumo, I'm now going to try to directly compare banks, one from the original full snapshot with replay, and one from loading the snapshot directly.
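To make that comparison concrete, here is a rough sketch of what I have in mind (`compare_states` and the argument names are placeholders, not actual Solana APIs; in practice the two arguments would be the bank produced by replaying on top of the snapshot and the bank loaded directly from a snapshot of the same slot):

```rust
use std::fmt::Debug;

// Hypothetical helper: the comparison only needs `PartialEq` for equivalence
// and `Debug` for a printout of what differs.
fn compare_states<T: PartialEq + Debug>(from_replay: &T, from_snapshot: &T) {
    if from_replay == from_snapshot {
        println!("states are equivalent");
    } else {
        // Diffing the two debug dumps is a crude but effective way to spot
        // which field diverges between replay and snapshot load.
        println!("replayed state:\n{from_replay:#?}");
        println!("snapshot state:\n{from_snapshot:#?}");
    }
}
```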
`Bank` implements `==` and the `Debug` trait, so I should be able to check equivalence and/or get a printout of what might be differing.

Do you have a nice way to share huge files? I've uploaded it to a Hetzner storage box and added your ssh keys (from https://github.com/steviez.keys).
You should be able to pull the log with
rsync --progress -e 'ssh -p23' --recursive u308574@u308574.your-storagebox.de:ledger/validator-log.log log.log

The full ledger is uploading right now, but that'll take at least 2h, likely 5-10h. When it's done you can pull with
rsync --progress -e 'ssh -p23' --recursive u308574@u308574.your-storagebox.de:ledger ledger

Feel free to add other ssh keys if you need (unfortunately there is no file editor on that storage box, but you can push a new authorized keys file: https://docs.hetzner.com/robot/storage-box/backup-space-ssh-keys)
@godmodegalactus - In this scenario, we think the slot should absolutely be in the `ProgressMap`. The fact that it isn't means our assumptions about state aren't valid / some state is corrupted, hence the hard panic.
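For reference, the access is an assert-style lookup: fetch the slot's entry and abort with a descriptive message if the invariant is violated, rather than limping along on corrupted state. A minimal sketch (the types and the method name below are simplified stand-ins, not the actual `ProgressMap` code; the real accessor in the backtrace is `get_leader_propagation_slot_must_exist`, which does more than this, but the failure mode is the same):

```rust
use std::collections::HashMap;

type Slot = u64;

// Simplified stand-in for the real per-slot progress data.
#[derive(Debug, Default)]
struct ForkProgress;

// Simplified stand-in for core/src/progress_map.rs's ProgressMap.
#[derive(Default)]
struct ProgressMap(HashMap<Slot, ForkProgress>);

impl ProgressMap {
    // Assert-style accessor: callers assume the slot's entry was inserted when
    // its bank was created, so a missing entry means the state is corrupted
    // and it is safer to abort than to keep voting/replaying.
    fn get_must_exist(&self, slot: Slot) -> &ForkProgress {
        self.0
            .get(&slot)
            .unwrap_or_else(|| panic!("slot={} must exist in ProgressMap", slot))
    }
}

fn main() {
    let mut progress = ProgressMap::default();
    progress.0.insert(138158379, ForkProgress::default());
    let _entry = progress.get_must_exist(138158379); // ok
    // progress.get_must_exist(138158380);           // would panic as in the log above
}
```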