solana: bank hash mismatch causing dropped votes
DR6 failed because more than 33% of the validators somehow generated an inconsistent bank hash which caused the remainder of the cluster to reject their votes: https://github.com/solana-labs/solana/blob/719785a8d307b4269497aa82b148109e491019d5/programs/vote/src/vote_state.rs#L275-L279
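To make the failure mode concrete, here is a minimal sketch of the kind of check that makes a node drop a vote whose bank hash differs from its own. It is a simplified model, not the linked `vote_state.rs` code: `check_vote`, `Vote`, `VoteError` and the `(slot, hash)` list are hypothetical names chosen for illustration.

```rust
// Simplified, hypothetical model of a bank-hash check on incoming votes.
// None of these names are taken from the Solana codebase.

pub type Slot = u64;
pub type Hash = [u8; 32];

#[derive(Debug, PartialEq)]
pub enum VoteError {
    /// The receiving node has no recorded hash for the voted-on slot.
    UnknownSlot,
    /// The vote's bank hash differs from the hash the receiver computed.
    BankHashMismatch,
}

pub struct Vote {
    pub slot: Slot,
    pub bank_hash: Hash,
}

/// Accept a vote only if its bank hash equals the hash this node
/// computed for the same slot.
pub fn check_vote(local_slot_hashes: &[(Slot, Hash)], vote: &Vote) -> Result<(), VoteError> {
    let (_, local_hash) = local_slot_hashes
        .iter()
        .find(|(slot, _)| *slot == vote.slot)
        .ok_or(VoteError::UnknownSlot)?;
    if *local_hash != vote.bank_hash {
        return Err(VoteError::BankHashMismatch);
    }
    Ok(())
}
```

Under this model, a validator that computes a different hash for a slot has every vote on that slot rejected by its peers, even though the ledger contents are identical.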
The ledgers from both groups appear to be the same, and when a ledger from the inconsistent bank hash group is run through solana-ledger-tool verify, the correct bank hash is produced. So there appears to be a runtime race condition that is causing https://github.com/solana-labs/solana/blob/c33b54794cb529e277870fea3b21185f37a9f802/runtime/src/bank.rs#L677 to vary.
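The root cause is not identified at this point in the thread. As a generic illustration only of how identical data can still yield different hashes, the sketch below contrasts an order-sensitive hash with one that sorts its input first. The function names are hypothetical, and `DefaultHasher` stands in for whatever hash the runtime actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Order-sensitive: feeds items to the hasher in the order they arrive.
/// If that order depends on thread scheduling, the result can vary
/// between runs even though the set of items is the same.
pub fn hash_in_arrival_order(items: &[u64]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for item in items {
        item.hash(&mut hasher);
    }
    hasher.finish()
}

/// Order-independent: sorts a copy of the items before hashing, so any
/// arrival order of the same items produces the same result.
pub fn hash_in_sorted_order(items: &[u64]) -> u64 {
    let mut sorted = items.to_vec();
    sorted.sort_unstable();
    hash_in_arrival_order(&sorted)
}
```

A single-threaded replay, such as the ledger-tool run described above, always sees one fixed order, which would be consistent with it producing the expected hash. Whether this is what happened here is an assumption, not something the thread confirms.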
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 15 (15 by maintainers)
Admittedly, this was hard to repro.
However, it seems I have finally managed to repro this locally… (stay tuned)
It just means that when the node sent the repair response, some kind of error was encountered while trying to send a packet. I’m not really sure what exactly can cause that, maybe a full UDP buffer or some kind of network driver/kernel issue. But the requesting node would at some point ask other nodes to send it the data as well, so it shouldn’t be fatal.
Maybe if every node on the network somehow got into this state where they could not send anything, but that doesn’t seem to be the case.
I would like to better understand what can cause this.
Another point of note is that when the ledger from those errant nodes is replayed using solana-ledger-tool, it accepts the bank hashes for the majority and rejects its own votes, implying the ledger data itself is in fact correct and consistent across all machines.
Phew, I’ve found something interesting. Stay tuned!