solana: bank hash mismatch causing dropped votes

DR6 failed because more than 33% of the validators somehow generated an inconsistent bank hash, which caused the remainder of the cluster to reject their votes: https://github.com/solana-labs/solana/blob/719785a8d307b4269497aa82b148109e491019d5/programs/vote/src/vote_state.rs#L275-L279

The ledgers from both groups appear to be the same, and when a ledger from the inconsistent bank hash group is run through solana-ledger-tool verify, the correct bank hash is produced. So there appears to be a runtime race condition that causes https://github.com/solana-labs/solana/blob/c33b54794cb529e277870fea3b21185f37a9f802/runtime/src/bank.rs#L677 to vary.
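As a rough illustration of the failure mode, here is a minimal sketch in Rust; it is my own placeholder code, not the actual vote_state.rs logic, and the Vote struct, BankHash type, and vote_hash_matches function are invented names. The idea is that a vote carries the bank hash the voter computed for the slot it votes on, and the receiving validator drops the vote when that hash does not match the hash it computed locally for the same slot.

use std::collections::HashMap;

type Slot = u64;
type BankHash = [u8; 32];

struct Vote {
    slots: Vec<Slot>,
    hash: BankHash,
}

// Returns true only if the vote's bank hash matches the hash this validator
// computed for the last slot in the vote; a mismatch means the two nodes
// disagree on the state of that bank, so the vote is dropped.
fn vote_hash_matches(local_hashes: &HashMap<Slot, BankHash>, vote: &Vote) -> bool {
    vote.slots
        .last()
        .and_then(|slot| local_hashes.get(slot))
        .map_or(false, |local| *local == vote.hash)
}

fn main() {
    let mut local_hashes = HashMap::new();
    local_hashes.insert(1765u64, [0u8; 32]);
    // A vote whose hash differs from the locally computed one is rejected,
    // which is what the majority did to the votes of the errant >33%.
    let vote = Vote { slots: vec![1765], hash: [1u8; 32] };
    assert!(!vote_hash_matches(&local_hashes, &vote));
}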

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Admittedly, this was hard to repro.

However, it finally seems that I managed to repro this locally… (stay tuned)

[2020-01-14T09:37:40.088743523Z INFO  solana_core::replay_stage] new fork:1766 parent:1765 root:1734
thread '<unnamed>' panicked at 'ADvf3zXD87FfKV29mFCvnXiB7qMQRcQmQCET2qh4Vh74 dropped vote Vote { slots: [1765], hash: pRJynJGswe7gf2VryGfBFNVQvwgppMi3kCCqEqff2DQ, timestamp: None } failed to match hash pRJynJGswe7gf2VryGfBFNVQvwgppMi3kCCqEqff2DQ ZVuqHYbaMbbC3KAggriST6kyPPqQstT2dsAR9ZGScdM', programs/vote/src/vote_state.rs:237:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

@sakridge Thanks for checking it out! So, it seems that these log patterns are harmless.

Another suspicious error is:

log from 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af:
[2020-01-09T11:57:09.623904040Z INFO  solana_core::repair_service] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af repair req send_to(XXX.XXX.XXX.XXX:XXX) error Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
[2020-01-09T11:57:09.623915739Z INFO  solana_core::repair_service] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af repair req send_to(XXX.XXX.XXX.XXX:XXX) error Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
[2020-01-09T11:57:09.623924490Z INFO  solana_core::repair_service] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af repair req send_to(XXX.XXX.XXX.XXX:XXX) error Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }

This occurs a lot, but only on 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af, and there seems to be no corresponding error on the validator at the other side. Can the requesting validator really handle this error correctly?

It just means that when he sent the repair response, some kind of error was encountered when trying to send a packet. I’m not really sure exactly what can cause that, maybe a full UDP buffer or some kind of network driver/kernel issue. But the requesting node would ask other nodes to also send it to him at some point, so it shouldn’t be fatal.

Maybe it would matter if every node on the network somehow got into this state where they could not send anything, but that doesn’t seem to be the case.

I would like to understand better about what can cause this.
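For context, here is a minimal sketch of how a non-blocking UDP send surfaces this error; it is my own illustration, not the repair_service code, and the address and payload are placeholders. When the socket's send buffer is full, send_to fails with ErrorKind::WouldBlock (EAGAIN), the packet is simply not sent, and since repair requests are re-issued anyway, logging and moving on keeps the error non-fatal.

use std::io::ErrorKind;
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    socket.set_nonblocking(true)?;

    // With a non-blocking socket, a full send buffer makes send_to fail with
    // ErrorKind::WouldBlock (EAGAIN, "Resource temporarily unavailable"),
    // which is the OS error code 11 seen in the log above.
    match socket.send_to(b"repair-request", "127.0.0.1:8001") {
        Ok(_) => {}
        Err(e) if e.kind() == ErrorKind::WouldBlock => {
            // Drop this attempt; repair requests are re-issued periodically,
            // so a single lost request is not fatal.
            eprintln!("repair req send_to error {:?}", e);
        }
        Err(e) => return Err(e),
    }
    Ok(())
}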

Another point of note is that when the ledger from those errant nodes is replayed using solana-ledger-tool, it accepts the bank hashes of the majority and rejects its own votes, implying the ledger data itself is in fact correct and consistent across all machines.

Phew, I’ve found something interesting. Stay tuned!