tendermint-rs: Wrong validation of LightBlock's commits
I have a failing MBT test for the LightClient verifier, with an empty commit in the LightBlock.
This is how the test fails:
> step 2, expecting Invalid
> lite: NotEnoughTrust(NotEnoughTrust(VotingPowerTally { total: 100, tallied: 0, trust_threshold: TrustThresholdFraction { numerator: 1, denominator: 3 } }))
The relevant test part is this:
```tla
[current |->
   [Commits |-> {},
    header |->
      [NextVS |-> { "n1", "n2", "n3", "n4" },
       VS |-> { "n2", "n3" },
       height |-> 5,
       lastCommit |-> { "n1", "n3" },
       time |-> 6]],
 now |-> 1401,
 verdict |-> "INVALID",
 verified |->
   [Commits |-> { "n1", "n2", "n4" },
    header |->
      [NextVS |-> { "n2", "n3" },
       VS |-> { "n1", "n2", "n4" },
       height |-> 2,
       lastCommit |-> { "n1", "n2", "n3", "n4" },
       time |-> 3]]]
```
Here, `current` is the untrusted header and `verified` is the trusted header; both are submitted to the `verifier.verify()` function. From the point of view of the spec, and also from my personal understanding, an untrusted signed header with an empty commit should be rejected as `Invalid`. The implementation, on the other hand, returns the `NotEnoughTrust` verdict.
verdict. My understanding is that this comes from the specific order of executing checks in this code, probably for performance reasons, as verifying the signatures is costly.
We could address this from different angles:
- Change the spec to allow more behaviors in this case. While this is doable, in my view a signed header with an empty commit should still fall under the `Invalid` category, not under `NotEnoughTrust`.
- Change the order of checks in the code; I don't know whether this is doable or acceptable from a performance point of view.
- Add some lightweight checks that would verify, e.g., that the commit contains more than 1/3 of votes from the validator set, without actually verifying the signatures. This is going to be lightweight from the performance point of view, but would make the code a bit more complicated.
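The third option could look like the following sketch (hypothetical names and data layout; it assumes commit signatures can be matched to validators by id): optimistically sum the voting power behind the signatures present in the commit, without verifying any of them, and reject early when even that sum cannot exceed 1/3 of the validator set's power.

```rust
// Hypothetical lightweight pre-check: tally the voting power of the
// validators that appear in the commit without verifying signatures.
// If even this optimistic tally cannot exceed 1/3 of the total power,
// the commit can be rejected as invalid right away.

struct Validator {
    id: &'static str,
    power: u64,
}

fn enough_potential_power(commit_ids: &[&str], validators: &[Validator]) -> bool {
    let total: u64 = validators.iter().map(|v| v.power).sum();
    let committed: u64 = validators
        .iter()
        .filter(|v| commit_ids.contains(&v.id))
        .map(|v| v.power)
        .sum();
    // more than 1/3 of the total power, in integer arithmetic
    committed * 3 > total
}

fn main() {
    let validators = [
        Validator { id: "n2", power: 50 },
        Validator { id: "n3", power: 50 },
    ];
    // The failing test's empty commit: no signatures at all.
    assert!(!enough_potential_power(&[], &validators));
    // A commit by "n2" alone already carries 50/100 > 1/3.
    assert!(enough_potential_power(&["n2"], &validators));
}
```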
I think @romac should probably decide here; I am open to discussion.
P.S. The English spec (see especially the text below under "Details of the Functions") is also not very precise about how this is computed (e.g., in multiple places it talks about the number of validators instead of their voting power), so it should probably also be improved. (cc @josef-widder?)
About this issue
- State: closed
- Created 4 years ago
- Comments: 25 (25 by maintainers)
Commits related to this issue
- Ensure there is at least one non-absent signatures in the commit. See #650 — committed to informalsystems/tendermint-rs by romac 4 years ago
- Fix bug where a commit with only absent signatures would be deemed valid instead of invalid (#652) * Ensure there is at least one non-absent signatures in the commit. See #650 * Update changelog ... — committed to informalsystems/tendermint-rs by romac 4 years ago
Ok, now I understand more, and I think this is a real bug in the implementation.
I actually have two distinct tests with an empty commit. The one above fails, but the one below passes:
This one has the outcome:
Examining this outcome points to `CommitValidator.validate()`. The reason why this check fires for the second case is that the validator set is empty there, but non-empty in the first case. This results in the following translations by Testgen:
First case:
Second case:
As can be seen, Testgen actually models the absent signatures as `BlockIDFlagAbsent`, and this case is not accounted for by `CommitValidator.validate()`.

Note this is already how the protocol works for the on-chain IBC handler, which is the more important case anyway. We've been trying to align how we think about the light node to be closer to the IBC handler, and this would be a significant step in that direction as well.
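The fix referenced in the commits above, ensuring that the commit contains at least one non-absent signature, can be sketched as follows (the flag names mirror Tendermint's block-id flags, but the types here are illustrative):

```rust
// Sketch of the check from the fix: a commit in which every signature is
// marked BlockIDFlagAbsent should be rejected as invalid. The enum below
// is illustrative, not the actual tendermint-rs type.

#[derive(PartialEq)]
enum BlockIdFlag {
    Absent,
    Commit,
    Nil,
}

fn has_non_absent_signature(flags: &[BlockIdFlag]) -> bool {
    flags.iter().any(|f| *f != BlockIdFlag::Absent)
}

fn main() {
    use BlockIdFlag::*;
    // All-absent commit, as produced by Testgen for the empty commit: rejected.
    assert!(!has_non_absent_signature(&[Absent, Absent, Absent]));
    // At least one real signature present: passes this check.
    assert!(has_non_absent_signature(&[Absent, Commit, Nil]));
}
```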
@konnov I was in the process of replacing our JSON fixtures with TLA+ tests, and wrote one TLA+ test that checks for <2/3 validator power in the commit, and another test that has an empty commit. In fact, both of those were kind of "flickering": on one execution of Apalache the test passes, on another it fails… What I've definitely learned so far is not to ignore any failing test;) Because it's so simple to rerun Apalache, and the test will pass…
What this also shows, I think, is that we should eventually incorporate running Apalache into the CI, not only run the static auto-generated tests, because each time you run the model checker, the produced test is a bit different.
Re: https://github.com/informalsystems/tendermint-rs/pull/652#pullrequestreview-516103001
That's cool! I am glad that the problem we've discovered helps to rethink the protocols; this is the best outcome possible. I agree that the current LightClient protocol doesn't look repairable to any satisfactory state: for a malicious peer there will always be a possibility to trigger long computations. The idea of changing the client-server protocol also sounds very nice to me! It would mean that misbehavior of a full node is detectable in one step: if it provides a header that's not immediately verifiable by the LightClient.
One issue with what we've found is that our current single-step MBT tests look at only one step of the whole protocol, i.e., a very local view that doesn't tell the whole story. So we need to do the bisection tests sooner rather than later. What also comes to mind wrt. this is that we probably want MBT to drive performance tests for us: it would generate different scenarios/schedules for the protocols, and besides checking logical correctness we would also measure wall-clock performance and store the outliers for later analysis. One more point for integrating MBT as a continuously running service generating tests.
…and record the failing ones? Yes, that may work.
Awesome work, @andrey-kuprianov!
Yep, that’s correct.
Awesome, will take a look at it asap!
I think we want both modes probably: with initialization for predictability, and without – for variability;)