tendermint-rs: Wrong validation of LightBlock's commits
I have a failing MBT test for the LightClient verifier, with an empty commit in the LightBlock.
This is how the test fails:
> step 2, expecting Invalid
> lite: NotEnoughTrust(NotEnoughTrust(VotingPowerTally { total: 100, tallied: 0, trust_threshold: TrustThresholdFraction { numerator: 1, denominator: 3 } }))
The relevant test part is this:
```tla
[current |->
   [Commits |-> {},
    header |->
      [NextVS |-> { "n1", "n2", "n3", "n4" },
       VS |-> { "n2", "n3" },
       height |-> 5,
       lastCommit |-> { "n1", "n3" },
       time |-> 6]],
 now |-> 1401,
 verdict |-> "INVALID",
 verified |->
   [Commits |-> { "n1", "n2", "n4" },
    header |->
      [NextVS |-> { "n2", "n3" },
       VS |-> { "n1", "n2", "n4" },
       height |-> 2,
       lastCommit |-> { "n1", "n2", "n3", "n4" },
       time |-> 3]]]
```
Here, `current` is the untrusted header and `verified` is the trusted header; both are submitted to the `verifier.verify()` function. From the point of view of the spec, and also from my personal understanding, an untrusted signed header with an empty commit should be rejected as `Invalid`. The implementation, on the other hand, returns the `NotEnoughTrust` verdict.
verdict. My understanding is that this comes from the specific order of executing checks in this code, probably for performance reasons, as verifying the signatures is costly.
We could address this from different angles:
- Change the spec to allow more behaviors in this case. While this is doable, in my view a signed header with an empty commit should still fall under the `Invalid` category, not under `NotEnoughTrust`.
- Change the order of checks in the code; I don't know whether this is doable or acceptable from a performance point of view.
- Add some lightweight checks that would verify, e.g., that the commit contains more than 1/3 of votes from the validator set, without actually verifying the signatures. This is going to be lightweight from the performance point of view, but would make the code a bit more complicated.
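The third option could look like the following sketch (hypothetical names and data layout; it assumes commit signatures can be matched to validators by id): optimistically sum the voting power behind the signatures present in the commit, without verifying any of them, and reject early when even that sum cannot exceed 1/3 of the validator set's power.

```rust
// Hypothetical lightweight pre-check: tally the voting power of the
// validators that appear in the commit without verifying signatures.
// If even this optimistic tally cannot exceed 1/3 of the total power,
// the commit can be rejected as invalid right away.

struct Validator {
    id: &'static str,
    power: u64,
}

fn enough_potential_power(commit_ids: &[&str], validators: &[Validator]) -> bool {
    let total: u64 = validators.iter().map(|v| v.power).sum();
    let committed: u64 = validators
        .iter()
        .filter(|v| commit_ids.contains(&v.id))
        .map(|v| v.power)
        .sum();
    // more than 1/3 of the total power, in integer arithmetic
    committed * 3 > total
}

fn main() {
    let validators = [
        Validator { id: "n2", power: 50 },
        Validator { id: "n3", power: 50 },
    ];
    // The failing test's empty commit: no signatures at all.
    assert!(!enough_potential_power(&[], &validators));
    // A commit by "n2" alone already carries 50/100 > 1/3.
    assert!(enough_potential_power(&["n2"], &validators));
}
```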
I think @romac should probably decide here; I am open to discussion.
P.S. The English spec (see especially the text below under "Details of the Functions") is also not very precise about how this is computed (e.g., in multiple places it talks about the number of validators instead of their voting power), so it should probably also be improved. (cc @josef-widder?)
About this issue
- State: closed
- Created 4 years ago
- Comments: 25 (25 by maintainers)
Commits related to this issue
- Ensure there is at least one non-absent signatures in the commit. See #650 — committed to informalsystems/tendermint-rs by romac 4 years ago
- Fix bug where a commit with only absent signatures would be deemed valid instead of invalid (#652) * Ensure there is at least one non-absent signatures in the commit. See #650 * Update changelog ... — committed to informalsystems/tendermint-rs by romac 4 years ago
Ok, now I understand more, and I think this is a real bug in the implementation.
I actually have two distinct tests with an empty commit. The one above fails, but the one below passes:
This one has the outcome:
Examining this outcome points to `CommitValidator.validate()`. The reason why this check fires for the second case is that the validator set is empty there, but non-empty in the first case. This results in the following translations by Testgen:
First case:
Second case:
As can be seen, Testgen actually models the absent signatures as `BlockIDFlagAbsent`, and this case is not accounted for by `CommitValidator.validate()`.

Note this is already how the protocol works for the on-chain IBC handler, which is the more important case anyway. We've been trying to align how we think about the light node to be closer to the IBC handler, and this would be a significant step in that direction as well.
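The fix referenced in the commits above, ensuring that the commit contains at least one non-absent signature, can be sketched as follows (the flag names mirror Tendermint's block-id flags, but the types here are illustrative):

```rust
// Sketch of the check from the fix: a commit in which every signature is
// marked BlockIDFlagAbsent should be rejected as invalid. The enum below
// is illustrative, not the actual tendermint-rs type.

#[derive(PartialEq)]
enum BlockIdFlag {
    Absent,
    Commit,
    Nil,
}

fn has_non_absent_signature(flags: &[BlockIdFlag]) -> bool {
    flags.iter().any(|f| *f != BlockIdFlag::Absent)
}

fn main() {
    use BlockIdFlag::*;
    // All-absent commit, as produced by Testgen for the empty commit: rejected.
    assert!(!has_non_absent_signature(&[Absent, Absent, Absent]));
    // At least one real signature present: passes this check.
    assert!(has_non_absent_signature(&[Absent, Commit, Nil]));
}
```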
@konnov I was in the process of replacing our JSON fixtures with TLA+ tests, and wrote one TLA+ test that checks for <2/3 validator power in the commit, and another test that has an empty commit. In fact, both of those were kind of "flickering": on one execution of Apalache the test passes, on another it fails… What I've definitely learned so far is not to ignore any failing test;) Because it's so simple to rerun Apalache, and the test will pass…
What this also shows, I think, is that we should eventually incorporate running Apalache into the CI, not only run the static auto-generated tests, because each time you run the model checker, the produced test is a bit different.
Re: https://github.com/informalsystems/tendermint-rs/pull/652#pullrequestreview-516103001
That's cool! I am glad that the problem we've discovered helps to rethink the protocols; this is the best outcome possible. I agree that the current LightClient protocol doesn't look repairable to any satisfactory state: for a malicious peer there will always be a possibility to trigger long computations. The idea of changing the client-server protocol also sounds very nice to me! It would mean that misbehavior of a full node is detectable in one step: if it provides a header that's not immediately verifiable by the LightClient.
One issue with what we've found is that our current single-step MBT tests look at only one step of the whole protocol, i.e., a very local view that doesn't tell the whole story. So we need to do the bisection tests sooner rather than later. What also comes to mind wrt. this is that we probably want MBT to drive performance tests for us: it would generate different scenarios/schedules for the protocols, and besides checking logical correctness we would also measure wall-clock performance and store the outliers for later analysis. One more point for integrating MBT as a continuously running service generating tests.
…and record the failing ones? Yes, that may work.
Awesome work, @andrey-kuprianov!
Yep, that’s correct.
Awesome, will take a look at it asap!
I think we want both modes probably: with initialization for predictability, and without – for variability;)