tendermint: Consensus: "enterPrevote: ProposalBlock is invalid" - Error: "wrong signature"

Tendermint version:

Tendermint Core Semantic Version: 0.33.3
P2P Protocol Version: 7
Block Protocol Version: 10

"node_info": {
    "protocol_version": {
      "p2p": "7",
      "block": "10",
      "app": "0"
    },
    "id": "dbc39feecf277f59b4b16ae277e8545c54ac244a",
    "listen_addr": "tcp://0.0.0.0:26656",
    "network": "bluzelle",
    "version": "0.33.3",
    "channels": "4020212223303800",
    "moniker": "daemon-sentry-3",
    "other": {
      "tx_index": "on",
      "rpc_address": "tcp://0.0.0.0:26657"
    }
}

ABCI app:

Cosmos SDK Version: v0.38.3

"application_version": {
    "name": "BluzelleService",
    "server_name": "blzd",
    "client_name": "blzcli",
    "version": "0.0.0-74-ge1ee575",
    "commit": "e1ee575051ad2ea18ef22fc6bf7a6fc904612a49",
    "build_tags": "ledger,faucet,cosmos-sdk v0.38.3",
    "go": "go version go1.14.3 linux/amd64"
  }

Big Dipper Explorer URL:

http://explorer.testnet.public.bluzelle.com:3000

Instructions to set up a similar node (I’d suggest just setting up a sentry):

https://github.com/bluzelle/curium/blob/devel/docs/public/buildvalidatorsentry.md

Access to genesis file for chain:

http://a.sentry.bluzellenet.bluzelle.com:1317/genesis.json

Sample command to get node info:

curl --location --request GET 'http://a.sentry.testnet.public.bluzelle.com:1317/node_info'

Discord channel invite (in case you want to live chat with me… I am Neeraj, one of the admins):

https://discord.gg/BbBZJZJ

Environment:

  • OS (e.g. from /etc/os-release):
Distributor ID: Ubuntu
Description:  Ubuntu 18.04.4 LTS
Release:  18.04
Codename: bionic
  • Install tools:

Using COSMOS SDK v0.38.3. Otherwise, not sure what else to say here.

  • Others:

We are running a testnet chain with our CRUD database as one of the application modules, in COSMOS.

We currently (as of filing this issue) have 5 “sentries” and 3 validators. To be clear, the sentries have no voting power and are the only peers the validators talk to (the validators can also talk to each other). Furthermore, the validators are IP-firewalled so that they can only talk to the sentries and the other validators. The sentries themselves keep the validator node IDs private.

Sentry hostnames:

a.sentry.testnet.public.bluzelle.com
b.sentry.testnet.public.bluzelle.com
c.sentry.testnet.public.bluzelle.com
d.sentry.testnet.public.bluzelle.com
e.sentry.testnet.public.bluzelle.com

I am not listing the validator hostnames, since they are inaccessible (due to the firewall) anyways.

The validators are only listening on 26656 to validators and sentries. The sentries are listening on 26656 and 26657 and also each run the cosmos REST server, listening on 1317.

We have opened our testnet to the public. Members of the public have set up sentries and validators of their own and are expected to use our five sentries as their P2P peers in config.toml.

What happened:

For weeks, things on our testnet had been running fine. I had dozens of members of the public running validators on it, just so these people could learn the process of setting up a validator, etc.

I needed to increase the maximum number of allowed validators (to something much higher than the default of 100) via the “app_state/staking/params/max_validators” value in genesis.json. I believe this particular value is a COSMOS thing, but I wanted to mention it for context. We are not using COSMOS governance yet, so we decided to do a hard reset (i.e., generate a new genesis.json and start the chain over from block 0).
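For concreteness, the genesis edit itself is a one-liner with jq. The demo input and the value 300 below are illustrative stand-ins, not our real genesis or target value:

```shell
# Demo input (the real genesis.json is of course much larger):
printf '%s\n' '{"app_state":{"staking":{"params":{"max_validators":100}}}}' > genesis.json

# Bump app_state/staking/params/max_validators with jq
# (300 is an illustrative value, not the one we actually used):
jq '.app_state.staking.params.max_validators = 300' genesis.json > genesis.tmp \
  && mv genesis.tmp genesis.json
```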

First, here is what I did on my OWN 5 sentries and 3 validators:

  • Stopped all my sentries and validators.
  • Wiped out their .blzd folders (the “home” folder for my “blzd” daemons). Because of this, the nodes all get new node IDs and become new “peers”.
  • Re-initialized each sentry and validator with “blzd init”, etc., much as I always do when setting up a validator or sentry from scratch (configuring peers, etc.). I also increased “max_num_inbound_peers” to 800 and “max_num_outbound_peers” to 200 in the [p2p] section of config.toml. This may be only anecdotal in value; I had previously had an issue where too many connections to my sentries caused them to drop connections on the p2p port.
  • Generated the new genesis.
  • Deployed this genesis to all the sentries and validators.
  • Ran the necessary COSMOS commands to get the validators staked, created, etc.
  • Started up all my sentries and validators (thereby starting the new chain from block 0).
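For reference, the peer-limit change from the third step lives in the [p2p] section of config.toml; the key names below are the Tendermint 0.33 ones, and the values are the ones stated above:

```toml
# ~/.blzd/config/config.toml (fragment)
[p2p]
max_num_inbound_peers = 800
max_num_outbound_peers = 200
```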

Next, here is what I asked the people in the community to do with their validator and/or sentries:

  • Run “blzd unsafe-reset-all” on all their daemons. I asked the community to do this, instead of wiping out the “.blzd” folder, to save them some work.
  • Copy over the new genesis.json file, replacing the old genesis.json.
  • Set the new peers list in the p2p section of config.toml.
  • Run the necessary COSMOS commands to get the validators staked, created, etc.
  • Start up their sentries and validators.
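In shell form, the community-side steps sketch out roughly like this (paths are illustrative placeholders, and the block is guarded so it is a no-op on machines without blzd installed):

```shell
# Sketch of the community-side steps (illustrative paths; guarded so it is a
# no-op on machines without blzd installed):
if command -v blzd >/dev/null 2>&1; then
  blzd unsafe-reset-all                            # clear chain state, keep keys/config
  cp new-genesis.json ~/.blzd/config/genesis.json  # replace the old genesis
  # set the new peers in [p2p] of ~/.blzd/config/config.toml, run the COSMOS
  # staking/validator-creation commands, then:
  blzd start
fi
```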

The community slowly started up their daemons.

At some point (within an hour or so, about 2300 blocks in), I started to get the error below. I was getting this on all my sentries and validators. Basically, the chain had completely crashed. I tried to restart my validators and sentries, but this was unrecoverable.

E[2020-05-29|03:03:32.975] enterPrevote: ProposalBlock is invalid       module=consensus height=2285 round=0 err="wrong signature (#35): C683341000384EA00A345F9DB9608292F65EE83B51752C0A375A9FCFC2BD895E0792A0727925845DC13BA0E208C38B7B12B2218B2FE29B6D9135C53D7F253D05"
E[2020-05-29|03:03:35.128] enterPrevote: ProposalBlock is invalid       module=consensus height=2285 round=1 err="wrong signature (#35): C683341000384EA00A345F9DB9608292F65EE83B51752C0A375A9FCFC2BD895E0792A0727925845DC13BA0E208C38B7B12B2218B2FE29B6D9135C53D7F253D05"
E[2020-05-29|03:03:37.255] enterPrevote: ProposalBlock is invalid       module=consensus height=2285 round=2 err="wrong signature (#35): C683341000384EA00A345F9DB9608292F65EE83B51752C0A375A9FCFC2BD895E0792A0727925845DC13BA0E208C38B7B12B2218B2FE29B6D9135C53D7F253D05"
.
.
.

I had no choice but to “reset” the whole chain again. I stopped all my validators and sentries and, this time, ONLY ran “unsafe-reset-all” on all my daemons. Of course, I also had to redo some COSMOS setup (staking, etc.), but I started everything again and asked the community to repeat the same steps listed above with yet another new genesis.json, etc.

Within an hour, the whole network went down again, with effectively the same error (different block and signature hash this time):

.
.
.
E[2020-05-29|06:57:48.621] enterPrevote: ProposalBlock is invalid       module=consensus height=676 round=146 err="wrong signature (#8): 62A6A628CFB1F72D76C48F71A928DD628E29585DD4B861EDF3F216E77FBB0A7C492D2280B218FBA34A0751F02961C2657708711D3F212800CFE847B804F0360D"
.
.
.

What you expected to happen:

I expect “clean” output, like so:

I[2020-05-30|21:53:04.286] Executed block                               module=state height=20416 validTxs=0 invalidTxs=2
I[2020-05-30|21:53:04.309] Committed state                              module=state height=20416 txs=2 appHash=80D70DC5FF062F34D3F79F15FC85CB367A5A7F9CF39B4EE6C1DC68E9F1958EA1

(The fact that invalidTxs is non-zero is the subject of another investigation)

Have you tried the latest version:

Not sure; I think so. Looking at the Tendermint GitHub, though, I see there are two minor versions newer than the one we are running.

How to reproduce it:

I more or less explained how it came about above in the “what happened” section.

Looking at #2720, I see a similar, though not identical, error message. In that issue, it was suggested that perhaps not all the nodes started from the same “genesis” state, and that some node(s) may have had a stale “home folder” (.blzd, in my case).

Does “unsafe-reset-all” actually clear out all state including the COSMOS KV stores, app state, etc? I assume this command is sufficient to accomplish a clean slate?
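From what I can tell reading the Tendermint source (worth confirming), “unsafe-reset-all” wipes the data directory, which in a Cosmos SDK app also holds the application KV stores, and resets priv_validator_state.json while leaving config/ and the node/validator key files intact. Roughly:

```shell
# Approximately what "blzd unsafe-reset-all" clears, per my reading of
# Tendermint's ResetAll (config/ and the node/validator key files are kept):
rm -rf ~/.blzd/data/blockstore.db \
       ~/.blzd/data/state.db \
       ~/.blzd/data/evidence.db \
       ~/.blzd/data/cs.wal \
       ~/.blzd/data/tx_index.db \
       ~/.blzd/data/application.db   # the COSMOS app KV stores also live here
# priv_validator_state.json is reset to height/round 0, not deleted.
```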

Is it possible that, in “resetting” my chain as I did above, some members of the public forgot to run that “blzd unsafe-reset-all” command, so that when they started their nodes, data from the previous chain was left over, and this somehow brought the whole network down? If so, it is a bit scary that a single node (or even a handful of them) could do this. It seems like an excellent DoS attack vector, if so.

Logs:

Listed above.

Config:

No specific changes made to Tendermint.

node command runtime flags:

This is all running from within our daemon that was built with the COSMOS SDK.

/dump_consensus_state output for consensus bugs

Not sure how to do this.
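Update: if I understand the template correctly, this can be fetched from any node’s RPC port (26657 in our setup) via the standard Tendermint RPC endpoint; I can attach the output if useful:

```shell
# Fetch the consensus dump from a node's Tendermint RPC port
# (falls back to a message if no node is reachable locally):
curl -s http://localhost:26657/dump_consensus_state || echo "RPC not reachable"
```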

Anything else we need to know:

Most details given above.

I did some searching ahead of time to see if I could resolve this myself. I saw some issues related to it but they are already closed.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 6
  • Comments: 17 (10 by maintainers)

Most upvoted comments

@melekes

We are running a “Game of Stakes” type competition right now, so I am a bit heads down. But I want to quash/explain this issue, as it is worrisome and should be to others (if it is legitimate and not something dumb on my side).

I will try to make it happen intentionally in the next few weeks. I have the ability to launch new testnets pretty quickly.

Thanks, this is super helpful. Looks like we’ve identified the problem and managed to replicate this issue. Will publish a fix ASAP.

Yes we will share details on all of that once the fix is released.

And we will try to provide a script that should allow you to save your existing testnet. Thanks for your patience!