cosmos-sdk: Network stops producing blocks after upgrade from v0.45.x to v0.46.0-rc1
Summary of Bug
After completing the software upgrade using the fix and instructions provided here: https://github.com/cosmos/cosmos-sdk/pull/12028, a multi-node network stops producing blocks once the upgrade handler is applied. All the nodes in the network lose their p2p connections and do not attempt to dial the node addresses specified in persistent_peers.
6:29AM INF not caught up yet height=151 max_peer_height=0 module=blockchain timeout_in=997.693432
6:29AM ERR no progress since last advance last_advance=2022-05-25T06:28:33Z module=blockchain
6:29AM INF switching to consensus module=consensus
6:29AM INF starting service impl=ConsensusState module=consensus service=State
6:29AM INF starting service impl=baseWAL module=consensus service=baseWAL wal=/root/.simapp/data/cs.wal/wal
6:29AM INF starting service impl=Group module=consensus service=Group wal=/root/.simapp/data/cs.wal/wal
6:29AM INF starting service impl=TimeoutTicker module=consensus service=TimeoutTicker
6:29AM INF Searching for height height=151 max=0 min=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Searching for height height=150 max=0 min=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Found height=150 index=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Catchup by replaying consensus messages height=151 module=consensus
6:29AM INF Replay: Done module=consensus
6:29AM INF Timed out dur=-59039.217293 height=151 module=consensus round=0 step=1
6:29AM INF received proposal module=consensus proposal={"Type":32,"block_id":{"hash":"E72AEB39BA6E73197AE4EB94D37699544FBFD4C03BEB43AE1BF8E23EDC9B6AC1","parts":{"hash":"2A75B30323352A17714584961D9426734304D1941F19589E5778666ECB730991","total":1}},"height":151,"pol_round":-1,"round":0,"signature":"IgKgHu9ZjzjTga4Y6uXmv7faVVLAEVZapWV3glSb9hmGx8cnl6thJOVw5LvUIMsquadL0SZdfZ54u9JvB5H/Ag==","timestamp":"2022-05-25T06:29:33.398587982Z"}
6:29AM INF received complete proposal block hash=E72AEB39BA6E73197AE4EB94D37699544FBFD4C03BEB43AE1BF8E23EDC9B6AC1 height=151 module=consensus
6:29AM INF Timed out dur=3000 height=151 module=consensus round=0 step=3
The log snippet posted above was taken from the validator that was the proposer of the block at upgrade height + 1. It made no attempt to establish peer connections with the rest of the nodes and stalled at that point.
6:29AM INF switching to consensus module=consensus
6:29AM INF starting service impl=ConsensusState module=consensus service=State
6:29AM INF starting service impl=baseWAL module=consensus service=baseWAL wal=/root/.simapp/data/cs.wal/wal
6:29AM INF starting service impl=Group module=consensus service=Group wal=/root/.simapp/data/cs.wal/wal
6:29AM INF starting service impl=TimeoutTicker module=consensus service=TimeoutTicker
6:29AM INF Searching for height height=151 max=0 min=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Searching for height height=150 max=0 min=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Found height=150 index=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Catchup by replaying consensus messages height=151 module=consensus
6:29AM INF Replay: Done module=consensus
6:29AM INF Timed out dur=-59036.332995 height=151 module=consensus round=0 step=1
6:29AM INF Timed out dur=3000 height=151 module=consensus round=0 step=3
The log snippet posted above was observed on the rest of the validator nodes in the network.
The number of p2p connections on each node was verified using curl localhost:26657/net_info | jq .result.n_peers, which returned 0 in all cases.
The migration performed by the upgrade handler was verified by observing the logs:
6:28AM INF applying upgrade "v045-to-v046" at height: 150
6:28AM INF migrating module authz from version 1 to version 2
6:28AM INF migrating module bank from version 2 to version 3
6:28AM INF migrating module feegrant from version 1 to version 2
6:28AM INF migrating module gov from version 2 to version 3
6:28AM INF adding a new module: group
6:28AM INF adding a new module: nft
6:28AM INF migrating module staking from version 2 to version 3
6:28AM INF migrating module upgrade from version 1 to version 2
6:28AM INF minted coins from module account amount=1441stake from=mint module=x/bank
6:28AM INF executed block height=150 module=consensus num_invalid_txs=0 num_valid_txs=0
6:28AM INF commit synced commit=436F6D6D697449447B5B3634203138312031303620323037203337203635203435203134352037312039203235312032333620393020343120363320313232203139382031302031383020343120362032343920313738203232302032353520393920313937203139392031373320313030203930203138315D3A39367D
6:28AM INF committed state app_hash=40B56ACF25412D914709FBEC5A293F7AC60AB42906F9B2DCFF63C5C7AD645AB5 height=150 module=consensus num_txs=0
6:28AM INF Completed ABCI Handshake - Tendermint and App are synced appHash="�<.��$�&ΰ\b���\x11-�\x00=\"swբy�z_X���" appHeight=149 module=consensus
6:28AM INF Version info block=11 mode=validator p2p=8 tmVersion=0.35.0-unreleased
6:28AM INF This node is a validator addr=965F201169F1B5C975C026CCC9A0A5F8D8DC7578 module=consensus pubKey="�H������\x12\x1e\x11\x1b�P%EW�\x1e:�\a\x15}���c�h��"
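For context, the upgrade log lines above match the standard v0.46 upgrade-handler wiring. The following is a minimal sketch of that pattern, assuming simapp-style naming; the field names app.mm, app.configurator, and UpgradeKeeper follow simapp conventions and the plan name "v045-to-v046" is taken from the logs, so the actual app may differ.

```go
package simapp

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
	"github.com/cosmos/cosmos-sdk/types/module"
	upgradetypes "github.com/cosmos/cosmos-sdk/x/upgrade/types"
)

// registerUpgradeHandlers is a sketch of the upgrade handler that would emit
// the "applying upgrade" / "migrating module ..." lines shown above.
// Field names (mm, configurator, UpgradeKeeper) follow simapp conventions.
func (app *SimApp) registerUpgradeHandlers() {
	app.UpgradeKeeper.SetUpgradeHandler(
		"v045-to-v046",
		func(ctx sdk.Context, plan upgradetypes.Plan, fromVM module.VersionMap) (module.VersionMap, error) {
			// RunMigrations runs each module's registered in-place migrations
			// (e.g. bank v2 -> v3, gov v2 -> v3) and initializes newly added
			// modules such as group and nft from their default genesis state.
			return app.mm.RunMigrations(ctx, app.configurator, fromVM)
		},
	)
}
```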
This issue does not occur on a localnet with a single-node network.
Version
https://github.com/cosmos/cosmos-sdk/pull/12028
Steps to Reproduce
For Admin Use
- Not duplicate issue
- Appropriate labels applied
- Appropriate contributors tagged
- Contributor assigned/self-assigned
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 15 (13 by maintainers)
@kaustubhkapatral do you mind dumping a copy of the tendermint config.toml file?

Ok so looking a bit deeper at NetInfo, it seems that it should return all the addresses in the peerStore, not just the peers a node is currently connected to. This means that the nodes aren't successfully adding the addresses that were stated in the config.toml. If you move the list of peers from persistent_peers to bootstrap_peers and run the nodes again, does it start to dial?

The other thing I can try to do is add logs to check that the addresses are being added in a local testnet.
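A minimal sketch of that suggested experiment, assuming a standard [p2p] section in config.toml; the node IDs and addresses below are placeholders, and the exact key spelling should be checked against the config file generated by the Tendermint 0.35 node:

```toml
[p2p]
# Before: peers listed under persistent_peers (placeholder IDs/addresses).
# persistent_peers = "nodeid1@10.0.0.1:26656,nodeid2@10.0.0.2:26656"

# Suggested experiment: move the same list to bootstrap_peers instead.
persistent_peers = ""
bootstrap_peers = "nodeid1@10.0.0.1:26656,nodeid2@10.0.0.2:26656"
```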