cosmos-sdk: Consensus fails when using statesync mode to synchronize the application state

Summary of Bug

Consensus fails when a node uses statesync mode to synchronize the application state and then executes an ibc-transfer transaction.

Description

When a cosmos-sdk-based chain is started, the capability/keeper/keeper.go#L177:InitializeCapability(…) method is called to initialize the memStore from the application store. However, if the node is started in statesync mode, the application store is not loaded until the node switches to fastsync mode, and InitializeCapability is not called again at that point to initialize the memStore. As a result, when capability/keeper/keeper.go#L344:GetCapability(…) is called, a node started in statesync mode does not get the same result as the other nodes.
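
The ordering problem described above can be modeled with a toy keeper. This is only a minimal sketch, not the SDK code: `toyKeeper`, its two maps, and the methods `initMemStore`, `restoreSnapshot`, and `getCapability` are hypothetical stand-ins for the capability keeper, the application store, and the memStore, and the name "ports/transfer" is just an illustrative key.

```go
package main

import "fmt"

// toyKeeper is a hypothetical stand-in for the capability keeper.
// persistent models the application store (filled in by state sync);
// mem models the in-memory store that is built once at node start.
type toyKeeper struct {
	persistent map[uint64][]string // capability index -> owner names
	mem        map[string]uint64   // owner name -> capability index
}

func newToyKeeper() *toyKeeper {
	return &toyKeeper{
		persistent: map[uint64][]string{},
		mem:        map[string]uint64{},
	}
}

// initMemStore copies whatever is currently in the persistent store into mem.
// It models the one-time initialization performed when the node starts.
func (k *toyKeeper) initMemStore() {
	for index, owners := range k.persistent {
		for _, name := range owners {
			k.mem[name] = index
		}
	}
}

// restoreSnapshot models state sync populating the persistent store.
func (k *toyKeeper) restoreSnapshot(snapshot map[uint64][]string) {
	for index, owners := range snapshot {
		k.persistent[index] = append([]string(nil), owners...)
	}
}

// getCapability consults only mem, like a lookup during transaction execution.
func (k *toyKeeper) getCapability(name string) (uint64, bool) {
	index, ok := k.mem[name]
	return index, ok
}

func main() {
	snapshot := map[uint64][]string{1: {"ports/transfer"}}

	// Normal start: the store is loaded first, then mem is initialized.
	normal := newToyKeeper()
	normal.restoreSnapshot(snapshot)
	normal.initMemStore()

	// State sync start: mem is initialized while the store is still empty.
	synced := newToyKeeper()
	synced.initMemStore()
	synced.restoreSnapshot(snapshot)

	_, okNormal := normal.getCapability("ports/transfer")
	_, okSynced := synced.getCapability("ports/transfer")
	fmt.Println(okNormal, okSynced) // prints "true false": the nodes disagree
}
```

In this model the two nodes hold identical persistent state but answer the same lookup differently, which is the kind of divergence that would surface as an AppHash mismatch.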

Steps to Reproduce

The GetCapability(…) method is used in the IBC module, so the bug can be reproduced with an ibc-transfer:

  1. Start two testnets via gaia and create a relayer for them, then create clients and channels. See: https://github.com/cosmos/relayer#demo

  2. Create node1 and node2 to join testnet ibc-0:

    gaiad init node1 --home node1
    cp data/ibc-0/config/genesis.json node1/config/genesis.json
    
    gaiad init node2 --home node2
    cp data/ibc-0/config/genesis.json node2/config/genesis.json
    

    Then update the state-sync config in node1/config/app.toml and node2/config/app.toml:

    [state-sync]
    snapshot-interval = 100
    snapshot-keep-recent = 4
    

    Start node1 and node2:

    # NOTE: modify ports and add ibc-0 peer
    gaiad start --home node1
    
    # NOTE: modify ports and add ibc-0 peer
    gaiad start --home node2
    
  3. Create node3 to join testnet ibc-0.

    gaiad init node3 --home node3
    cp data/ibc-0/config/genesis.json node3/config/genesis.json
    

    Update config:

    # config.toml
    [statesync]
    enable = true
    
    rpc_servers = "ibc-0 node rpc"
    trust_height = 1
    trust_hash = "block 1 hash"
    trust_period = "168h0m0s"
    
  4. Send an ibc-transfer

    rly tx transfer ibc-0 ibc-1 1000000samoleans $(rly chains address ibc-1)
    rly tx relay-packets demo -d
    
  5. Start node3 using statesync mode

    # NOTE: modify ports and add ibc-0 peer
    gaiad start --home node3
    

    A consensus failure error occurs when the ibc-transfer transaction is executed:

    NOTE: if the latest block height is already greater than the height at which the ibc-transfer transaction was executed, no error is returned. In that case, run unsafe-reset-all on node3 and repeat steps 4-5.

    4:49PM INF committed state app_hash=0475A43BE9A8BD240551895B01A31C5B1ABACD710DC273D819B863C9F355804C height=34721 module=state num_txs=1
    4:49PM INF indexed block height=34721 module=txindex
    panic: Failed to process committed block (34722:51062BEB78119D7A5CF971B7FEC787C428E8347FE20C308BF98311C2F95BFA1B): wrong Block.Header.AppHash.  Expected 0475A43BE9A8BD240551895B01A31C5B1ABACD710DC273D819B863C9F355804C, got 2443E8D78F4B2025252055EDD384DBF80839893092110C0A5D072DCABED9FB17
    
    goroutine 135 [running]:
    github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine(0xc000548a80, 0xc0032dac01)
      github.com/tendermint/tendermint@v0.34.9/blockchain/v0/reactor.go:401 +0x15bf
    created by github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).SwitchToFastSync
      github.com/tendermint/tendermint@v0.34.9/blockchain/v0/reactor.go:125 +0xd8
    

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 28 (22 by maintainers)

Most upvoted comments

@AdityaSripal the node has been running without issue for 3 days already. No crash no restart.

I have a non-breaking fix up for the 0.42 line here: https://github.com/cosmos/cosmos-sdk/tree/aditya/cap-init

Here’s the diff: https://github.com/cosmos/cosmos-sdk/compare/v0.42.5...aditya/cap-init?expand=1

Unfortunately, the fix I proposed above can only be done efficiently if we move the reverse mapping into the persistent store. The reverse mapping is deterministic, so there’s no issue moving it; it’s just a breaking change. Once that is done, reconstructing the forward mapping and capmap on the fly is trivial. This fix should go into 0.43.
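
The idea of rebuilding the in-memory structures from a persisted reverse mapping can be sketched roughly as follows. This is an illustrative model under assumptions, not the code from the linked branch or from 0.43: `lazyKeeper`, `capability`, `newLazyKeeper`, and the three map fields are hypothetical names, and plain Go maps stand in for the persistent and in-memory stores.

```go
package main

import "fmt"

// capability is a hypothetical stand-in for a capability handle.
type capability struct{ index uint64 }

// lazyKeeper keeps the reverse mapping (name -> index) in a map that models
// the persistent store, and rebuilds the in-memory forward mapping and
// capMap on demand instead of once at startup.
type lazyKeeper struct {
	persistedReverse map[string]uint64      // models persistent state; available after state sync
	memForward       map[*capability]string // in-memory only: capability -> name
	capMap           map[uint64]*capability // in-memory only: index -> capability
}

func newLazyKeeper(reverse map[string]uint64) *lazyKeeper {
	return &lazyKeeper{
		persistedReverse: reverse,
		memForward:       map[*capability]string{},
		capMap:           map[uint64]*capability{},
	}
}

// getCapability resolves a name through the persisted reverse mapping and
// fills in the in-memory entries the first time an index is seen.
func (k *lazyKeeper) getCapability(name string) (*capability, bool) {
	index, ok := k.persistedReverse[name]
	if !ok {
		return nil, false
	}
	c, ok := k.capMap[index]
	if !ok {
		// First lookup since start: reconstruct the in-memory entry.
		c = &capability{index: index}
		k.capMap[index] = c
	}
	k.memForward[c] = name
	return c, true
}

func main() {
	// Only the reverse mapping is present, as it would be right after state sync.
	k := newLazyKeeper(map[string]uint64{"ports/transfer": 7})
	c1, _ := k.getCapability("ports/transfer")
	c2, _ := k.getCapability("ports/transfer")
	fmt.Println(c1 == c2, c1.index, k.memForward[c1]) // prints "true 7 ports/transfer"
}
```

Because the lookup depends only on persisted data, the result in this model no longer depends on whether the in-memory maps were initialized before or after the store was populated.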

I will write tests for this tomorrow, but in the meantime it would be great if someone is able to test it out and see if statesync works.

@chengwenxi you can connect to the following nodes. They both have snapshots.

ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7@139.177.178.149:26656
ee27245d88c632a556cf72cc7f3587380c09b469@45.79.249.253:26656

And if you need RPCs:

https://rpc.cosmoshub.forbole.com/
https://rpc.cosmoshub.bigdipper.live/

This is the issue, right? Does NewApp get called for state sync? Or I guess any usage of capabilities during state sync is a problem?

In IBC, capabilities are created at various times: sometimes during InitChain for binding ports, always during a channel handshake, and at arbitrary times by applications as they decide to bind to new port names.