cosmos-sdk: Gaiad network crash in executing redelegation transactions

Summary of Bug

When I sent two redelegate transactions between the same pair of validators (unbond transactions might reproduce this issue too), the blockchain network hit a consensus failure and stopped producing new blocks.

E[09-05|07:45:54.249] Error on ApplyBlock. Did the application crash? Please restart tendermint module=consensus err="Commit failed for application: Error changing validator set: Failed to remove validator BE3D48310A7CB5284A6FF73A48A9F0E1E2CC25A5"

The error above is logged in tendermint/state/execution.go.

Code for reproduce

Steps to Reproduce

  1. Create a gaia network with two nodes on the same machine. Only the first node is a validator.
    gaiad init --name testA --home $HOME/testA --chain-id find-bug
    gaiad init --name testB --home $HOME/testB --chain-id find-bug
    cp $HOME/testA/config/genesis.json $HOME/testB/config/genesis.json
    gaiad start --home $HOME/testA

  2. Edit $HOME/testB/config/config.toml:
    1. Assign testA’s p2p address to persistent_peers, for instance:
    # Comma separated list of nodes to keep persistent connections to
    persistent_peers = "81ce2efeefbf9634d99063b4c704a9ee9dd044c7@10.0.2.15:26656"

    2. Change the following items to avoid port conflicts:
    proxy_app = "tcp://127.0.0.1:26648"
    rpc.laddr = "tcp://0.0.0.0:26647"
    p2p.laddr = "tcp://0.0.0.0:26646"

  3. Start the testB node. At this point testB is not a validator.
  4. Create a new validator:
    1. Get the public key and mark it as publicKeyB:
    gaiad tendermint show-validator --home $HOME/testB

    2. Send tokens to addrB:
    gaiacli send --to=<addrB> --from=testA --amount=10steak --chain-id=find-bug

    3. Send a transaction to create the new validator:
    gaiacli stake create-validator --address-delegator=<addrB> --chain-id=find-bug --from=testB --pubkey=<publicKeyB> --amount 5steak

    4. Query the validators. There will be two of them; mark their validator addresses as validatorA and validatorB:
    gaiacli stake validators

  5. Send two redelegate transactions:
    1. First redelegation:
    gaiacli stake redelegate begin --addr-validator-source=<validatorB> --addr-validator-dest=<validatorA> --chain-id find-bug --shares-percent=0.95 --from testB

    2. Second redelegation:
    gaiacli stake redelegate begin --addr-validator-source=<validatorB> --addr-validator-dest=<validatorA> --chain-id find-bug --shares-percent=0.03 --from testB

  6. The blockchain network then crashes.

Analysis of Bug

Currently, the staking module uses sdk.Dec as the data type for token amounts and shares. When calculating voting power, it converts the value to an int64:

v.BondedTokens().RoundInt64()

After the first redelegation is done, the bonded tokens remaining on validatorB are 0.25 steak (5 steak × 0.05), which rounds to a voting power of zero. So once the first redelegation is done, validatorB is removed from the validator set in Tendermint.
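The rounding step can be checked with a small standalone sketch. It does not use sdk.Dec: amounts are held as int64 micro-steak, and roundToInt64 and remainingAfter are hypothetical helpers written for this illustration. It assumes round-half-to-even and assumes that --shares-percent applies to the shares currently held; neither choice changes the outcome for the amounts in the reproduction steps.

```go
package main

import "fmt"

// scale is the number of micro-steak in one steak. This fixed-point
// representation is a stand-in for sdk.Dec, not the SDK's actual type.
const scale int64 = 1_000_000

// roundToInt64 rounds a non-negative micro-steak amount to whole steak,
// rounding halves to the nearest even integer (assumed rounding mode).
func roundToInt64(m int64) int64 {
	q, r := m/scale, m%scale
	switch {
	case 2*r > scale:
		q++
	case 2*r == scale && q%2 == 1:
		q++
	}
	return q
}

// remainingAfter returns what is left of bonded after moving away the
// fraction percentMicro/scale of it.
func remainingAfter(bonded, percentMicro int64) int64 {
	return bonded - bonded*percentMicro/scale
}

func main() {
	bonded := 5 * scale // the 5steak used in create-validator

	bonded = remainingAfter(bonded, 950_000) // --shares-percent=0.95
	fmt.Printf("remaining=%d micro-steak, power=%d\n", bonded, roundToInt64(bonded))
	// prints: remaining=250000 micro-steak, power=0

	bonded = remainingAfter(bonded, 30_000) // --shares-percent=0.03
	fmt.Printf("remaining=%d micro-steak, power=%d\n", bonded, roundToInt64(bonded))
	// prints: remaining=242500 micro-steak, power=0
}
```

Both redelegations leave validatorB with a non-zero bonded amount whose rounded voting power is zero.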

When the second redelegation transaction is executed, EndBlock produces a new validator set change that sets validatorB's voting power to zero again.

However, validatorB has already been removed from the validator set, so the second remove operation causes the fatal error.
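The failure mode can be modelled in a few lines. This is a toy sketch, not Tendermint's code: valSet, update and apply are names invented here, and validatorA's power of 100 is an arbitrary placeholder. The only behavior taken from the report is that a zero-power update means "remove" and that removing an absent validator is an error.

```go
package main

import (
	"errors"
	"fmt"
)

// update has the shape of a validator set change: an address and a new power.
type update struct {
	addr  string
	power int64
}

// valSet is a toy stand-in for the consensus validator set (address -> power).
type valSet map[string]int64

var errRemove = errors.New("failed to remove validator")

// apply treats power 0 as a removal request and is strict about it:
// asking to remove a validator that is not in the set is an error.
func (vs valSet) apply(u update) error {
	if u.power == 0 {
		if _, ok := vs[u.addr]; !ok {
			return fmt.Errorf("%w %s", errRemove, u.addr)
		}
		delete(vs, u.addr)
		return nil
	}
	vs[u.addr] = u.power
	return nil
}

func main() {
	vs := valSet{"validatorA": 100, "validatorB": 5}

	// End of the block containing the first redelegation: power rounds to 0.
	fmt.Println(vs.apply(update{"validatorB", 0})) // prints: <nil>

	// End of the block containing the second redelegation: power is 0 again.
	fmt.Println(vs.apply(update{"validatorB", 0}))
	// prints: failed to remove validator validatorB
}
```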

Ideas about bug fix

The simplest way to fix this issue is to change the code in Tendermint: check whether the validator exists before executing the remove operation.

Maybe we can also fix this bug in the staking module. Currently in staking, a validator is removed only when its bonded tokens are zero. Maybe its voting power should be taken into consideration here as well.
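One possible shape for a staking-side guard is sketched below. It is an illustration only, not the change made in any SDK pull request: tracker and diff are hypothetical names, and the idea is simply to remember the last power reported to consensus for each validator and to skip a zero-power update for a validator that consensus no longer knows about.

```go
package main

import "fmt"

// update has the shape of a validator set change: an address and a new power.
type update struct {
	addr  string
	power int64
}

// tracker remembers the last power reported to consensus per validator.
// It is a hypothetical helper, not an SDK type.
type tracker struct {
	lastPower map[string]int64
}

// diff returns the update to send for addr, or false when nothing should be
// sent because consensus already reflects newPower.
func (t *tracker) diff(addr string, newPower int64) (update, bool) {
	old, known := t.lastPower[addr]
	if newPower == 0 {
		if !known {
			// Already absent from the consensus set: do not ask for removal again.
			return update{}, false
		}
		delete(t.lastPower, addr)
		return update{addr, 0}, true
	}
	if known && old == newPower {
		return update{}, false
	}
	t.lastPower[addr] = newPower
	return update{addr, newPower}, true
}

func main() {
	tr := &tracker{lastPower: map[string]int64{"validatorB": 5}}

	_, sent := tr.diff("validatorB", 0) // first redelegation
	fmt.Println(sent)                   // prints: true

	_, sent = tr.diff("validatorB", 0) // second redelegation
	fmt.Println(sent)                  // prints: false
}
```

With a guard like this the strict check on the Tendermint side can stay in place, since the application never sends the redundant removal.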

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 23 (23 by maintainers)

Most upvoted comments

So how about just removing the validator in staking if its rounded voting power is zero?

We can’t do this as the validator is still bonded and rightfully so. What @rigelrozanski is suggesting is a cleaner and better approach. I’ll amend my PR shortly. Thanks all 👍

Thanks @HaoyangLiu - https://github.com/cosmos/cosmos-sdk/pull/2238 will fix this, just as you suggest (the latter option): checking whether a validator is bonded or not before we tell Tendermint to remove it. We still want Tendermint to only delete validators previously in the validator set because it serves as an additional sanity check on the SDK staking state machine.

Btw @HaoyangLiu, you can use the make localnet-start|stop command to easily create local testnets via docker-compose. All the ports are mapped to localhost. Once the network is running, you can then add nodes for your convenience.

Was able to reproduce this on a local testnet (4 nodes) by simply calling gaiacli stake redelegate begin a bunch of times until a node crashed.

Yes, I will paste logs shortly…and try to tackle this.

I’m thinking this should ideally be handled in the SDK state machine. I don’t think the second tx should be valid and make it through under these circumstances. Seems like we might need to take a look at Keeper#unbond and/or its caller, BeginRedelegation, in more detail.

Side question, is it valid to create a bunch of redelegation begin txs from the same src to the same dst without completing them @cwgoes @rigelrozanski?