cosmos-sdk: Add rollback support in the event of an incorrect hash

Problem Definition

The original issue can be found here. In short, when a non-deterministic app hash occurs or an upgrade fails, Tendermint will have persisted the incorrect AppHash and nodes will be unable to make progress. The application needs to revert to its previous state, and Tendermint needs to roll back to its previous state as well. On startup, Tendermint can then replay the last block and should arrive at the correct AppHash and continue.
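The recovery flow described above can be illustrated with a small toy model. This is only a sketch: `Node`, `Block`, `Apply`, `Rollback` and the `fault` parameter are hypothetical names invented for the example, not Tendermint or Cosmos SDK types, and the model collapses application state and Tendermint state into a single per-height hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// Block is a toy block: a height plus the transactions it carries.
type Block struct {
	Height int64
	Txs    []string
}

// Node is a toy node that records one state hash per committed height.
// (Hypothetical type: real nodes keep app state and Tendermint state apart.)
type Node struct {
	height int64
	hashes map[int64]string
}

func hashOf(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// NewNode starts every node from the same genesis hash.
func NewNode() *Node {
	return &Node{hashes: map[int64]string{0: hashOf("genesis")}}
}

// Apply executes a block on top of the current state. A non-empty fault
// models a non-deterministic bug that perturbs the resulting hash.
func (n *Node) Apply(b Block, fault string) {
	prev := n.hashes[n.height]
	n.hashes[b.Height] = hashOf(prev + strings.Join(b.Txs, ",") + fault)
	n.height = b.Height
}

// Rollback discards the state of the latest height only.
func (n *Node) Rollback() error {
	if n.height == 0 {
		return errors.New("nothing to roll back")
	}
	delete(n.hashes, n.height)
	n.height--
	return nil
}

// AppHash returns the hash recorded for the latest committed height.
func (n *Node) AppHash() string { return n.hashes[n.height] }

func main() {
	block := Block{Height: 1, Txs: []string{"tx1", "tx2"}}

	healthy, faulty := NewNode(), NewNode()
	healthy.Apply(block, "")
	faulty.Apply(block, "bug") // diverges from the rest of the network
	fmt.Println("hashes match before rollback:", healthy.AppHash() == faulty.AppHash())

	// Roll back the bad height, then replay the same block without the fault.
	if err := faulty.Rollback(); err != nil {
		fmt.Println("rollback failed:", err)
		return
	}
	faulty.Apply(block, "")
	fmt.Println("hashes match after replay:", healthy.AppHash() == faulty.AppHash())
}
```

The point of the sketch is that the block itself is never changed; only the state derived from it is discarded and recomputed.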

Proposal

Work on the Tendermint side is underway here and will be backported to v0.34.14 once it is merged. It exposes a public function, RollbackState, which the SDK can use to provide the necessary rollback tooling.

cc @aaronc, @robert-zaremba, @ethanfrey

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 15
  • Comments: 27 (16 by maintainers)

Most upvoted comments

Reopening as it seems this isn’t fixed yet

Oh, no. This is super important for recovering chains from some non-determinism issues. I think it is high priority to finally get this fixed.

So I investigated using the Tendermint rollback command in combination with modified application state to help the Thorchain team recover a halted network.

One thing the Tendermint rollback command doesn’t do is delete the block that was rolled back. This seems to cause the ABCI replay handshake to fail because it tries to apply the same block again.

Just an update on the Tendermint side. We’ve merged changes to master and backported them to the respective branches. We’ll most likely release it in v0.34.14 next week and v0.35.0 the following week (no promises 😃 ). The function you will want to call is this here: https://github.com/tendermint/tendermint/blob/f2a8f5e054cf99ebe246818bb6d71f41f9a30faa/cmd/tendermint/commands/rollback.go#L37

Yup completely agree that we don’t need to expose this via RPC. The process shouldn’t be running when rollback is called (in fact the database will error when trying to open a second connection).

It’s feasible to have a single command even for multi-process instances because no processes should be running. All you need is the tendermint Config so the command knows where to find the database and it will perform the rollback.

This should not be a publicly available RPC endpoint, and it does not need to be run remotely. We do need to handle the multi-process use case, however.

Something like tendermint rollback followed by gaiad rollback seems like it covers the multi-process scenario, and the Tendermint side is already implemented. Exposing this anywhere off of localhost seems like a huge security issue.

NB: this has been requested by many chains over the last few years. I built a tool back in 2018/2019 that did this, but it was never maintained. AppHash mismatches are currently a pain to debug.

We hit a non-determinism issue somehow related to an ibcclientupdate tx, leading to an AppHash mismatch.

While trying to debug and actually get enough info to make a proper bug report, we were hampered by the lack of any tool to try one block earlier. We are now syncing 510,000 blocks to see if we can reproduce it, which is not so fun.

I can also try to lend a hand with a fix on 0.42 as soon as the Tendermint fix is in 0.34.

Do I have that right, @cmwaters?

Yup, with one minor clarification: Tendermint has its own state that tracks validator sets, consensus params and the app hash. We aren’t actually removing any blocks, only the Tendermint state at that height. When the node restarts, it will replay the same transactions in the last block.
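To make the distinction between Tendermint state and stored blocks concrete, here is a minimal sketch. `State`, `BlockStore`, `rollBackStateOnly` and `heightsToReplay` are hypothetical names made up for illustration; they are not Tendermint's types or its actual rollback or handshake logic, only a model of the behaviour described in the comment above.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a toy consensus state: it records only the last height the node
// treats as committed. (Hypothetical type, not Tendermint's state.)
type State struct {
	LastBlockHeight int64
}

// BlockStore is a toy block store; Height is the highest block it holds.
type BlockStore struct {
	Height int64
}

// rollBackStateOnly moves the state back by one height. It never touches
// the block store, so the rolled-back block is still on disk.
func rollBackStateOnly(s State) (State, error) {
	if s.LastBlockHeight <= 0 {
		return s, errors.New("no state to roll back")
	}
	return State{LastBlockHeight: s.LastBlockHeight - 1}, nil
}

// heightsToReplay lists block heights that exist in the store but are not
// yet reflected in the state, i.e. what a restart would need to re-execute.
func heightsToReplay(s State, bs BlockStore) []int64 {
	var heights []int64
	for h := s.LastBlockHeight + 1; h <= bs.Height; h++ {
		heights = append(heights, h)
	}
	return heights
}

func main() {
	state := State{LastBlockHeight: 100}
	store := BlockStore{Height: 100}

	rolled, err := rollBackStateOnly(state)
	if err != nil {
		fmt.Println("rollback failed:", err)
		return
	}
	fmt.Println("state height:", rolled.LastBlockHeight, "store height:", store.Height)
	fmt.Println("heights to replay:", heightsToReplay(rolled, store))
}
```

In this model the store stays one block ahead of the state after a rollback, which is exactly the gap the restart is expected to close by replaying that block.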

How important is it? Is it fine to roll it out with v0.45?

Hard for me to judge how important it is because I haven’t built an app before, but it was requested and it makes sense as a piece of tooling that is required. I think having it in v0.45 is fine. I don’t see why it also couldn’t be backported in the v0.44 range (we will be backporting it into v0.34 Tendermint).

How about this:

  • App will expose a command to roll back state (e.g. simd state rollback x, where x = number of blocks)
  • App will call tendermint to do the same

Tendermint will have a public function (called RollbackState; not sure which package it will be in yet) which takes the tendermint/config.Config as an argument. This means the app should only need a single Cosmos SDK command, i.e. simd state rollback, which rolls back the app state to the previous height and then calls this Tendermint function to roll back the Tendermint state.
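A single command that does both steps could be structured roughly as follows. This is a sketch under stated assumptions: `rollbackFunc`, `rollbackNode` and `counter` are hypothetical helpers, and the in-memory counters merely stand in for the real application store and for Tendermint's RollbackState, whose exact signature and package were not settled at the time of this comment.

```go
package main

import (
	"errors"
	"fmt"
)

// rollbackFunc undoes the latest height of one store and returns the height
// it ends up at. (Hypothetical signature chosen for this sketch.)
type rollbackFunc func() (int64, error)

// rollbackNode rolls back the application first, then the consensus state,
// and checks that both ended at the same height.
func rollbackNode(app, consensus rollbackFunc) (int64, error) {
	appHeight, err := app()
	if err != nil {
		return 0, fmt.Errorf("app rollback: %w", err)
	}
	csHeight, err := consensus()
	if err != nil {
		return 0, fmt.Errorf("consensus rollback: %w", err)
	}
	if appHeight != csHeight {
		return 0, fmt.Errorf("height mismatch after rollback: app=%d consensus=%d", appHeight, csHeight)
	}
	return appHeight, nil
}

// counter builds a rollbackFunc over an in-memory height, standing in for
// a real store in this example.
func counter(height *int64) rollbackFunc {
	return func() (int64, error) {
		if *height <= 0 {
			return 0, errors.New("already at genesis")
		}
		*height--
		return *height, nil
	}
}

func main() {
	appHeight, csHeight := int64(10), int64(10)
	h, err := rollbackNode(counter(&appHeight), counter(&csHeight))
	if err != nil {
		fmt.Println("rollback failed:", err)
		return
	}
	fmt.Println("rolled back to height", h)
}
```

The final height check is a design choice of the sketch, not something the proposal specifies; it simply guards against the two stores ending up out of step.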

I thought about adding a --height flag so users could roll back to a certain height, but I don’t believe it is necessary. Only the last height matters: for every prior height, consensus had already agreed on the same AppHash.