cosmos-sdk: Add rollback support in the event of an incorrect hash
Problem Definition
The original issue can be found here. As a quick summary, in the event of a non-deterministic app hash or when an upgrade fails, Tendermint will have persisted the incorrect AppHash and nodes will be unable to make progress. To recover, the application needs to revert to its previous state and Tendermint needs to roll back to its previous state as well. On startup, Tendermint can then replay the last block and should arrive at the correct AppHash and continue.
Proposal
Work on the Tendermint side is underway here and will be backported to v0.34.14 when it is merged. It exposes a public function RollbackState which the SDK can use to provide the rollback tooling necessary.
cc @aaronc, @robert-zaremba, @ethanfrey
About this issue
- State: closed
- Created 3 years ago
- Reactions: 15
- Comments: 27 (16 by maintainers)
Commits related to this issue
- Implement rollback command Closes: #10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch — committed to yihuang/cosmos-sdk by yihuang 2 years ago
- Implement rollback command (#11179) Closes: #10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch — committed to cosmos/cosmos-sdk by yihuang 2 years ago
- Implement rollback command (#11179) Closes: #10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch (cherry picked from commit 8296ad959269927e6de167f16b4d80789f1ce6c7... — committed to cosmos/cosmos-sdk by yihuang 2 years ago
- Implement rollback command (#11179) (#11314) Closes: #10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch (cherry picked from commit 8296ad959269927e6de167f... — committed to cosmos/cosmos-sdk by mergify[bot] 2 years ago
- Implement rollback command (#11179) (#11314) Closes: #10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch (cherry picked from commit 8296ad959269927e6de167f16b4d807... — committed to yihuang/cosmos-sdk by mergify[bot] 2 years ago
- Implement rollback command (#11179) (#11314) Closes: #10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch (cherry picked from commit 8296ad959269927e6de167f16b4d807... — committed to FunctionX/cosmos-sdk by mergify[bot] 2 years ago
- Implement rollback command (#11179) (#11314) Closes: #10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch (cherry picked from commit 8296ad959269927e6de167f... — committed to agoric-labs/cosmos-sdk by mergify[bot] 2 years ago
- Implement rollback command (#11179) (#11314) Closes: #10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch (cherry picked from commit 8296ad959269927e6de167f... — committed to Switcheo/cosmos-sdk by mergify[bot] 2 years ago
- Implement rollback command (#11179) (#11314) Closes: #10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch (cherry picked from commit 8296ad959269927e6de167f16b4d807... — committed to cheqd/cosmos-sdk by mergify[bot] 2 years ago
Oh, no. This is super important for recovering chains from some non-determinism issues. I think it is high priority to finally get this fixed.
So I investigated using the Tendermint rollback command in combination with modified application state to help the Thorchain team recover a halted network.
One thing the Tendermint rollback command doesn’t do is delete the block that was rolled back. This seems to cause the ABCI replay handshake to fail because it tries to apply the same block again.
It’ll be fixed by https://github.com/cosmos/cosmos-sdk/pull/11361
Just an update on the Tendermint side. We’ve merged changes to master and backported them to the respective branches. We’ll most likely release it in v0.34.14 next week and v0.35.0 the following week (no promises 😃). The function you will want to call is this here: https://github.com/tendermint/tendermint/blob/f2a8f5e054cf99ebe246818bb6d71f41f9a30faa/cmd/tendermint/commands/rollback.go#L37
Yup, completely agree that we don’t need to expose this via RPC. The process shouldn’t be running when rollback is called (in fact the database will error when trying to open a second connection).
It’s feasible to have a single command even for multi-process instances because no processes should be running. All you need is the tendermint Config so the command knows where to find the database and it will perform the rollback.
This should not be a publicly available RPC endpoint, and it does not need to be run remotely. We do need to handle the multi-process use case however.
Something like tendermint rollback and then gaiad rollback seems like it covers the multi-process scenario, and the Tendermint side is already implemented. Exposing this anywhere off of localhost seems like a huge security issue.
NB this has been requested by many chains over the last few years. I built a tool back in 2018/2019 that did this, but it was never maintained. AppHash mismatch is a pain to debug currently.
We hit a non-determinism issue somehow related to an ibcclientupdate tx, leading to an AppHash mismatch.
While trying to debug and actually get enough info to make a proper bug report, we were hampered by the lack of any tool to try one block earlier. We are now syncing 510,000 blocks to see if we can reproduce it, which is not so fun.
I can also try to lend a hand with a fix on 0.42 as soon as the Tendermint fix is in 0.34.
Yup, although a minor clarification, Tendermint has its own state that tracks validator sets, consensus params and the app hash. We aren’t actually removing any blocks, only the Tendermint state at that height. When the node restarts it will replay the same transactions in the last block.
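To make the distinction in the comment above concrete, here is a minimal toy model in Go. It is only a sketch: Node, Block, ConsensusState, RollbackOne and BlocksToReplay are hypothetical names invented for illustration and are not Tendermint's types or its RollbackState function. It assumes one consensus-state snapshot per height, and shows that rolling back drops the latest snapshot while the block store stays intact, so the last block is still available for replay on restart.

```go
package main

import "errors"

// Block is a simplified block: a height plus the transactions it carries.
type Block struct {
	Height int64
	Txs    []string
}

// ConsensusState is a toy stand-in for the state Tendermint keeps next to
// the block store (validator sets and consensus params are omitted here).
type ConsensusState struct {
	LastBlockHeight int64
	AppHash         string
}

// Node holds a block store plus one consensus-state snapshot per height.
// This layout is an assumption made for the example.
type Node struct {
	Blocks []Block                  // block store; rollback never touches it
	States map[int64]ConsensusState // consensus state keyed by height
	Latest int64                    // height of the current consensus state
}

// RollbackOne discards the consensus state at the latest height and reverts
// to the snapshot one height earlier. Blocks are left untouched.
func (n *Node) RollbackOne() (int64, error) {
	prev, ok := n.States[n.Latest-1]
	if !ok {
		return 0, errors.New("no consensus state found at previous height")
	}
	delete(n.States, n.Latest)
	n.Latest = prev.LastBlockHeight
	return n.Latest, nil
}

// BlocksToReplay returns the blocks above the current state height, i.e.
// what a restart would have to re-execute against the application.
func (n *Node) BlocksToReplay() []Block {
	var out []Block
	for _, b := range n.Blocks {
		if b.Height > n.Latest {
			out = append(out, b)
		}
	}
	return out
}
```

In this model, a node at height 2 with a bad AppHash that calls RollbackOne ends up with state at height 1 and still has block 2 in its store, which is the block that gets replayed.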
Hard for me to judge how important it is because I haven’t built an app before, but it was requested and it makes sense as a piece of tooling that is required. I think having it in v0.45 is fine. I don’t see why it also couldn’t be backported in the v0.44 range (we will be backporting it into v0.34 Tendermint).
Tendermint will have a public function (called RollbackState - not sure which package it will be in yet) which takes in the tendermint/config.Config as an argument. This means that the app should just be able to call a single command in the Cosmos SDK i.e. simd state rollback and it should rollback the app state to the previous height and then should call this Tendermint function to rollback Tendermint state.
I thought about adding a --height flag so users could rollback to a certain height but I don’t believe it is that necessary. Only the last height is important. Every prior one, consensus had successfully agreed on the same AppHash.
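The single-command flow described above (roll back the app state to the previous height, then roll back the Tendermint state) could be sketched roughly as follows. This is an illustration under stated assumptions, not the SDK's implementation: AppStore, ConsensusStore, Rollback, memApp and memCS are hypothetical names, and the real Tendermint function takes a Config rather than an interface like this. Matching the comment above, there is no height parameter; the sketch only ever goes back one height.

```go
package main

import (
	"errors"
	"fmt"
)

// AppStore is a hypothetical view of the application's versioned state.
type AppStore interface {
	LatestVersion() int64
	RollbackTo(version int64) error
}

// ConsensusStore is a hypothetical wrapper around the Tendermint-side
// rollback; it is not the real Tendermint API.
type ConsensusStore interface {
	Height() int64
	RollbackOne() (int64, error)
}

// Rollback reverts the application to the previous height first, then the
// consensus state, and checks that both ended up at the same height.
func Rollback(app AppStore, cs ConsensusStore) (int64, error) {
	target := cs.Height() - 1
	if target < 1 {
		return 0, errors.New("no previous height to roll back to")
	}
	if err := app.RollbackTo(target); err != nil {
		return 0, fmt.Errorf("app rollback failed: %w", err)
	}
	h, err := cs.RollbackOne()
	if err != nil {
		return 0, fmt.Errorf("consensus rollback failed: %w", err)
	}
	if h != app.LatestVersion() {
		return 0, fmt.Errorf("height mismatch: consensus %d, app %d", h, app.LatestVersion())
	}
	return h, nil
}

// memApp and memCS are in-memory stand-ins used only to exercise Rollback.
type memApp struct{ version int64 }

func (a *memApp) LatestVersion() int64 { return a.version }
func (a *memApp) RollbackTo(v int64) error {
	if v > a.version {
		return errors.New("cannot roll forward")
	}
	a.version = v
	return nil
}

type memCS struct{ height int64 }

func (c *memCS) Height() int64 { return c.height }
func (c *memCS) RollbackOne() (int64, error) {
	c.height--
	return c.height, nil
}
```

With the in-memory stand-ins, calling Rollback(&memApp{version: 10}, &memCS{height: 10}) leaves both stores at height 9. The final height check is a design choice for the sketch: it surfaces a partial rollback instead of letting the node restart with app and consensus state at different heights.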