cosmos-sdk: Low level db error can cause app-hash mismatch

Summary of Bug

While delivering a tx, errors in the low-level DB (for example, “too many open files”) are turned into panics by tm-db. The panic is recovered by the tx runner and treated as a failed tx execution result, which leaves the state inconsistent and manifests as an app-hash mismatch error in the next block. It may be better to stop processing the current block in such cases.

Edit: A new case that we think may be related to this: https://github.com/cosmos/cosmos-sdk/issues/12012#issuecomment-1308209563
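To make the failure mode above concrete, here is a minimal, hypothetical Go sketch (not the actual cosmos-sdk or tm-db code; runTx and kvStoreSet are placeholder names) showing how a recover-all tx runner can turn a low-level DB failure into an ordinary "failed tx" result:

```go
package main

import (
	"errors"
	"fmt"
)

// kvStoreSet stands in for a store write backed by tm-db; tm-db panics on
// low-level DB errors instead of returning them.
func kvStoreSet(key, value []byte) {
	// Simulate the underlying DB hitting "too many open files".
	panic(errors.New("open application.db/000123.ldb: too many open files"))
}

// runTx mimics the tx runner: every panic is recovered and reported as a
// failed tx, so the node keeps processing the block with inconsistent state.
func runTx(tx []byte) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// The infrastructure failure is indistinguishable here from a
			// legitimate tx error, which is the core problem in this issue.
			err = fmt.Errorf("tx failed: %v", r)
		}
	}()
	kvStoreSet([]byte("key"), tx)
	return nil
}

func main() {
	if err := runTx([]byte("value")); err != nil {
		// The block keeps going; the app hash diverges at the next block.
		fmt.Println("recovered as failed tx:", err)
	}
}
```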

Version

Steps to Reproduce


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 32 (9 by maintainers)

Commits related to this issue

Most upvoted comments

We’ll need to think carefully about how to handle various DB-related errors when executing txs – in the meantime, I would advise bumping the file limit on your system, as others have pointed out @Orion-9R 😃

Keen to find a solution – it’s a burden on our ecosystem to have to kick full nodes regularly. The default is 1000000 for nodes and they are nowhere close to that, so I think there is something else going on.

There could also be some genuinely non-deterministic logic; you can try to investigate the execution result of the block that produces the mismatched app hash.

Maybe not a bug, just an improvement, since the app-hash mismatch situation is very hard to recover from.

I’m not sure if there are other low-level errors that could trigger the same issue.

A panic in the ABCI event handlers will cause Tendermint to output a CONSENSUS FAILURE log and stop the consensus state machine.

This code is exported, so it can be called by any consumer, not just ABCI event handlers. Panics that traverse exported API boundaries like this one have undefined behavior: callers can intercept them before they reach upper layers, so you can’t assume these things. If an implementation of KVStore.Set hits an error, logging that error and continuing will put the store in an invalid state. Can’t do that. Panicking is preferable, though even that doesn’t give you reliable guarantees.
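A minimal sketch of that trade-off, assuming a KVStore-like wrapper over a DB whose Set can fail (the names here are illustrative, not the real tm-db API):

```go
package main

import (
	"fmt"
	"log"
)

type db interface {
	Set(key, value []byte) error
}

type failingDB struct{}

func (failingDB) Set(key, value []byte) error {
	return fmt.Errorf("too many open files")
}

// setAndLog swallows the error: the write is silently lost, the store is now
// in an invalid state, and later reads and the computed app hash diverge.
func setAndLog(d db, key, value []byte) {
	if err := d.Set(key, value); err != nil {
		log.Printf("store write failed: %v", err) // corruption goes unnoticed
	}
}

// setOrPanic surfaces the failure immediately; the caller (ideally the node,
// not the per-tx recovery) must decide to halt rather than continue.
func setOrPanic(d db, key, value []byte) {
	if err := d.Set(key, value); err != nil {
		panic(err)
	}
}

func main() {
	setAndLog(failingDB{}, []byte("k"), []byte("v"))  // continues with bad state
	setOrPanic(failingDB{}, []byte("k"), []byte("v")) // stops execution here
}
```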


Should we change tm-db’s APIs to return errors, and should those errors abort the node?
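One possible direction for that question, sketched with hypothetical names (not an agreed-upon design): wrap low-level DB failures in a dedicated type so the tx-level recovery can re-panic on them and halt the node instead of reporting a failed tx.

```go
package main

import (
	"errors"
	"fmt"
)

// storeFatalError marks failures of the storage layer itself (disk full,
// too many open files, corruption), as opposed to errors caused by the tx.
type storeFatalError struct{ err error }

func (e storeFatalError) Error() string { return "fatal store error: " + e.err.Error() }

func kvStoreSet(key, value []byte) {
	panic(storeFatalError{errors.New("too many open files")})
}

func runTx(tx []byte) (err error) {
	defer func() {
		if r := recover(); r != nil {
			if fatal, ok := r.(storeFatalError); ok {
				// Not a tx problem: propagate so the node stops this block
				// instead of committing an inconsistent state.
				panic(fatal)
			}
			err = fmt.Errorf("tx failed: %v", r)
		}
	}()
	kvStoreSet([]byte("key"), tx)
	return nil
}

func main() {
	fmt.Println(runTx([]byte("value"))) // panics with the fatal store error
}
```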

Hey guys, I would not consider this a bug.

Instead, I would include directions on setting the open-files limit. In most Linux distributions the command to set it is:

ulimit -n 500000

That will set it for the length of the session.

It can also be set in systemd units and in the system file /etc/security/limits.conf, as in the sketch below.
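For reference, the equivalent persistent settings would look roughly like this (the value mirrors the ulimit example above; adjust paths and unit names to your setup):

```
# systemd unit (e.g. the node's .service file), in the [Service] section:
LimitNOFILE=500000

# /etc/security/limits.conf:
*  soft  nofile  500000
*  hard  nofile  500000
```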