cosmos-sdk: Low level db error can cause app-hash mismatch
Summary of Bug
During delivering tx, when there are errors happening in low-level DB, for example, “too many opened files”, tm-db turns them into panic which is recovered by the tx runner and treated as a failed tx execution result, which results in state inconsistency and manifested as an app-hash mismatch error in the next block. Maybe it’s better to stop processing the current block in such cases.
Edit: A new case that we think may be related to this: https://github.com/cosmos/cosmos-sdk/issues/12012#issuecomment-1308209563
Version
Steps to Reproduce
For Admin Use
- Not duplicate issue
- Appropriate labels applied
- Appropriate contributors tagged
- Contributor assigned/self-assigned
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 4
- Comments: 32 (9 by maintainers)
Commits related to this issue
- add error log when iavl set failed Ref: #12012 — committed to yihuang/cosmos-sdk by yihuang 2 years ago
- refactor: add error log when iavl set failed (#13803) * add error log when iavl set failed Ref: #12012 * Update CHANGELOG.md * play safe — committed to cosmos/cosmos-sdk by yihuang 2 years ago
- refactor: add error log when iavl set failed (#13803) * add error log when iavl set failed Ref: #12012 * Update CHANGELOG.md * play safe (cherry picked from commit 22f326128559065b33606184bddcde9... — committed to cosmos/cosmos-sdk by yihuang 2 years ago
- refactor: add error log when iavl set failed (backport #13803) (#13804) * refactor: add error log when iavl set failed (#13803) * add error log when iavl set failed Ref: #12012 * Update CHAN... — committed to cosmos/cosmos-sdk by mergify[bot] 2 years ago
- Bring in changes from Cosmos-SDK v0.46.6 (#367) * docs: add ApplicationQueryService release notes (#13587) * docs: add ApplicationQueryService release notes * updates Co-authored-by: Julien ... — committed to provenance-io/cosmos-sdk by SpicyLemon 2 years ago
- Bring in changes from Cosmos-SDK v0.46.6 (#367) * docs: add ApplicationQueryService release notes (#13587) * docs: add ApplicationQueryService release notes * updates Co-authored-by: Julien Robert... — committed to provenance-io/cosmos-sdk by SpicyLemon 2 years ago
- Bring in v0.46.7 updates. (#402) * docs: add ApplicationQueryService release notes (#13587) * docs: add ApplicationQueryService release notes * updates Co-authored-by: Julien Robert <julien@... — committed to provenance-io/cosmos-sdk by SpicyLemon 2 years ago
- Bring in v0.46.7 updates. (#402) * docs: add ApplicationQueryService release notes (#13587) * docs: add ApplicationQueryService release notes * updates Co-authored-by: Julien Robert <julien@rbrt.f... — committed to provenance-io/cosmos-sdk by SpicyLemon 2 years ago
- feat: Adds ADR 038 - go plugin system (#404) * Revert "Merge pull request #270 from provenance-io/llama/add-fee-denom-change-proposal" This reverts commit 7462a84614cea9079d543949af566cd415916dfa,... — committed to provenance-io/cosmos-sdk by egaxhaj a year ago
there could also be some genuine undeterministic logic, you can try to investigate the execution result of the block that produce the mismatched app hash.
Maybe not a bug, just an improvement, since the app-hash mismatch situation is much hard to recover from.
I’m not sure if there are other low level errors which could trigger the same issue.
This code is exported, so it can be called by any consumer, not just ABCI event handlers. And panics that traverse exported API boundaries like this one have undefined behavior, callers can intercept them before they reach upper layers. So basically you can’t assume these things. If an implementation of KVStore.Set has an error, logging that error and continuing will put the store in an invalid state. Can’t do that. Panicking is preferable, though even that doesn’t get you reliable guarantees.
We’ll need think carefully for how to handle various DB-related errors when executing txs – in the meantime, I would advise bumping the file limit on your system as others have pointed out @Orion-9R 😃
Should we change tm-db’s apis to return error, and those errors should abort the node?
Hey guys I would not consider this a bug.
Instead, I would include directions on setting the number of open files value. In most Linux distributions the command to set them is this:
ulimit -n 500000
That will set it for the length of the session.
It can also be set in systemd units
and in the system file /etc/security/limits.conf