cosmos-sdk: Low level db error can cause app-hash mismatch

Summary of Bug

While delivering a tx, errors in the low-level DB (for example, “too many open files”) are turned into panics by tm-db. The panic is recovered by the tx runner and treated as a failed tx execution result, which leaves the state inconsistent and manifests as an app-hash mismatch error in the next block. It may be better to stop processing the current block in such cases.

Edit: A new case that we think may be related to this: https://github.com/cosmos/cosmos-sdk/issues/12012#issuecomment-1308209563
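To make the failure mode above concrete, here is a minimal, hypothetical Go sketch (not the actual cosmos-sdk or tm-db code; runTx and kvStoreSet are placeholder names) showing how a recover-all tx runner can turn a low-level DB failure into an ordinary "failed tx" result:

```go
package main

import (
	"errors"
	"fmt"
)

// kvStoreSet stands in for a store write backed by tm-db; tm-db panics on
// low-level DB errors instead of returning them.
func kvStoreSet(key, value []byte) {
	// Simulate the underlying DB hitting "too many open files".
	panic(errors.New("open application.db/000123.ldb: too many open files"))
}

// runTx mimics the tx runner: every panic is recovered and reported as a
// failed tx, so the node keeps processing the block with inconsistent state.
func runTx(tx []byte) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// The infrastructure failure is indistinguishable here from a
			// legitimate tx error, which is the core problem in this issue.
			err = fmt.Errorf("tx failed: %v", r)
		}
	}()
	kvStoreSet([]byte("key"), tx)
	return nil
}

func main() {
	if err := runTx([]byte("value")); err != nil {
		// The block keeps going; the app hash diverges at the next block.
		fmt.Println("recovered as failed tx:", err)
	}
}
```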

Version

Steps to Reproduce


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 32 (9 by maintainers)

Commits related to this issue

Most upvoted comments

We’ll need to think carefully about how to handle various DB-related errors when executing txs – in the meantime, I would advise bumping the file limit on your system, as others have pointed out @Orion-9R 😃

Keen to find a solution – it’s a burden on our ecosystem to have to kick full nodes regularly. The default is 1000000 for nodes and they are nowhere close to that, so I think there is something else going on.

There could also be some genuinely non-deterministic logic; you can try to investigate the execution result of the block that produces the mismatched app hash.

Maybe not a bug, just an improvement, since the app-hash mismatch situation is very hard to recover from.

I’m not sure if there are other low-level errors that could trigger the same issue.

A panic in the ABCI event handlers will cause Tendermint to output a CONSENSUS FAILURE log and stop the consensus state machine.

This code is exported, so it can be called by any consumer, not just ABCI event handlers. Panics that traverse exported API boundaries like this one have undefined behavior: callers can intercept them before they reach upper layers, so you can’t assume these things. If an implementation of KVStore.Set hits an error, logging that error and continuing will put the store in an invalid state. Can’t do that. Panicking is preferable, though even that doesn’t give you reliable guarantees.
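A minimal sketch of that trade-off, assuming a KVStore-like wrapper over a DB whose Set can fail (the names here are illustrative, not the real tm-db API):

```go
package main

import (
	"fmt"
	"log"
)

type db interface {
	Set(key, value []byte) error
}

type failingDB struct{}

func (failingDB) Set(key, value []byte) error {
	return fmt.Errorf("too many open files")
}

// setAndLog swallows the error: the write is silently lost, the store is now
// in an invalid state, and later reads and the computed app hash diverge.
func setAndLog(d db, key, value []byte) {
	if err := d.Set(key, value); err != nil {
		log.Printf("store write failed: %v", err) // corruption goes unnoticed
	}
}

// setOrPanic surfaces the failure immediately; the caller (ideally the node,
// not the per-tx recovery) must decide to halt rather than continue.
func setOrPanic(d db, key, value []byte) {
	if err := d.Set(key, value); err != nil {
		panic(err)
	}
}

func main() {
	setAndLog(failingDB{}, []byte("k"), []byte("v"))  // continues with bad state
	setOrPanic(failingDB{}, []byte("k"), []byte("v")) // stops execution here
}
```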


Should we change tm-db’s APIs to return errors, and should those errors abort the node?
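One possible direction for that question, sketched with hypothetical names (not an agreed-upon design): wrap low-level DB failures in a dedicated type so the tx-level recovery can re-panic on them and halt the node instead of reporting a failed tx.

```go
package main

import (
	"errors"
	"fmt"
)

// storeFatalError marks failures of the storage layer itself (disk full,
// too many open files, corruption), as opposed to errors caused by the tx.
type storeFatalError struct{ err error }

func (e storeFatalError) Error() string { return "fatal store error: " + e.err.Error() }

func kvStoreSet(key, value []byte) {
	panic(storeFatalError{errors.New("too many open files")})
}

func runTx(tx []byte) (err error) {
	defer func() {
		if r := recover(); r != nil {
			if fatal, ok := r.(storeFatalError); ok {
				// Not a tx problem: propagate so the node stops this block
				// instead of committing an inconsistent state.
				panic(fatal)
			}
			err = fmt.Errorf("tx failed: %v", r)
		}
	}()
	kvStoreSet([]byte("key"), tx)
	return nil
}

func main() {
	fmt.Println(runTx([]byte("value"))) // panics with the fatal store error
}
```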

Hey guys, I would not consider this a bug.

Instead, I would include directions on setting the open-files limit. In most Linux distributions the command to set it is:

ulimit -n 500000

That will set it for the length of the session.

It can also be set in systemd units and in the system file /etc/security/limits.conf, as in the sketch below.
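For reference, the equivalent persistent settings would look roughly like this (the value mirrors the ulimit example above; adjust paths and unit names to your setup):

```
# systemd unit (e.g. the node's .service file), in the [Service] section:
LimitNOFILE=500000

# /etc/security/limits.conf:
*  soft  nofile  500000
*  hard  nofile  500000
```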