tendermint: Consensus stuck for 50 minutes trying to finish a block

Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source): 0.26.4

ABCI app (name for built-in, URL for self-written if it’s publicly available): Loom SDK

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Install tools:
  • Others:

What happened: Our 4-node staging Tendermint cluster, all in a single geographic region, gets stuck trying to finish a round, sometimes for 50 minutes straight; it accepts txs into the mempool but never completes a block. Any ideas on how to debug this further, or how to let the cluster just abandon a block when it can’t come to consensus?

See the attached logs from all 4 nodes. At the end of the 50 minutes it finally commits, and the 5000 txs stuck in the mempool clear.

What you expected to happen:

We would expect the round to close within 5-10 seconds, even if the other nodes can’t get agreement from one node.

Have you tried the latest version: no, the version we are running is only a few days old

How to reproduce it (as minimally and precisely as possible): no consistent way to reproduce; it seems to be related to having dozens of users connected and a few transactions per second

Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file): Archive.zip - attached logs from all 4 nodes

Config (you can paste only the changes you’ve made):

replay = false
broadcast = true
create_empty_blocks = false

node command runtime flags:

/dump_consensus_state output for consensus bugs: This was captured an hour after the event; the cluster came back to life with no outside intervention, so I’m not sure how relevant it is.

consensus_dump.log

Anything else we need to know:

A few hundred clients are connected to the cluster, but transaction volume is still low, maybe a few transactions per second.

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 24 (14 by maintainers)

Most upvoted comments

Oh, we wouldn’t stop multiple txs at the same height in CheckTx; I thought only DeliverTx mutated state to increment the nonce. OK, we will fix that.

Yeah, this is a subtle point but important for a truly healthy mempool. Otherwise, if a user sends the same tx to a node twice, it could get included twice in a block. It sounds like the second copy would be considered invalid within the block, but it’s better for it not to get there at all, by implementing a more stateful CheckTx.
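For illustration, here is a minimal sketch of that kind of stateful, CheckTx-side replay protection, assuming each signed tx carries a (sender, nonce) pair. The type and field names below are hypothetical, not the Loom SDK or Tendermint API.

```go
// Sketch: nonce-based replay protection applied during the app's CheckTx,
// so a duplicate tx is rejected before it ever enters the mempool.
package app

import (
	"errors"
	"sync"
)

// parsedTx is a stand-in for whatever the app decodes a raw tx into.
type parsedTx struct {
	Sender string
	Nonce  uint64
}

// nonceTracker remembers, per sender, the next nonce it expects to see,
// including nonces "reserved" by txs that passed CheckTx but are still
// sitting in the mempool awaiting inclusion in a block.
type nonceTracker struct {
	mu      sync.Mutex
	pending map[string]uint64 // sender -> next expected nonce
}

func newNonceTracker() *nonceTracker {
	return &nonceTracker{pending: make(map[string]uint64)}
}

// check is called from CheckTx: it rejects a tx whose nonce has already been
// seen, so the same tx (or a duplicate submission) cannot be accepted twice.
func (nt *nonceTracker) check(tx parsedTx, committedNonce uint64) error {
	nt.mu.Lock()
	defer nt.mu.Unlock()

	expected, ok := nt.pending[tx.Sender]
	if !ok {
		// Nothing pending for this sender: fall back to committed state.
		expected = committedNonce
	}
	if tx.Nonce != expected {
		return errors.New("invalid or replayed nonce")
	}
	// Reserve the nonce so a second copy of the same tx fails CheckTx.
	nt.pending[tx.Sender] = expected + 1
	return nil
}

// reset is called after a block commits, so the tracker re-syncs with
// committed state before the mempool re-checks the remaining txs.
func (nt *nonceTracker) reset() {
	nt.mu.Lock()
	defer nt.mu.Unlock()
	nt.pending = make(map[string]uint64)
}
```

With recheck enabled in the mempool config, Tendermint re-runs CheckTx on the txs left in the mempool after each commit, so clearing the pending map and falling back to committed nonces keeps the tracker from drifting.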

OK, assuming this is due to some kind of DoS from the mempool overwhelming the consensus, there are a few things you could try immediately:

  • apply the fix from #3036 (this is now on the latest develop; let us know if you need it back-ported, though depending on your situation you may be able to upgrade to the v0.27 series without restarting chains - see UPGRADING.md)
  • change Priority: 5 to Priority: 1 in mempool/reactor.go (this decreases the priority of mempool messages relative to consensus votes, which currently share the same priority; see the sketch after this list)
  • implement replay protection in the app so the same tx cannot be valid more than once (this should take more pressure off the mempool)
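For reference, the Priority tweak is roughly a one-line change where the mempool reactor declares its p2p channel. This is a sketch of how that spot looked in the 0.26/0.27-era mempool/reactor.go; the exact shape may differ in your checkout.

```go
// mempool/reactor.go (sketch): the reactor advertises its p2p channel with a
// priority; lowering it from 5 to 1 lets consensus votes win when bandwidth
// is scarce, instead of competing equally with mempool gossip.
func (memR *MempoolReactor) GetChannels() []*p2p.ChannelDescriptor {
	return []*p2p.ChannelDescriptor{
		{
			ID:       MempoolChannel,
			Priority: 1, // was 5
		},
	}
}
```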

We will merge #2778 soon as well, after a bit more testing.

Sorry, just saw this. With a combination of 0.27.4 and tweaks to config.toml, I feel confident this is resolved.