tendermint: Consensus stuck for 50 minutes trying to finish a block
Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source): 0.26.4
ABCI app (name for built-in, URL for self-written if it’s publicly available): Loom SDK
Environment:
- OS (e.g. from /etc/os-release): Ubuntu 16.04
- Install tools:
- Others:
What happened: Our 4-node staging Tendermint cluster, in a single geographic region, gets stuck trying to finish a round, sometimes for 50 minutes straight. It accepts txs into the mempool but never completes a block. Any ideas how to debug this further, or how to let the cluster just cancel a block when it can’t come to consensus?
See the attached logs from all 4 nodes. At the end of the 50 minutes it finally commits, and the 5000 txs stuck in the mempool clear.
What you expected to happen:
Would expect the round to close within 5-10 seconds if they can’t get agreement from one node.
Have you tried the latest version: no, but the version is only a few days old
How to reproduce it (as minimally and precisely as possible): no consistent way; it seems to have to do with having dozens of users connected, sending a few transactions a second
Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file): Archive.zip Attached log from 4 nodes
Config (you can paste only the changes you’ve made):
replay = false
broadcast = true
create_empty_blocks = false
node command runtime flags:
/dump_consensus_state output for consensus bugs
This was taken an hour after the event; the cluster came back to life with no outside intervention, so I’m not sure it is relevant.
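To capture the consensus state while the cluster is actually stuck, rather than an hour later, a small fetcher along these lines might help. This is only a sketch under assumptions: it assumes the node's RPC is reachable at the default http://localhost:26657 and that the /dump_consensus_state response carries the state under result.round_state. The helper names extractRoundState and fetchRoundState are made up for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"
)

// rpcResponse models only the part of the JSON-RPC envelope used here.
// Assumption: the state lives under result.round_state.
type rpcResponse struct {
	Result struct {
		RoundState json.RawMessage `json:"round_state"`
	} `json:"result"`
}

// extractRoundState pulls the raw round_state object out of a
// /dump_consensus_state response body without interpreting its fields.
func extractRoundState(body []byte) (json.RawMessage, error) {
	var resp rpcResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if len(resp.Result.RoundState) == 0 {
		return nil, errors.New("no round_state in response")
	}
	return resp.Result.RoundState, nil
}

// fetchRoundState queries one node's RPC endpoint with a short timeout.
func fetchRoundState(baseURL string) (json.RawMessage, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	res, err := client.Get(baseURL + "/dump_consensus_state")
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	body, err := io.ReadAll(res.Body)
	if err != nil {
		return nil, err
	}
	return extractRoundState(body)
}

func main() {
	// Assumed address; adjust to each node's RPC listen address.
	rs, err := fetchRoundState("http://localhost:26657")
	if err != nil {
		fmt.Println("could not fetch consensus state:", err)
		return
	}
	fmt.Println(string(rs))
}
```

Running this against each of the 4 nodes during a stall would show whether they disagree on height, round, or step.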
Anything else we need to know:
A few hundred connected clients to the cluster but still low transaction volume, maybe a few transactions a second
About this issue
- State: closed
- Created 6 years ago
- Comments: 24 (14 by maintainers)
Yeah, this is a subtle but important point for a truly healthy mempool. Otherwise, if a user sends the same tx to a node twice, it could get included twice in a block. It sounds like the second one would be considered invalid in the block, but it is better for it not to get there at all, which a more stateful CheckTx would ensure.
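One common way to make CheckTx stateful is to track a per-sender nonce in a separate check state, so a resubmitted tx is rejected before it reaches a block. The sketch below is not the Loom SDK's or Tendermint's actual code: the tx and checkState types, the checkTx method, and the non-zero code value are hypothetical stand-ins. It only assumes the ABCI convention that code 0 means the tx is accepted.

```go
package main

import "fmt"

// Code 0 follows the ABCI convention for "OK"; the non-zero value is an
// arbitrary, hypothetical application-specific code.
const (
	codeOK       uint32 = 0
	codeBadNonce uint32 = 1
)

// tx is a hypothetical transaction carrying a per-sender sequence number.
type tx struct {
	Sender string
	Nonce  uint64
}

// checkState is the state consulted by CheckTx only. A real application
// would reset it to the committed state after each Commit.
type checkState struct {
	nextNonce map[string]uint64
}

// checkTx accepts a tx only if its nonce is the next one expected for the
// sender, then advances the expectation so a replay of the same tx fails.
func (s *checkState) checkTx(t tx) uint32 {
	if t.Nonce != s.nextNonce[t.Sender] {
		return codeBadNonce
	}
	s.nextNonce[t.Sender]++
	return codeOK
}

func main() {
	s := &checkState{nextNonce: make(map[string]uint64)}
	first := tx{Sender: "alice", Nonce: 0}
	fmt.Println(s.checkTx(first))                          // accepted
	fmt.Println(s.checkTx(first))                          // same tx again: rejected
	fmt.Println(s.checkTx(tx{Sender: "alice", Nonce: 1})) // next nonce: accepted
}
```

With a check like this, the duplicate never enters the mempool, so it cannot be proposed in a block in the first place.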
Ok, assuming this is due to some kind of DoS from the mempool overwhelming the consensus, there are a few things that you could try immediately:
- Change Priority: 5 to Priority: 1 in mempool/reactor.go (this will decrease the priority of mempool messages compared to consensus votes, as they are currently the same)
We will merge #2778 soon as well after a bit more testing.
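To see why lowering the mempool priority should help, here is a toy model of priority-weighted channel selection. It assumes the connection sends next on whichever channel has the lowest ratio of recently sent data to priority, and it counts whole packets instead of bytes with no decay, so treat it as an illustration rather than the real p2p scheduler. The channel type and the pickChannel and simulate helpers are hypothetical.

```go
package main

import "fmt"

// channel is a simplified stand-in for a p2p channel with a priority.
type channel struct {
	name         string
	priority     int64
	recentlySent int64
}

// pickChannel returns the channel with the lowest recentlySent/priority
// ratio. Ratios are compared by cross-multiplication to stay in integers;
// on a tie the earlier channel in the slice wins.
func pickChannel(chs []*channel) *channel {
	var best *channel
	for _, ch := range chs {
		if best == nil || ch.recentlySent*best.priority < best.recentlySent*ch.priority {
			best = ch
		}
	}
	return best
}

// simulate sends the given number of packets while both channels always
// have data queued, and reports how many packets each channel got.
func simulate(mempoolPriority int64, packets int) (votes, mempool int) {
	vote := &channel{name: "vote", priority: 5}
	mem := &channel{name: "mempool", priority: mempoolPriority}
	chs := []*channel{vote, mem}
	for i := 0; i < packets; i++ {
		ch := pickChannel(chs)
		ch.recentlySent++
		if ch == vote {
			votes++
		} else {
			mempool++
		}
	}
	return votes, mempool
}

func main() {
	for _, p := range []int64{5, 1} {
		v, m := simulate(p, 60)
		fmt.Printf("mempool priority %d: %d vote packets, %d mempool packets\n", p, v, m)
	}
}
```

In this model, equal priorities split a saturated link evenly between votes and mempool traffic, while dropping the mempool to priority 1 gives votes five times as many packets.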
Sorry, just saw this. With a combination of 0.27.4 and tweaks to config.toml, I feel confident this is resolved.