tendermint: Tendermint became unresponsive after some load
Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source): v0.25.0
ABCI app (name for built-in, URL for self-written if it’s publicly available): https://github.com/MinterTeam/minter-go-node
Environment:
- OS (e.g. from /etc/os-release): bug is platform agnostic. Tested on MacOS 10.14, Debian 4.9.88-1+deb9u1, Ubuntu 18.04.1 LTS
- Install tools: -
- Others: -
What happened: After some time under load, nodes stop responding. http://localhost:26657/status and some other RPC endpoints become unavailable: requests still time out after more than 60 seconds. It seems that ConsensusState’s (or ConsensusReactor’s) mutex is deadlocked. Restarting the node solves the problem.
The strange thing is that this bug happens after a block is committed and before the next block is even started (no BeginBlock call to the application).
What you expected to happen: Tendermint should be working normally.
Have you tried the latest version: yes
How to reproduce it (as minimally and precisely as possible): Download the Minter node, launch it and let it synchronize. Then send some transactions (50 txs in a block is sufficient).
Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file): There are no errors or warnings in the logs, except that the node starts to lose connections to other nodes after consensus stops.
Config (you can paste only the changes you’ve made): Default config
node command runtime flags: -
/dump_consensus_state output for consensus bugs: dump_consensus_state is unavailable, the request times out 😦
Anything else we need to know: Tendermint is running in in-process mode, and there are local RPC calls to Tendermint. The bug happens only under load. I debugged our BeginBlock, EndBlock, Commit and DeliverTx implementations; they finish normally just before the bug occurs, so it happens somewhere else. Also, we used v0.23.0 in our testnet before and everything was fine.
About this issue
- State: closed
- Created 6 years ago
- Comments: 93 (93 by maintainers)
Commits related to this issue
- use READ lock/unlock in ConsensusState#GetLastHeight Refs #2721 — committed to tendermint/tendermint by melekes 6 years ago
- fix peer formatting (output its address instead of the pointer) ``` [54310]: E[11-02|11:59:39.851] Connection failed @ sendRoutine module=p2p peer=0xb78f00 conn=MConn{74.207.236.148:2665... — committed to tendermint/tendermint by melekes 6 years ago
- panic if peer has no state https://github.com/tendermint/tendermint/issues/2721#issuecomment-435347165 It's confusing that sometimes we check if peer has a state, but most of the times we expect it ... — committed to tendermint/tendermint by melekes 6 years ago
- abci/localclient: extend lock on app callback App callback should be protected by lock as well (note this was already done for InitChainAsync, why not for others???). Otherwise, when we execute the b... — committed to tendermint/tendermint by melekes 6 years ago
- abci: localClient improvements & bugfixes & pubsub Unsubscribe issues (#2748) * use READ lock/unlock in ConsensusState#GetLastHeight Refs #2721 * do not use defers when there's no need * fix... — committed to tendermint/tendermint by melekes 6 years ago
- abci: localClient improvements & bugfixes & pubsub Unsubscribe issues (#2748) * use READ lock/unlock in ConsensusState#GetLastHeight Refs #2721 * do not use defers when there's no need * fix... — committed to kfangw/blockchain by melekes 6 years ago
- abci: localClient improvements & bugfixes & pubsub Unsubscribe issues (#2748) * use READ lock/unlock in ConsensusState#GetLastHeight Refs #2721 * do not use defers when there's no need * fix... — committed to daotl/go-acei by melekes 6 years ago
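To illustrate the first commit in the list above ("use READ lock/unlock in ConsensusState#GetLastHeight"): a getter that only reads state can take the read side of a `sync.RWMutex`, so concurrent readers such as RPC handlers do not serialize behind each other. This is a rough sketch only. The `consensusState` type below is a hypothetical, heavily simplified stand-in and not Tendermint's actual ConsensusState.

```go
package main

import (
	"fmt"
	"sync"
)

// consensusState is a hypothetical, simplified stand-in for illustration.
type consensusState struct {
	mtx    sync.RWMutex
	height int64 // height of the block currently being worked on
}

// GetLastHeight returns the last committed height. It only reads state,
// so a read lock is enough and many callers can hold it at the same time.
func (cs *consensusState) GetLastHeight() int64 {
	cs.mtx.RLock()
	h := cs.height - 1
	cs.mtx.RUnlock()
	return h
}

// advance moves to the next height. It mutates state, so it needs the
// exclusive write lock.
func (cs *consensusState) advance() {
	cs.mtx.Lock()
	cs.height++
	cs.mtx.Unlock()
}

func main() {
	cs := &consensusState{height: 1}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = cs.GetLastHeight() // readers may run concurrently
		}()
	}
	cs.advance()
	wg.Wait()

	fmt.Println(cs.GetLastHeight()) // prints 1
}
```

Note that a read lock reduces contention between readers but does not by itself remove a deadlock caused by a writer that never releases the lock.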
Seems that issue was resolved! Thank you!
There are some interesting errors reported by our community. Don’t know if they are related.
PR with a fix/feature was merged to develop. Will be shipped with 0.26.2 release (check the changelog).
https://github.com/tendermint/tendermint/pull/2748#pullrequestreview-173704860 is totally correct.
I think that we all agree that we can close this issue.
Anyway, we still have an issue with `BroadcastTxCommit` (which you mentioned in the 3rd bullet of the PR review). But I think that it is better to open a new issue.
Thank you again!
@ebuchman, that’s totally fine for us to switch to v0.26. We have plans to launch new testnet next week, so we will upgrade to v0.26.1 in new release of our project. Thank you!
The last thing to solve here: in `/broadcast_commit` we use pubsub to wait for the transaction result. When we have too many subscribers (say 10000), the time to propagate a msg to the subscriber starts to grow. And more importantly, the time to publish a msg starts to grow (because pubsub is a queue and publishing is synchronous) => consensus slows down! This will be addressed in pubsub 2.0: https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-033-pubsub.md

Good news! The deal with `/broadcast_commit` is that it’s using pubsub to subscribe for every transaction result. And there’s an implicit limit of how many subscribers & events pubsub can handle. Note it’s better to use sync/async methods if you don’t need 100% confirmation. We can think about how to improve things, but I’d prefer to create a separate issue for that.

And I’ll ask to confirm that there’s no “freeze” anymore. Thank you for swift answers!
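To make the pubsub point above concrete, here is a toy model of why synchronous publishing gets slower as subscribers pile up. It is an assumption-laden sketch: `toyBus` is a hypothetical type and NOT Tendermint's pubsub implementation. It only shows that when the publisher itself hands the message to every subscriber in turn, the publisher's cost grows with the subscriber count, and a subscriber with a full channel stalls it entirely.

```go
package main

import (
	"fmt"
	"time"
)

// toyBus is a hypothetical synchronous pubsub used for illustration only.
// It is not safe for concurrent use and is not Tendermint's pubsub.
type toyBus struct {
	subs []chan string
}

// Subscribe registers a subscriber whose channel buffers up to capacity
// messages and returns the receive side of that channel.
func (b *toyBus) Subscribe(capacity int) <-chan string {
	ch := make(chan string, capacity)
	b.subs = append(b.subs, ch)
	return ch
}

// Publish hands msg to every subscriber in turn and returns how long that
// took. The caller is blocked for the whole loop, and it blocks
// indefinitely if any subscriber's channel is full.
func (b *toyBus) Publish(msg string) time.Duration {
	start := time.Now()
	for _, ch := range b.subs {
		ch <- msg
	}
	return time.Since(start)
}

func main() {
	for _, n := range []int{10, 10000} {
		bus := &toyBus{}
		for i := 0; i < n; i++ {
			bus.Subscribe(1)
		}
		fmt.Printf("%d subscribers: publish took %v\n", n, bus.Publish("tx result"))
	}
}
```

In this model, if the publisher is the consensus goroutine, every extra subscriber adds to the time spent outside consensus, which matches the slowdown described above.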
I am rushing a bit as always. It was incomplete. https://github.com/tendermint/tendermint/pull/2748/commits/f4f310903bea82c28e7320c42456922199f6c7d1 should work. How do I upgrade TM version? I’ve tried pointing dep to my branch, but then other nodes are on old version, so everything breaks.
@danil-lashin started some work here