tendermint: Tendermint became unresponsive after some load

Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source): v0.25.0

ABCI app (name for built-in, URL for self-written if it’s publicly available): https://github.com/MinterTeam/minter-go-node

Environment:

  • OS (e.g. from /etc/os-release): bug is platform agnostic. Tested on MacOS 10.14, Debian 4.9.88-1+deb9u1, Ubuntu 18.04.1 LTS
  • Install tools: -
  • Others: -

What happened: After some time under load, nodes stop responding. http://localhost:26657/status and some other RPC endpoints become unavailable, timing out after more than 60 seconds. It seems that ConsensusState’s (or ConsensusReactor’s) mutex is deadlocked. Restarting the node solves the problem.

The strange thing is that this bug occurs after a block is committed and before the next block has even started (no BeginBlock call to the application).

What you expected to happen: Tendermint should be working normally.

Have you tried the latest version: yes

How to reproduce it (as minimally and precisely as possible): Download the Minter node, launch it, and let it synchronize. Then send some transactions (50 txs in a block is sufficient).

Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file): There are no errors or warnings in the logs, except that the node starts to lose connections to other nodes after consensus stops.

Config (you can paste only the changes you’ve made): Default config

node command runtime flags: -

/dump_consensus_state output for consensus bugs: dump_consensus_state is unavailable (the request times out) 😦

Anything else we need to know: Tendermint is running in in-process mode, and the application makes local RPC calls to Tendermint. The bug happens only under load. I debugged our BeginBlock, EndBlock, Commit, and DeliverTx implementations; they all finish normally just before the bug occurs, so the hang happens somewhere else. Also, we used v0.23.0 in our testnet before and everything was fine.
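Because Tendermint runs in-process here, one way to check the suspected mutex deadlock is to dump all goroutine stacks from the host application once the RPC stops answering. The sketch below uses only the Go standard library; the helper name goroutineDump is hypothetical, and how it gets triggered (a signal handler, a debug endpoint) is left to the application.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// goroutineDump (hypothetical helper) returns the stack traces of every
// goroutine in the current process, including any Tendermint goroutines
// that are blocked waiting on a mutex.
func goroutineDump() string {
	var buf bytes.Buffer
	// debug=2 prints stacks in the same format as an unrecovered panic,
	// which shows what each goroutine is blocked on.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "failed to write goroutine profile: " + err.Error()
	}
	return buf.String()
}

func main() {
	// In a real node this would be called when /status stops responding.
	fmt.Print(goroutineDump())
}
```

Goroutines that stay parked on a lock call across several dumps are the likely candidates for the deadlock.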

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 93 (93 by maintainers)

Most upvoted comments

It seems that the issue was resolved! Thank you!

There are some interesting errors reported by our community. Don’t know if they are related.

Dec 12 14:08:03 sentry4 minter[4352]: E[12126-12-12|14:08:03.017] Stopping peer for error                      module=p2p peer="Peer{MConn{144.76.140.208:14796} 83a56da94cd26aec30d3323e8bcae6ead18cf83d in}" err="Read overflow, maxSize is 1047 but this amino binary object is 6589 bytes."
Dec 12 13:48:30 sentry4 minter[4352]: E[12126-12-12|13:48:30.191] Error attempting to add vote                 module=consensus err="Expected 35373/1/2, but got 35373/0/2: Unexpected step"

PR with a fix/feature was merged to develop. Will be shipped with 0.26.2 release (check the changelog).

https://github.com/tendermint/tendermint/pull/2748#pullrequestreview-173704860 is totally correct.

I think that we all agree that we can close this issue.

Anyway, we still have an issue with BroadcastTxCommit (which you mentioned in the 3rd bullet of the PR review). But I think it is better to open a new issue for that.

Thank you again!

@ebuchman, it’s totally fine for us to switch to v0.26. We plan to launch a new testnet next week, so we will upgrade to v0.26.1 in the next release of our project. Thank you!

The last thing to solve here: in /broadcast_tx_commit we use pubsub to wait for the transaction result. When we have too many subscribers (say 10000), the time to propagate a msg to each subscriber starts to grow. More importantly, the time to publish a msg also starts to grow (because pubsub is a queue and publishing is synchronous) => consensus slows down! This will be addressed in pubsub 2.0 https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-033-pubsub.md.
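To illustrate why synchronous publishing couples the publisher's speed to the number of subscribers, here is a toy hub in Go. It is a simplified sketch and not Tendermint's pubsub implementation; the Bus type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Bus is a toy synchronous publish/subscribe hub (hypothetical, for illustration only).
type Bus struct {
	mu   sync.RWMutex
	subs []chan string
}

// NewBus returns an empty hub.
func NewBus() *Bus { return &Bus{} }

// Subscribe registers a subscriber with the given channel buffer size.
func (b *Bus) Subscribe(buf int) <-chan string {
	ch := make(chan string, buf)
	b.mu.Lock()
	b.subs = append(b.subs, ch)
	b.mu.Unlock()
	return ch
}

// Publish delivers msg to every subscriber before returning and reports
// how many deliveries it performed. The caller pays for all of them, and
// a subscriber with a full buffer blocks the whole call.
func (b *Bus) Publish(msg string) int {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subs {
		ch <- msg
	}
	return len(b.subs)
}

func main() {
	bus := NewBus()
	const n = 10000 // one subscriber per pending transaction
	chans := make([]<-chan string, 0, n)
	for i := 0; i < n; i++ {
		chans = append(chans, bus.Subscribe(1))
	}
	delivered := bus.Publish("tx committed")
	fmt.Println("deliveries performed by one Publish call:", delivered)
	fmt.Println("first subscriber received:", <-chans[0])
}
```

In this sketch a single Publish call does work proportional to the subscriber count, so whoever publishes (here standing in for the consensus side) slows down as subscriptions accumulate.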

Good news! The deal with /broadcast_tx_commit is that it uses pubsub to subscribe to every transaction result, and there’s an implicit limit on how many subscribers & events pubsub can handle. Note that it’s better to use the sync/async methods if you don’t need 100% confirmation. We can think about how to improve things, but I’d prefer to create a separate issue for that.
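As a rough sketch of that advice, a client can pick the broadcast method per request. The helper below only builds the RPC URL for one of the three broadcast endpoints; the function name broadcastURL is hypothetical, the base address is the one from the report, and passing the transaction as a 0x-prefixed hex query parameter is an assumption about the URI form of the RPC that should be checked against the docs for the version in use.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// broadcastURL (hypothetical helper) builds a URI-style RPC request for
// one of Tendermint's broadcast endpoints. Assumption: the tx is passed
// as a 0x-prefixed hex string in the "tx" query parameter.
func broadcastURL(base, method string, tx []byte) (string, error) {
	switch method {
	case "broadcast_tx_async", "broadcast_tx_sync", "broadcast_tx_commit":
		// known broadcast methods
	default:
		return "", fmt.Errorf("unknown broadcast method %q", method)
	}
	return fmt.Sprintf("%s/%s?tx=0x%s", base, method, hex.EncodeToString(tx)), nil
}

func main() {
	// broadcast_tx_sync does not wait for the block to be committed,
	// so it needs no per-transaction pubsub subscription.
	url, err := broadcastURL("http://localhost:26657", "broadcast_tx_sync", []byte{0x01, 0xab})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(url)
}
```

A caller that later needs the final result can look the transaction up by hash instead of holding a subscription open while it waits.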

And I’ll ask to confirm that there’s no “freeze” anymore. Thank you for swift answers!

I am rushing a bit as always. It was incomplete. https://github.com/tendermint/tendermint/pull/2748/commits/f4f310903bea82c28e7320c42456922199f6c7d1 should work. How do I upgrade TM version? I’ve tried pointing dep to my branch, but then other nodes are on old version, so everything breaks.

@danil-lashin started some work here