tendermint: Transactions are stuck in mempool and not getting propagated to other nodes

BUG REPORT

Tendermint version: 0.22.0-fa05b039

ABCI app: custom (internal)

Environment:

  • Ubuntu 64-bit
  • 4 Validator Nodes
  • Tx size ~ 80KB

What happened: After a non-deterministic period of time, transactions that were successfully validated by CheckTx are no longer gossiped or removed from the mempool, and the node can't be used anymore.

What you expected to happen: After a tx is validated by CheckTx it should be gossiped to other nodes, and the proposer node should call BeginBlock.

How to reproduce it: We can reproduce this consistently with the setup described above, although the time until it happens is non-deterministic.
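For anyone trying to reproduce this, a quick way to spot the stuck state from the outside is to poll the node's RPC and watch /num_unconfirmed_txs against /status (both are standard Tendermint RPC endpoints). A minimal sketch, with the node address assumed:

```go
// Minimal sketch: poll a node's RPC to detect the stuck-mempool symptom,
// i.e. unconfirmed txs that never drain while the block height stops advancing.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func get(url string) string {
	resp, err := http.Get(url)
	if err != nil {
		return err.Error()
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return string(body)
}

func main() {
	// Assumed node address; adjust to your validator's RPC laddr.
	node := "http://127.0.0.1:26657"
	for {
		// Number of txs sitting in the mempool.
		fmt.Println(get(node + "/num_unconfirmed_txs"))
		// Latest block height; if it stops increasing while the count above
		// stays > 0, the node is showing the behavior described here.
		fmt.Println(get(node + "/status"))
		time.Sleep(10 * time.Second)
	}
}
```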

Config: create_empty_blocks = false, p2p parameters as described in "Running in production".
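For reference, the relevant pieces of config.toml look roughly like this. This is a sketch: create_empty_blocks is the key we changed, while the [p2p] values shown are illustrative placeholders rather than our exact production settings from the "Running in production" guide:

```toml
[consensus]
# Only propose a block when there are txs in the mempool.
create_empty_blocks = false

[p2p]
# Bandwidth settings tuned per the "Running in production" guide;
# the values below are placeholders, not our exact settings.
send_rate = 5120000
recv_rate = 5120000
```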

/dump_consensus_state output:

  • node 1: https://gist.github.com/yuomii/5acbd97d2935b7d1dd61d98e2ce5adb8
  • node 2: https://gist.github.com/yuomii/ab11b2914753080b39cd3923b07f5d6d
  • node 3: https://gist.github.com/yuomii/23f74831f9031b213186c00d83fd35eb
  • node 4: https://gist.github.com/yuomii/215862af5e899092987b41852e13018c

Full logs (log level debug):

  • node 1: https://gist.github.com/yuomii/918ed5271a909613b37649f1b8d152f9
  • node 2: https://gist.github.com/yuomii/d80a783cdb088039e1cc78689917c697
  • node 3: https://gist.github.com/yuomii/96b0e482d697d46e27648d3e73c36ae8
  • node 4: https://gist.github.com/yuomii/2082b674c996386499144c819785a164

Seems to be related to #1875

EDIT: dump_consensus_state was called ~2h later. After ~2h we noticed that blocks were being built again.

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 24 (20 by maintainers)


Most upvoted comments

Merged to develop. Will be shipped with 0.22.5 or next breaking release.

My pleasure 😃

This issue can be closed. Looks like it’s fixed so far.

Ok, I’ve run our 4-validator-node setup with the fixed version (backported locally to 0.22.4) and it is as stable as can be. We are going to run harder stress tests, but it looks like we can close this issue shortly. Thanks for the merge!

It would be great if we could get this into 0.22.5; if not, we can stay on my fork, but we’re also looking forward to your other changes/fixes.

Oh yeah. Look at that: I’ve added some logging around the channel. Here it is while it’s working:

I[07-20|17:06:08.949] Executed block                               module=state height=155 validTxs=1 invalidTxs=0
I[07-20|17:06:08.950] Committed state                              module=state height=155 txs=1 appHash=E11101D0ED035BE392DD8AEB89FF320F8ADC0B8E
I[07-20|17:06:08.950] Recheck txs                                  module=mempool numtxs=0 height=155
I[07-20|17:06:08.951] Indexed block                                module=txindex height=155
I[07-20|17:06:09.043] WSJSONRPC                                    module=rpc-server protocol=websocket remote=127.0.0.1:65141 method=broadcast_tx_sync
I[07-20|17:06:09.043] Added good transaction                       module=mempool tx=6AD2D37DC1D130DC4F8B2FDA921841067D287F2A res="&{CheckTx:fee:<value:0 > }" total=1
I[07-20|17:06:09.043] writing txsAvailable with height             module=mempool height=156
I[07-20|17:06:09.043] tx available channel for height              module=consensus height=156
I[07-20|17:06:09.043] enterPropose(156/0). Current: 156/0/RoundStepNewHeight module=consensus height=156 round=0

You can clearly see that the previous height had already been committed, so the added tx triggered a notification with the next height.

Now when it all stopped:

I[07-20|17:06:10.065] Executed block                               module=state height=157 validTxs=0 invalidTxs=0
I[07-20|17:06:10.065] Added good transaction                       module=mempool tx=99216F6CD1E273D3A18AA74D4360C383920FBA2F res="&{CheckTx:fee:<value:0 > }" total=1
I[07-20|17:06:10.065] writing txsAvailable with height             module=mempool height=157
I[07-20|17:06:10.065] WSJSONRPC                                    module=rpc-server protocol=websocket remote=127.0.0.1:65141 method=broadcast_tx_sync
I[07-20|17:06:10.066] Committed state                              module=state height=157 txs=0 appHash=7EB63D58748B07C861B67D87D384830F9BF6B971
I[07-20|17:06:10.066] Recheck txs                                  module=mempool numtxs=1 height=157
I[07-20|17:06:10.067] Done rechecking txs                          module=mempool 
I[07-20|17:06:10.067] Indexed block                                module=txindex height=157
I[07-20|17:06:10.067] tx available channel for height              module=consensus height=157
I[07-20|17:06:11.067] Timed out                                    module=consensus dur=995.766892ms height=158 round=0 step=RoundStepNewHeight
I[07-20|17:06:11.067] enterNewRound(158/0). Current: 158/0/RoundStepNewHeight module=consensus height=158 round=0
I[07-20|17:06:15.959] Ensure peers                                 module=p2p numOutPeers=0 numInPeers=0 numDialing=0 numToDial=10
I[07-20|17:06:15.959] No addresses to dial nor connected peers. Falling back to seeds module=p2p 
I[07-20|17:06:16.221] WSJSONRPC                                    module=rpc-server protocol=websocket remote=127.0.0.1:65141 method=status

Here you can see that it was notified with the height that was currently in the process of being committed.

Strangely, recheck-tx doesn’t seem to have any logic to trigger a new proposal. This is the one-node test case; I still need to grab debug logs.
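To make the suspected ordering issue concrete, here is a stripped-down sketch of a height-tagged notification like the one in these logs. The field and function names are only loosely modeled on the mempool code; this is an illustration of the race, not the actual implementation:

```go
// Sketch of the suspected race: the mempool notifies consensus over a
// 1-slot channel tagged with a height. If CheckTx adds a tx while the block
// at the current height is still being committed, the notification can carry
// the height being committed instead of the next one; with
// create_empty_blocks=false, consensus then never proposes the next block.
package main

import (
	"fmt"
	"sync"
)

type mempool struct {
	mtx          sync.Mutex
	height       int64
	notified     bool
	txsAvailable chan int64 // buffered, capacity 1
}

// Called when CheckTx adds a good transaction.
func (mem *mempool) notifyTxsAvailable() {
	mem.mtx.Lock()
	defer mem.mtx.Unlock()
	if !mem.notified {
		mem.notified = true
		// If update() has not bumped mem.height yet, this sends the height
		// that is still being committed; consensus ignores it, and since
		// notified stays true until the next update, no further
		// notification fires for this tx.
		mem.txsAvailable <- mem.height + 1
	}
}

// Called after a block is committed.
func (mem *mempool) update(newHeight int64) {
	mem.mtx.Lock()
	defer mem.mtx.Unlock()
	mem.height = newHeight
	mem.notified = false
}

func main() {
	// Mirrors the broken log above: a tx is added while height 157 is still
	// being committed, so the mempool height is still 156.
	mem := &mempool{height: 156, txsAvailable: make(chan int64, 1)}
	mem.notifyTxsAvailable() // sends 157, the height being committed
	mem.update(157)          // commit of 157 finishes afterwards
	fmt.Println("consensus notified for height", <-mem.txsAvailable)
	// Prints 157, but consensus has already moved on to 158, so it never
	// enters propose and the tx sits in the mempool.
}
```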

I was helping @xla investigate this yesterday. We ran the following things on the docker localnet-start default testnet, except with create_empty_blocks = false, on both master and develop, and observed the same behavior.

There were 3 separate bugs observed; only the 2nd is relevant to this issue.

  1. On running tm-bench -v -T 10 192.167.10.2:26657, more than 10 blocks were created, implying that something is off with how it waits to build blocks.
  2. Running
tm-bench -v -r 1000 -T 40 -s 80000 192.167.10.3:26657
<verbose output snipped>
D[07-19|17:20:30.634] 54 txs at block height 11                    
D[07-19|17:20:30.634] 15 txs at block height 10                    
D[07-19|17:20:30.634] 2 txs at block height 9                      
Stats          Avg       StdDev     Max     Total     
Txs/sec        2         9          54      71        
Blocks/sec     0.075     0.263      1       3 

caused the chain to halt. Each node would produce a ton of the following errors (the non-blocking enqueue behind this log line is sketched after this list):

`node1    | I[07-19|17:11:22.361] Internal msg queue is full. Using a go-routine module=consensus` 

After a bout of those, there would then be a ton of these errors:

node2    | I[07-19|17:24:48.336] Received a block part when we're not expecting any module=consensus height=12 round=15 index=741 peer=b61582b8614b481743681f0d93d4dabefe529c5a

with the index number increasing between each message.

After the received block part messages ended, the round number would increment and the process would repeat. The number of unconfirmed transactions did not decrease. Flushing the mempool on the proposer made blocks begin to be produced again.

  3. Ending the cluster (via a single CTRL+C) and restarting it would cause no blocks to be produced, due to a round regression error on each node. (This implies we’re not writing the round correctly somewhere; it was starting from round 0.)
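For context on the "Internal msg queue is full. Using a go-routine" lines in bug 2: that message comes from a non-blocking enqueue pattern where, once the buffered internal queue overflows, the message is pushed from a fresh goroutine, so internal messages can be processed out of order under heavy load. A rough sketch of the pattern (not the actual consensus code):

```go
// Sketch of a non-blocking enqueue with a goroutine fallback: when the
// buffered queue is full, the send is handed to a new goroutine so the
// caller never blocks, at the cost of possible message reordering.
package main

import "log"

type msgInfo struct {
	payload string
}

type consensusState struct {
	internalMsgQueue chan msgInfo
}

func (cs *consensusState) sendInternalMessage(mi msgInfo) {
	select {
	case cs.internalMsgQueue <- mi:
	default:
		// Queue is full: hand the message off to a goroutine so the caller
		// does not block. The trade-off is that this message may now be
		// delivered after messages that were enqueued later.
		log.Println("Internal msg queue is full. Using a go-routine")
		go func() { cs.internalMsgQueue <- mi }()
	}
}

func main() {
	cs := &consensusState{internalMsgQueue: make(chan msgInfo, 1)}
	cs.sendInternalMessage(msgInfo{"fits in the buffer"})
	cs.sendInternalMessage(msgInfo{"overflow: handed to a goroutine"})
	// Drain the queue; the overflow message arrives whenever its goroutine
	// gets scheduled.
	log.Println((<-cs.internalMsgQueue).payload)
	log.Println((<-cs.internalMsgQueue).payload)
}
```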