tendermint: Transactions are stuck in mempool and not getting propagated to other nodes
BUG REPORT
Tendermint version 0.22.0-fa05b039
ABCI app Custom Internal
Environment:
- Ubuntu 64-bit
- 4 Validator Nodes
- Tx size ~ 80KB
What happened: After a non-deterministic period of time, transactions that were successfully validated by CheckTx are neither gossiped nor removed from the mempool, and the node can no longer be used.
What you expected to happen: After a tx is validated by CheckTx, it should be gossiped to the other nodes, and the proposer node should call BeginBlock.
How to reproduce it: We can reproduce this consistently with the setup described above, although the time until it occurs varies.
Config: create_empty_blocks=false; p2p params as described in “running in production”
/dump_consensus_state output for consensus bugs
consensus node 1:
https://gist.github.com/yuomii/5acbd97d2935b7d1dd61d98e2ce5adb8
consensus node 2:
https://gist.github.com/yuomii/ab11b2914753080b39cd3923b07f5d6d
consensus node 3:
https://gist.github.com/yuomii/23f74831f9031b213186c00d83fd35eb
consensus node 4:
https://gist.github.com/yuomii/215862af5e899092987b41852e13018c
Logs: Full logs (log level debug):
node 1:
https://gist.github.com/yuomii/918ed5271a909613b37649f1b8d152f9
node 2:
https://gist.github.com/yuomii/d80a783cdb088039e1cc78689917c697
node 3:
https://gist.github.com/yuomii/96b0e482d697d46e27648d3e73c36ae8
node 4:
https://gist.github.com/yuomii/2082b674c996386499144c819785a164
Seems to be related to #1875
EDIT: dump_consensus_state was called 2h later. After ~2h we noticed that blocks were being built again.
About this issue
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 24 (20 by maintainers)
Commits related to this issue
- #1920 try to fix race condition on proposal height for published txs - related to create_empty_blocks=false - published height for accepted tx can be wrong (too low) - use the actual mempool height +... — committed to srmo/tendermint by deleted user 6 years ago
- #1920 add initial test for mempool.Height() - not sure how to test the lock - can the mutex reference be of type Locker? -- this way, we can use a "mock" of the mutex to test triggering — committed to srmo/tendermint by deleted user 6 years ago
- #1920 use the ConsensusState height in favor of mempool - gets rid of indirections - doesn't need any "+1" magic — committed to srmo/tendermint by deleted user 6 years ago
- #1920 cosmetic - if we use cs.Height, it's enough to evaluate right before propose — committed to srmo/tendermint by deleted user 6 years ago
- #1920 cleanup TODO and non-needed code — committed to srmo/tendermint by deleted user 6 years ago
- #1920 add changelog entry — committed to srmo/tendermint by deleted user 6 years ago
- fix race condition on proposal height for published txs (#2021) * #1920 try to fix race condition on proposal height for published txs - related to create_empty_blocks=false - published height fo... — committed to tendermint/tendermint by srmo 6 years ago
- txAvailable is always true Refs #2021, #1920 — committed to tendermint/tendermint by melekes 6 years ago
- remove debug message No additional value. `enterPropose` log message should be enough. Refs #2021, #1920 — committed to tendermint/tendermint by melekes 6 years ago
- #2021 follow up (#2028) * update changelog * txAvailable is always true Refs #2021, #1920 * remove debug message No additional value. `enterPropose` log message should be enough. Refs ... — committed to tendermint/tendermint by melekes 6 years ago
Merged to develop. Will be shipped with 0.22.5 or next breaking release.
My pleasure 😃
This issue can be closed. Looks like it’s fixed so far.
Ok, I’ve run our 4 validator node setup with the fixed version (backported locally to 0.22.4) and it is as stable as can be. We are going to run harder stress tests but it looks like we can close this issue on short notice. Thanks for the merge!
Would be great if we get this in 0.22.5 - if not, we can stay on my fork but also looking forward to your other changes/fixes.
Oh yeah. Look at that: I’ve added some logging around the channel. Here is the output while it’s working:
You can clearly see that the last height was finally committed, so the added tx was notified with the next height.
Now when it all stopped:
Here you see that it was notified with the height that was still in the process of being committed.
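To make the two cases above concrete, here is a minimal toy model in Go. It is a sketch under assumptions, not Tendermint code: `toyConsensus`, `proposeAtPublishedHeight`, `proposeAtOwnHeight` and `simulate` are hypothetical names invented for illustration. It only models the idea from the related commits that the published height for an accepted tx can be too low, and that using the consensus height instead avoids the mismatch.

```go
package main

import "fmt"

// toyConsensus is a hypothetical, heavily simplified proposer.
// It is NOT Tendermint's ConsensusState.
type toyConsensus struct {
	height int64 // height the node is currently trying to propose
}

// proposeAtPublishedHeight models the failing path: the notification carries
// a height chosen by the mempool, and a mismatch is silently dropped.
func (c *toyConsensus) proposeAtPublishedHeight(published int64) (int64, bool) {
	if published != c.height {
		return 0, false // stale notification: no proposal is triggered
	}
	return c.height, true
}

// proposeAtOwnHeight models the direction of the fix: the notification is a
// bare signal and the proposer uses its own current height.
func (c *toyConsensus) proposeAtOwnHeight() (int64, bool) {
	return c.height, true
}

// simulate replays one tx arriving around the commit of height `committing`
// and reports whether each strategy ends up triggering a proposal.
func simulate(committing int64, txArrivesDuringCommit bool) (byPublished, byOwn bool) {
	cs := &toyConsensus{height: committing}

	// Normal case: the tx arrives after the commit, so the next height is published.
	published := committing + 1
	if txArrivesDuringCommit {
		// Race: the tx is published with the height still being committed.
		published = committing
	}

	cs.height = committing + 1 // the commit finishes and consensus advances

	_, byPublished = cs.proposeAtPublishedHeight(published)
	_, byOwn = cs.proposeAtOwnHeight()
	return byPublished, byOwn
}

func main() {
	for _, during := range []bool{false, true} {
		byPublished, byOwn := simulate(5, during)
		fmt.Printf("tx during commit=%v: published-height proposes=%v, own-height proposes=%v\n",
			during, byPublished, byOwn)
	}
}
```

With create_empty_blocks=false nothing else triggers a proposal in this model, so the dropped notification in the second case corresponds to the stall described in the report.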
Strangely, re-check TX doesn’t seem to have any logic to trigger a new proposal. This is the one-node test case. Still need to grab debug logs.
I was helping @xla investigate this yesterday. We ran the following things on the default `docker localnet-start` testnet, except with `create_empty_blocks = false`, on both master and develop, and observed the same behavior. There were 3 separate bugs observed; only the 2nd bug is relevant to this issue.
- `tm-bench -v -T 10 192.167.10.2:26657`: more than 10 blocks were created, implying that something is off with how it's waiting to build blocks.
- […] caused the chain to halt. Each node would produce a ton of the following errors:
After a bout of those, there would then be a ton of these errors:
with the index number increasing between each message.
After the received block part messages ended, the round number would increment and the process would repeat. The number of unconfirmed transactions did not decrease. Flushing the mempool on the proposer made blocks begin to be produced again.
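The comment does not say how the mempool was flushed or how the unconfirmed count was read. One plausible way, sketched here as an assumption, is through the node's RPC endpoints `num_unconfirmed_txs` and `unsafe_flush_mempool`. The latter is only served when unsafe RPC routes are enabled in the node config. The helper names `rpcURL` and `call` are hypothetical, and the node address is passed on the command line.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// rpcURL builds the HTTP URL for a Tendermint RPC endpoint on hostPort.
func rpcURL(hostPort, endpoint string) string {
	u := url.URL{Scheme: "http", Host: hostPort, Path: "/" + endpoint}
	return u.String()
}

// call issues a GET request to the endpoint and returns the raw response body.
func call(client *http.Client, hostPort, endpoint string) (string, error) {
	resp, err := client.Get(rpcURL(hostPort, endpoint))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: flush <host:port>   (for example 192.167.10.2:26657)")
		return
	}
	node := os.Args[1]
	client := &http.Client{Timeout: 5 * time.Second}

	// Read the unconfirmed count, flush the mempool, then read the count again.
	steps := []string{"num_unconfirmed_txs", "unsafe_flush_mempool", "num_unconfirmed_txs"}
	for _, endpoint := range steps {
		body, err := call(client, node, endpoint)
		if err != nil {
			fmt.Println(endpoint, "failed:", err)
			continue
		}
		fmt.Println(endpoint, "->", body)
	}
}
```

Flushing discards every pending transaction on that node, so this is a diagnostic workaround rather than a fix.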