gaia: Node randomly stops syncing, after restart it's fine (for some time)
Summary of Bug
I'm running a Cosmos node and occasionally (now at least once a day) it just stops syncing. In the logs I can see messages like
2:11PM ERR Connection failed @ sendRoutine conn={"Logger":{}} err="pong timeout" module=p2p peer={"id":"5dc6a28f2caff8e61c47c1c9b658e7b1ea5fbfd9","ip":"5.9.42.116","port":26656}
and
2:11PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
It doesn't recover by itself; the only way to get it syncing again is to restart it (the container).
EDIT: restarting doesn't always help immediately; I get the same connection errors in the logs.
I also just tried with a freshly downloaded addrbook.json.
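Since the only recovery so far has been a container restart, a watchdog can detect the stall via the Tendermint RPC and restart automatically. A minimal sketch, assuming the RPC is exposed on localhost:26657 and the container is named `gaia` (both are assumptions; adjust for your setup):

```bash
#!/usr/bin/env bash
# Watchdog sketch: restart the node container when the block height stops
# advancing. Assumes the Tendermint RPC listens on localhost:26657 and the
# container is named "gaia" (hypothetical name). Requires curl and jq.
prev=0
while true; do
  height=$(curl -s localhost:26657/status | jq -r '.result.sync_info.latest_block_height')
  if [ -n "$height" ] && [ "$height" = "$prev" ]; then
    echo "$(date) height stuck at $height, restarting container"
    docker restart gaia
  fi
  prev=$height
  sleep 300  # check every five minutes
done
```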
Version
v7.1.0
Steps to Reproduce
I'm just running a node with `gaiad start --x-crisis-skip-assert-invariants`.
For Admin Use
- Not duplicate issue
- Appropriate labels applied
- Appropriate contributors tagged
- Contributor assigned/self-assigned
- Is a spike necessary to map out how the issue should be approached?
About this issue
- State: closed
- Created 2 years ago
- Comments: 26 (7 by maintainers)
@bb4L this is the conclusion we ended with: there's an interplay between network traffic and node performance. This is a Tendermint/Comet-level issue that we think has been addressed in versions after v8. Currently, v8/v9 are not supported in production, only for archive-related purposes, therefore closing this issue. For future versions, we will ask the Comet team to include longer-term tests with heavy RPC/REST loads to confirm that there is no regression and that the performance characteristics are understood.
We have the same problem when running chihuahuad (based on the Cosmos SDK). The problem occurs for us only when the REST API is enabled and some application tries to download all accounts using the paginated endpoint cosmos/auth/v1beta1/accounts. At that moment we can see this output in our node logs:

May 04 09:01:23 chihuahua chihuahuad[674]: 9:01AM ERR Connection failed @ sendRoutine conn={"Logger":{}} err="pong timeout" module=p2p peer={"id":"28c227d31064e4bacb366055d796f0c3064c1db0","ip":"149.202.72.186","port":26613}
May 04 09:01:26 chihuahua chihuahuad[674]: 9:01AM INF service stop impl={"Logger":{}} module=p2p msg={} peer={"id":"28c227d31064e4bacb366055d796f0c3064c1db0","ip":"149.202.72.186","port":26613}
May 04 09:01:27 chihuahua chihuahuad[674]: 9:01AM ERR Stopping peer for error err="pong timeout" module=p2p peer={"Data":{},"Logger":{}}
May 04 09:01:30 chihuahua chihuahuad[674]: 9:01AM INF service stop impl={"Data":{},"Logger":{}} module=p2p msg={} peer={"id":"28c227d31064e4bacb366055d796f0c3064c1db0","ip":"149.202.72.186","port":26613}
May 04 09:01:44 chihuahua systemd[1]: node.service: Main process exited, code=killed, status=9/KILL
May 04 09:01:44 chihuahua systemd[1]: node.service: Failed with result 'signal'.
Disabling the API fully solves the problem. Upscaling the VPS from 4 cores/8 GB to 16 cores/64 GB RAM does not solve it. This issue affects all Cosmos SDK projects. It seems the issue may be closed.
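For reference, the REST API that the commenter disabled is toggled in the Cosmos SDK's app.toml. A sketch of that workaround:

```toml
# app.toml

[api]
# Disabling the API server (REST/gRPC-gateway), which the commenter
# reports as a workaround for the pong timeouts
enable = false
```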
I can assure you that I have noticed this same issue on other Cosmos SDK chains (Secret and Terra2) several times. This is not gaia-specific; there is something else upstream. It started happening a couple of months ago. I am sorry I have not been able to narrow it down beyond the timing and the chains on which we've seen this exact issue.
Can't tell, since it's happening without me doing anything / without any high RPC load.
For me the effect also appears on nodes that aren't used by applications (so it can't be a purely load-related issue).
CPU/memory looks fine on my instance(s).
@mmulji-ic thanks for the information; let me know if you need anything from my side.
@adizere
- gaia version: 7.1.0, 8.0.1 as well as 9.0.0 (as written in the issue/other comments)
- config.toml has no peers section (at least mine hasn't); output of `cat config.toml | grep peer`:
- minimal way to reproduce: I guess just try to run a node 🤷🏽‍♂️
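For context, these are the peer-related keys that `grep peer` would typically match in a Tendermint config.toml (a sketch based on the v0.34 defaults; exact values vary by version):

```toml
[p2p]
# Comma-separated list of nodes to keep persistent connections to
persistent_peers = ""
# Maximum number of inbound peers
max_num_inbound_peers = 40
# Maximum number of outbound peers to connect to, excluding persistent peers
max_num_outbound_peers = 10
```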
@adizere would you recommend doing a tendermint debug dump?
We also ran into this issue – our node would never make it 24 hours without halting syncing.
We resolved it by switching to rocksdb, using the address book at https://polkachu.com/addrbooks/cosmos, and increasing the number of outbound peers to 200. The node has now been stable for 4+ days.
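For anyone wanting to replicate that workaround, it maps to config.toml settings roughly like the following (a hedged sketch: rocksdb additionally requires a gaiad binary built with rocksdb support, and the downloaded address book replaces the node's addrbook.json, found under ~/.gaia/config by default):

```toml
# config.toml

# Switch the backing store from the default goleveldb to rocksdb
# (requires a binary built with rocksdb support)
db_backend = "rocksdb"

[p2p]
# Raise the outbound peer cap, as described in the comment above
max_num_outbound_peers = 200
```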