btcd: Bad sync peer stalls sync

I talked about this on IRC, so here is an overview.

I noticed that my node failed to catch up. After restarting, it would catch up for some time and stall again.

I noticed that when the syncing stalls, my sync peer is always this one:

  {                                                                                                                                                                                                                              
    "id": 21,                                                                                                                                                                                                                    
    "addr": "46.4.167.78:46826",                                                                                                                                                                                                 
    "addrlocal": "80.67.181.176:8333",                                                                                                                                                                                           
    "services": "00000005",                                                                                                                                                                                                      
    "relaytxes": true,                                                                                                                                                                                                           
    "lastsend": 1503432722,                                                                                                                                                                                                      
    "lastrecv": 1503432722,                                                                                                                                                                                                      
    "bytessent": 29326,                                                                                                                                                                                                          
    "bytesrecv": 798121,                                                                                                                                                                                                         
    "conntime": 1503431192,                                                                                                                                                                                                      
    "timeoffset": 0,                                                                                                                                                                                                             
    "pingtime": 56181,                                                                                                                                                                                                           
    "version": 70013,                                                                                                                                                                                                            
    "subver": "/btcwire:0.5.0/btcd:0.12.0/",                                                                                                                                                                                     
    "inbound": true,                                                                                                                                                                                                             
    "startingheight": 481628,                                                                                                                                                                                                    
    "currentheight": 481628,                                                                                                                                                                                                     
    "banscore": 0,                                                                                                                                                                                                               
    "feefilter": 0,                                                                                                                                                                                                              
    "syncnode": true                                                                                                                                                                                                             
  }

So after removing this peer, btcd picks another sync peer and syncing resumes. Until that peer goes offline and this same inbound btcd peer is chosen as sync peer and it stalls again.

So there are two problems here:

  • When there is no sync progress, we should chose another sync peer.
  • Do we favor some peers over others to be sync peers? (Like btcd ones f.e.?) Maybe that’s not a very good idea.

Just to be clear: I reproduced this several times.

  1. notice stall with btcd sync peer
  2. disconnect sync peer
  3. notice progress with other peer
  4. notice that same btcd node reconnects inbound
  5. disconnect sync peer
  6. notice stall from btcd sync peer

This is some part of debug log that includes the traffic right after the btcd sync peer is picked: https://gist.github.com/stevenroose/f60df52886c2dfdb71985a29cbcd80a6

This btcd node seems to be either malicious or without bandwidth to upload. It might be a bug in btcd, but I highly doubt that. But the fact that we keep favoring him to do sync is bad anyway.

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 83 (23 by maintainers)

Commits related to this issue

Most upvoted comments

To workaround this, I ended up implementing a crude watchdog wrapper around btcd that monitors for potential stalls, disconnects the current syncnode if detected, and restarting btcd if too many stalls are detected. The gist is here (https://gist.github.com/jleskovar/dfc545148398d81715da02f61bf39b91) if it helps anyone. Requires the ‘jq’ package for parsing json

@2pac1 I can confirm that adding --nolisten has no effect on this issue. It still fails after a short period with just “lost peer” and “new valid peer” messages in the logs. So the btcd is still unusable. In the last 3 days of starting it every 20 minutes or so, I have managed to gain 1 in height.

It’s a massive change, and as usual in open source, finding individuals to properly review and test is a difficult task. As is, there are only a handful of individuals with knowledge of the code base that are able to properly review a change like that which has far reaching primary and secondary implications within the code base. I can’t give any ETA on the completion of the task at this point. A short term fix to this issue is to disconnect these peers by looking in getpeerinfo and removing the syncPeer.

I think the root culprit for this issue has been idenfitifed above which is extremely promising!

On Tue, Oct 30, 2018, 4:09 PM Josh Ellithorpe notifications@github.com wrote:

@Roasbeef https://github.com/Roasbeef any ETA on that? Also, for now can’t we just avoid selecting peers with a very high ping time? It doesn’t guarantee that the node is super fast, but if the ping time sucks, then it obviously will be slow…

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/btcsuite/btcd/issues/1024#issuecomment-434503193, or mute the thread https://github.com/notifications/unsubscribe-auth/AA87LlImokChJMvbvXDpPMCshpxWyfacks5uqNw1gaJpZM4O_Ie4 .

I am going to try and trace this down. Seems like it is affecting everyone…

this really needs fixing asap

I have pushed a very rough update addressing several issues:

https://github.com/cubancorona/btcd/commit/716b7f70f385871e914ebc5af7aaac97334d1f03

When a Peer sends an outgoing message, it sends a message to its stallHandler() thread to tell it that there is an outgoing message. The stallHandler() determines whether it is a message that expects a response, and if the message does expect a response, the stallHandler() makes an entry in the pendingResponses map corresponding to the outgoing message for which a response is expected.

When the outgoing message is a getdata message, the message can be requesting a variety of different objects, including for example a block or a transaction. The potential problem is that the stallHandler() does not distinguish among getdata messages for different types of objects. So, when a getdata message is sent to a Peer, any subsequent incoming data from that peer (including a block message OR a transaction message) will satisfy the stallHandler()’s deadline for the getdata request.

As an example, a getdata message requesting a block from the current syncPeer would initiate a timer for the expected response. Because any incoming data will satisfy the deadline for the request, an incoming transaction message from the syncPeer would satisfy the deadline. In fact, this incoming message should not satisfy the deadline because the syncPeer should have responded with a block message and not a transaction message.

From peer.go:

	case wire.CmdGetData:
		// Expects a block, merkleblock, tx, or notfound message.
		pendingResponses[wire.CmdBlock] = deadline
		pendingResponses[wire.CmdMerkleBlock] = deadline
		pendingResponses[wire.CmdTx] = deadline
		pendingResponses[wire.CmdNotFound] = deadline

and

			case sccReceiveMessage:
				// Remove received messages from the expected
				// response map.  Since certain commands expect
				// one of a group of responses, remove
				// everything in the expected group accordingly.
				switch msgCmd := msg.message.Command(); msgCmd {
				case wire.CmdBlock:
					fallthrough
				case wire.CmdMerkleBlock:
					fallthrough
				case wire.CmdTx:
					fallthrough
				case wire.CmdNotFound:
					delete(pendingResponses, wire.CmdBlock)
					delete(pendingResponses, wire.CmdMerkleBlock)
					delete(pendingResponses, wire.CmdTx)
					delete(pendingResponses, wire.CmdNotFound)

I’m concerned that this patch will not resolve the issue. The patch relies on the stallHandler() detecting a syncPeer that is not making progress. If stallHandler() were already detecting such a stalled syncPeer, then the stallHandler() would already be disconnecting that peer. From the user posts above, it is apparent that stalled syncPeers remain connected.

So, the stallHandler() is not currently detecting a syncPeer that stalls in this manner.

Can anyone run in debug mode and post their debug logs when they stall? I think that would be useful for everyone. Then we can compare output.

This is really really really bad. How come, I ask myself … we are all running into the same issue from Aug 22, 2017, and the fix proposed from @pankajagarwal is just adding p.Disconnect() into the peer.go. I’m missing something?

What I did to speed things up was look at two or three different block explorers for the latest block hash and height, and then add a custom checkpoint. This makes your node skip tx validation up to that point.

Simply add addcheckpoint=<height>:<hash> to your config file or CLI.

This bug is critical, I’ve been struggling with manual restarts and IP banning for almost a week now.

I am completely synced up with testnet without intervening due to stalling.

@davecgh Is this project not active? Seems there are critical bugs which break base functionality which have gone without attention for years.

“btcd” is one of the “building blocks” advertised in the lightning network tutorial.

If “btcd” does not works out of the box, then newcomers will be scared away. They want to experience lightning network rather than debugging “btcd”!

Hope that lightningnetwork developer will spend more time on btcd. It needs to be rock solid and stable.

If this project is dead, please modify the lightning network tutorial and advise everyone to use “bitcoind” instead

The solution for slow peers is a full blown multi-peer sync. This is a very large change though, and we’ve been working to get some of the necessary refactorings and restructuring with btcd ready to take on the project.

Also “stallHandler” doesn’t solve the problem with slow peers. The best solution is probably to randomly pick a sync peer periodically from the outbound ones. You could be DoS’d by a peer just veeery slowly giving you blocks so you won’t go forward.

@davecgh What work has been done towards multi-peer syncing? What is the major hurdle?

I could potentially devote some time to this, as it’s a major UX problem.

@nutmix you can try my fork until the PR is accepted and merged. https://github.com/pankajagarwal/btcd

I have not restarted my process for last 3 days and it’s progressing well with the same pace.

It will be a good test but 1% in a week is tooo slow. Do you mind sharing some hardware stats on your setup ? What i have noticed is the syncup goes really fast until 250-300K’th block and after that it’s comes down to 6-10 blocks per 10 seconds with/or without checkpoint config. I have also validated that it’s not a network bandwidth issue of any sort.

@xelawafs You can’t copy block data from btcd to or from bitcoind. The database is different.

Even though I’m already getting outrageous bandwidth usage, if you need a sane btcd node with high bandwidth, you can use btcd.roose.io. (It’s most of the time at its limit in number of peers though, so it could possibly refuse to connect.)

@darwin The best work around for this issue was posted above by @jleskovar. I have used his script for three days and I receive close to continuous blocks.

https://gist.github.com/jleskovar/dfc545148398d81715da02f61bf39b91

Until a fix can be made: checkout the above.