lnd: [bug]: LND frequently loses sync to chain with pruned backend

Background

I run LND compiled from source on an Ubuntu VPS. I use Bitcoin Core as my backend, pruned to 100GB (prune=100000). I have noticed on various occasions that LND loses its sync to chain (`"synced_to_chain": false`) and does not recover by itself. A restart always fixes the issue promptly, although in some cases LND cannot be gracefully shut down. Several peers of mine have the same issue, and all are running a very similar setup. The issue seems to be related to pruning: the more aggressively you prune the node, the more often LND loses sync to the chain.

Your environment

  • 0.17.1-beta commit=lightning-terminal-v0.12.1-alpha-dirty
  • Linux 5.4.0-167-generic
  • Bitcoin Core Version: 240100
  • prune=100000

lnd.tor.active=true lnd.tor.skip-proxy-for-clearnet-targets=1

Steps to reproduce


Expected behaviour

I expect LND to stay synced to chain as long as the backend is synced to chain.

Actual behaviour

LND occasionally loses sync to chain and does not recover while Bitcoin Core is running and synced.

I have about 1.5h of logs and a go profile attached below.

04lnd.log 03lnd.log 02lnd.log 01lnd.log goprofile.txt

About this issue

  • State: open
  • Created 7 months ago
  • Reactions: 2
  • Comments: 44 (21 by maintainers)

Most upvoted comments

Ok I was able to track down the issue, will provide a fix shortly.

So the problem is here:

When a block is not on our local backend (we already pruned it) we will try to fetch it from other nodes here:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/bitcoind_conn.go#L474

Now the problem:

We create a new block request here:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L504-L519

Here comes the problem: the 2-second timeout for the first try and the 4 seconds for the second try are too short. That is the root cause of the many timeouts we are seeing; the block is often delivered just a couple of seconds later.

When we hit two timeouts we never delete the block from d.blocksQueried; that only happens if we receive the block in time:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L650

So the problem arises when the same block is requested a second time (which might take a while to occur): the dispatcher then thinks there is already a request in flight, see here:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L544-L553

So a new request is never launched and the GetBlock call hangs forever, eventually blocking the ValidationBarrier.
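
To make the failure mode concrete, here is a minimal, self-contained Go sketch (toy types, not the actual btcwallet dispatcher code) of an in-flight map that is only cleaned up on the success path; a timed-out request leaves a stale entry behind, and the next caller for the same block waits forever:

package main

import (
    "fmt"
    "time"
)

// dispatcher is a toy stand-in for the pruned block dispatcher: it tracks
// in-flight block requests in a map keyed by block hash, like the
// blocksQueried map discussed above.
type dispatcher struct {
    blocksQueried map[string][]chan struct{}
}

// query registers interest in a block. If an entry already exists in
// blocksQueried, it assumes a request is already in flight and only
// subscribes instead of launching a new fetch.
func (d *dispatcher) query(hash string) <-chan struct{} {
    ch := make(chan struct{}, 1)
    if subs, ok := d.blocksQueried[hash]; ok {
        d.blocksQueried[hash] = append(subs, ch)
        return ch
    }
    d.blocksQueried[hash] = []chan struct{}{ch}

    // Simulated network fetch that times out: the success path would
    // delete the map entry and notify all subscribers, but the timeout
    // path returns without doing either -- which is the bug.
    go func() { time.Sleep(50 * time.Millisecond) }()
    return ch
}

func main() {
    d := &dispatcher{blocksQueried: make(map[string][]chan struct{})}

    _ = d.query("blockA")              // first request "times out"
    time.Sleep(100 * time.Millisecond) // give the fake fetch time to fail

    select {
    case <-d.query("blockA"): // second request for the same block
        fmt.Println("got block")
    case <-time.After(200 * time.Millisecond):
        fmt.Println("second GetBlock for blockA hangs: stale blocksQueried entry")
    }
}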

So I think there are several things which can be fixed immediately, will come up with a plan tomorrow.

EDIT: Huge thanks to @M1ch43lV for running the draft PR to extract the logs which lead to the finding of the bug 🙏

Moreover, while trying to reproduce the bug I noticed that we favour inbound Tor connections over inbound clearnet connections. (https://github.com/btcsuite/btcwallet/blob/master/chain/pruned_block_dispatcher.go#L281)

We poll and connect to peers. The problem is the following: we use the getpeerinfo command and then use the addr field of its response.

Now comes the problem: for inbound clearnet connections, bitcoind lists the remote side's ephemeral (outgoing) port in the addr field:

  {
    "id": 981270,
    "addr": "24.144.98.209:53546",
    "addrbind": "10.0.0.3:8333",
    "network": "ipv4",
    "services": "000000000000040d",
    "servicesnames": [
      "NETWORK",
      "BLOOM",
      "WITNESS",
      "NETWORK_LIMITED"
    ],
...

Don’t be confused by the bind addr (I am running behind a VPN).

As we can see, the addr field lists the outgoing port of the connection. When we try to connect to that address we obviously fail, because it is not the peer’s listening port. So we discard every inbound clearnet peer and can only connect to outbound peers or inbound Tor peers. That explains why the node in this example had only Tor peers during the block-fetch retries. Normally bitcoind has only 8 outbound peers and, depending on the settings, many more inbound peers, so the probability that we end up with Tor-only nodes is very high.

So I think we should sort the peer connections by outbound/inbound, try all the outbound peers first and only then try the inbound Tor peers. I am not sure whether we can fetch the listening port via a bitcoind RPC command; I searched but could not find anything.
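
As a rough sketch of that ordering (the peerInfo struct below is a trimmed-down, illustrative stand-in for a getpeerinfo entry; only the addr/inbound/network keys match bitcoind's output, everything else is made up):

package main

import (
    "fmt"
    "sort"
)

// peerInfo is a trimmed-down stand-in for a getpeerinfo entry; only the
// fields needed for ordering are kept.
type peerInfo struct {
    Addr    string `json:"addr"`
    Inbound bool   `json:"inbound"`
    Network string `json:"network"` // "ipv4", "ipv6", "onion", ...
}

// rankPeers drops inbound clearnet peers (their addr carries an ephemeral
// port we cannot connect back to) and orders the rest so that outbound
// peers come first, followed by inbound onion peers.
func rankPeers(peers []peerInfo) []peerInfo {
    var ranked []peerInfo
    for _, p := range peers {
        if p.Inbound && p.Network != "onion" {
            continue // connecting back to this addr would fail anyway
        }
        ranked = append(ranked, p)
    }
    sort.SliceStable(ranked, func(i, j int) bool {
        // Outbound peers before inbound (onion) peers.
        return !ranked[i].Inbound && ranked[j].Inbound
    })
    return ranked
}

func main() {
    peers := []peerInfo{
        {Addr: "24.144.98.209:53546", Inbound: true, Network: "ipv4"},
        {Addr: "exampleonionpeer.onion:8333", Inbound: true, Network: "onion"},
        {Addr: "203.0.113.7:8333", Inbound: false, Network: "ipv4"},
    }
    fmt.Println(rankPeers(peers))
}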

I think batch is nil because the entry gets retried and added to currentQueries, but the batch timer fires, which deletes the batch from currentBatches: https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L412-L426. I think this could happen if the node added the batch here https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L449-L455 (starting the batch timer) and had no peers, or peers that were slow to complete the version-verack handshake.

The crux of the issue seems to be that if we have slow peers (e.g. tor) or peers without the block, we don’t attempt to rotate out the connections in favor of new ones. We do call Punish but that only affects the ordering of work queue items and doesn’t cause a disconnect. While https://github.com/lightningnetwork/lnd/pull/8151 fixed one specific case of a shutdown hang, it seems like there could be other places where GetBlock calls block (no pun intended lol) so it wasn’t a complete fix.

While we’re at it we could probably also bump up the 2 second timeout because tor nodes have higher latency and we don’t want to fail getting a block.

Analysing this a bit more I think I found the problem:

The ChannelRouter initializes the ValidationBarrier with 4 * NumCPU slots; for the tunnelsats node (nproc = 4) that is 16:

https://github.com/lightningnetwork/lnd/blob/master/routing/router.go#L1145
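
As a toy illustration of why this matters (this is not the actual ValidationBarrier code, just a sketch of a fixed slot pool in which no slot is ever released once the occupying goroutine hangs):

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    // Validation slot pool sized like the barrier: 4 * NumCPU,
    // i.e. 16 slots on a 4-core machine.
    slots := make(chan struct{}, 4*runtime.NumCPU())

    // stuckGetBlock mimics a GetBlock call that never returns.
    stuckGetBlock := func() { select {} }

    // Occupy every slot with a goroutine that never releases it.
    for i := 0; i < cap(slots); i++ {
        slots <- struct{}{}
        go func() {
            defer func() { <-slots }() // would release the slot on return
            stuckGetBlock()
        }()
    }

    // The next network update cannot acquire a slot and waits forever.
    select {
    case slots <- struct{}{}:
        fmt.Println("acquired a slot")
    case <-time.After(time.Second):
        fmt.Printf("all %d slots held by stuck goroutines; the router cannot advance\n", cap(slots))
    }
}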

The pprof capture from the time the problem occurs counts exactly 16 stuck goroutines occupying the 16 slots, so we are effectively deadlocked in terms of advancing the block height for the router:

Two of the stuck goroutines are here:

goroutine 5730594 [select, 1467 minutes]: 2 times: [[5730594, 5626273]
github.com/btcsuite/btcwallet/chain.(*BitcoindConn).GetBlock(0xc00017c790, 0xc00f8967e0)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231017144732-e3ff37491e9c/chain/bitcoind_conn.go:479 +0x1d1
github.com/btcsuite/btcwallet/chain.(*BitcoindClient).GetBlock(0xc013eeaf50?, 0x0?)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231017144732-e3ff37491e9c/chain/bitcoind_client.go:144 +0x17
github.com/lightningnetwork/lnd/blockcache.(*BlockCache).GetBlock(0xc006255970, 0xc00f8967e0, 0xc000bd3ac8)
        github.com/lightningnetwork/lnd/blockcache/blockcache.go:55 +0x22a
github.com/lightningnetwork/lnd/lnwallet/btcwallet.(*BtcWallet).GetBlock(0x100c007e22b78?, 0x7f87c4635958?)
        github.com/lightningnetwork/lnd/lnwallet/btcwallet/blockchain.go:139 +0x1de
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTx(0xc000a44280, 0xc00a4052f4)
        github.com/lightningnetwork/lnd/routing/router.go:1780 +0x5c
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).processUpdate(0xc000a44280, {0x1c8e960?, 0xc012c62f20?}, {0xc006381cb0, 0x1, 0x1})
        github.com/lightningnetwork/lnd/routing/router.go:1555 +0xea7
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc000a44280, 0xc0103a5470, 0xc00e7f0930)
        github.com/lightningnetwork/lnd/routing/router.go:1031 +0x10d
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler in goroutine 32469
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0xbbf

and 14 of them are stuck in WaitForDependants:

goroutine 5900726 [select, 1377 minutes]: 14 times: [[5900726, 7110031, 5908580, 6245797, 6464543, 7261552, 5911043, 5789756, 5785513, 5910633, 6017184, 6013681, 7107010, 6857809]
github.com/lightningnetwork/lnd/routing.(*ValidationBarrier).WaitForDependants(0xc0103a5470, {0x1c29420?, 0xc010d4cb60?})
        github.com/lightningnetwork/lnd/routing/validation_barrier.go:238 +0x605
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc000a44280, 0xc0103a5470, 0xc010826750)
        github.com/lightningnetwork/lnd/routing/router.go:1010 +0xb0
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler in goroutine 32469
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0xbbf

So I am not sure why the GetBlock call in the first cited goroutine hangs. This might be pruned-specific, but I think this behaviour can also happen without a pruned node in place?

Wdyt @Crypt-iQ ?

Here comes the problem: the 2-second timeout for the first try and the 4 seconds for the second try are too short. That is the root cause of the many timeouts we are seeing; the block is often delivered just a couple of seconds later.

I mention a route here to dynamically tune the timeout, which shouldn’t be too involved: https://github.com/lightninglabs/neutrino/issues/292

As a starting point, we can use the reported ping latency from the getpeerinfo call. Then we can apply an RTT multiplier on top of a processing multiplier. Each time we get a response, we use that to tune what the actual timeout should be (some sort of moving weighted average).
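
A minimal sketch of what such a dynamic timeout could look like (names and constants are purely illustrative, not an existing API): seed the estimate from getpeerinfo's ping, fold observed response times into a weighted moving average, and apply a multiplier with a sane floor:

package main

import (
    "fmt"
    "time"
)

type peerTimeout struct {
    avgRTT time.Duration // EWMA of observed round trips
    alpha  float64       // smoothing factor for the moving average
    mult   float64       // headroom for block serialization/processing
    floor  time.Duration // never go below this
}

func newPeerTimeout(pingFromGetPeerInfo time.Duration) *peerTimeout {
    return &peerTimeout{
        avgRTT: pingFromGetPeerInfo,
        alpha:  0.3,
        mult:   4.0,
        floor:  2 * time.Second,
    }
}

// observe folds a measured response time into the moving average.
func (p *peerTimeout) observe(rtt time.Duration) {
    p.avgRTT = time.Duration(p.alpha*float64(rtt) + (1-p.alpha)*float64(p.avgRTT))
}

// current returns the timeout to use for the next request to this peer.
func (p *peerTimeout) current() time.Duration {
    t := time.Duration(float64(p.avgRTT) * p.mult)
    if t < p.floor {
        return p.floor
    }
    return t
}

func main() {
    pt := newPeerTimeout(800 * time.Millisecond) // slow Tor peer
    pt.observe(1200 * time.Millisecond)          // a block took longer than the raw ping
    fmt.Println("next request timeout:", pt.current())
}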

Perhaps what we also need is a callback for when the request times out and we’re unable to get a response? That would then allow us to remove it from the set of in-flight requests.

IIUC, upon an error, we’ll get a send on the errChan, but currently we just return early, whereas we could update the set of inflight requests there: https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/bitcoind_conn.go#L483-L490

The error should come from here once the max number of retries is exceeded: https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L341-L358
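
A hedged sketch of that cleanup path, with toy channels and a hypothetical cancel callback standing in for "remove this hash from the in-flight set" (none of these names exist in btcwallet):

package main

import (
    "errors"
    "fmt"
    "time"
)

var errMaxRetries = errors.New("query failed after max retries")

// getBlock waits for either the block or a terminal error from the query
// layer. On error (or a local timeout) it invokes cancel so the dispatcher
// forgets the request instead of leaving a stale in-flight entry behind.
func getBlock(blockCh <-chan string, errCh <-chan error, cancel func()) (string, error) {
    select {
    case blk := <-blockCh:
        return blk, nil
    case err := <-errCh:
        cancel() // the missing step: drop the hash from the in-flight set
        return "", err
    case <-time.After(5 * time.Second):
        cancel()
        return "", errors.New("timed out waiting for block")
    }
}

func main() {
    inflight := map[string]bool{"blockA": true}

    errCh := make(chan error, 1)
    errCh <- errMaxRetries // simulate the workmanager exhausting its retries

    _, err := getBlock(make(chan string), errCh, func() { delete(inflight, "blockA") })
    fmt.Println(err, "/ in-flight entries left:", len(inflight))
}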

So I think we should sort the peer connections by outbound/inbound, try all the outbound peers first and only then try the inbound Tor peers. I am not sure whether we can fetch the listening port via a bitcoind RPC command; I searched but could not find anything.

I think we can use the getnodeaddresses command, which fetches from bitcoind’s address manager. I think anything in the address manager should have the listening port rather than the randomly assigned TCP port.

EDIT: oops, I realize we already do this. But I think we can use this over the getpeerinfo call.
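
For reference, a hedged sketch of calling getnodeaddresses through btcd's rpcclient (host and credentials are placeholders, and RawRequest is used since I am not aware of a typed wrapper for this call):

package main

import (
    "encoding/json"
    "fmt"
    "log"

    "github.com/btcsuite/btcd/rpcclient"
)

// nodeAddress mirrors the fields of interest from bitcoind's documented
// getnodeaddresses output.
type nodeAddress struct {
    Address string `json:"address"`
    Port    uint16 `json:"port"`
    Network string `json:"network"`
}

func main() {
    client, err := rpcclient.New(&rpcclient.ConnConfig{
        Host:         "127.0.0.1:8332", // placeholder
        User:         "rpcuser",        // placeholder
        Pass:         "rpcpass",        // placeholder
        HTTPPostMode: true,
        DisableTLS:   true,
    }, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Shutdown()

    // Ask for up to 50 known addresses; these come from the address
    // manager and should carry the peers' listening ports.
    count, _ := json.Marshal(50)
    raw, err := client.RawRequest("getnodeaddresses", []json.RawMessage{count})
    if err != nil {
        log.Fatal(err)
    }

    var addrs []nodeAddress
    if err := json.Unmarshal(raw, &addrs); err != nil {
        log.Fatal(err)
    }
    for _, a := range addrs {
        fmt.Printf("%s:%d (%s)\n", a.Address, a.Port, a.Network)
    }
}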

The crux of the issue seems to be that if we have slow peers (e.g. tor) or peers without the block, we don’t attempt to rotate out the connections in favor of new ones. We do call Punish but that only affects the ordering of work queue items and doesn’t cause a disconnect.

This was also my hunch based on some other reports where neutrino nodes (connected over Tor) would seemingly never report being “synced to chain”. Ultimately, we should remove that hard-coded 2-second timeout and instead make the timeout dynamic, based on the ping time to the node (compute the full RTT under ideal conditions) and on any changes in latency throughout the lifetime of the connection (Tor can have a lot of latency jitter at times, in my experience).

Another relevant route is modifying the default peer-ranking algorithm altogether. As mentioned above, it only re-orders the set of peers, but all peers will still be dispatched jobs more or less evenly. What we really want is a more dynamic ranking (taking background pinging into account) that actually biases towards faster peers over time. Additionally, rather than sorting, we should sample from the set of peers, which may result in some peers getting more work than others, whereas right now the work will always be evenly distributed. A sampling sketch follows below.
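
A small sketch of the sampling idea (illustrative names only): draw the next worker with probability inversely proportional to its observed latency, so faster peers naturally pick up more of the work over time:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

type rankedPeer struct {
    addr   string
    avgRTT time.Duration // maintained from background pings and responses
}

// samplePeer picks a peer with weight 1/avgRTT, so low-latency peers are
// drawn more often but slow peers are never starved completely.
func samplePeer(peers []rankedPeer) rankedPeer {
    var total float64
    weights := make([]float64, len(peers))
    for i, p := range peers {
        weights[i] = 1.0 / p.avgRTT.Seconds()
        total += weights[i]
    }
    r := rand.Float64() * total
    for i, w := range weights {
        if r < w {
            return peers[i]
        }
        r -= w
    }
    return peers[len(peers)-1]
}

func main() {
    peers := []rankedPeer{
        {"fastclearnetpeer:8333", 80 * time.Millisecond},
        {"slowonionpeer.onion:8333", 900 * time.Millisecond},
    }
    counts := map[string]int{}
    for i := 0; i < 1000; i++ {
        counts[samplePeer(peers).addr]++
    }
    fmt.Println(counts) // the fast peer should win the large majority of draws
}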

One other thing that would help with this situation is implementing proper work stealing. This would allow faster peers to automatically move work from a slower peer into their own queue.

Yeah, this is a bit confusing, but as soon as you run in pruned mode you use the neutrino codebase to fetch blocks from peers. (https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L192-L208)

You would need to set LNWL=debug to see the new logs.

I just added the changes from #8267 and restarted my node (I’m not using neutrino, though?).

So I think we have some problems in the neutrino pkg. I increased the log verbosity for the mentioned case and found a block that gets rescheduled over 10 times, although the maximum retry count should not be greater than 2.

https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L345

Here are the retries:

lnd.759_split_01.log:2023-12-09 16:54:45.057 [TRC] LNWL: Worker panxv3zq3bcp2dcbt2j4f2igbjiswbp7pdpdaofdk5mkjux63ptrssid.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:55:17.061 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:55:49.076 [TRC] LNWL: Worker panxv3zq3bcp2dcbt2j4f2igbjiswbp7pdpdaofdk5mkjux63ptrssid.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:56:21.082 [TRC] LNWL: Worker lxcsrowdge7vxo5k46fb22libqwumvkpgtkt332jtf6myaqs3d6keead.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:56:53.083 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:57:25.084 [TRC] LNWL: Worker h5qvrvkdq5wvuywuxb36toph3i7dxl4tk767w7f6hrbuvo3sqqj2xqqd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:57:57.088 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:58:29.088 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:59:01.089 [TRC] LNWL: Worker h5qvrvkdq5wvuywuxb36toph3i7dxl4tk767w7f6hrbuvo3sqqj2xqqd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:59:33.090 [TRC] LNWL: Worker panxv3zq3bcp2dcbt2j4f2igbjiswbp7pdpdaofdk5mkjux63ptrssid.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 17:00:05.090 [TRC] LNWL: Worker lxcsrowdge7vxo5k46fb22libqwumvkpgtkt332jtf6myaqs3d6keead.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 17:00:37.094 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 17:01:09.096 [TRC] LNWL: Worker h5qvrvkdq5wvuywuxb36toph3i7dxl4tk767w7f6hrbuvo3sqqj2xqqd.onion:8333 timeout for request *wire.MsgGetData with job index 10

So this explains why we have this stuck GetBlock request, but I am not sure why we are not exiting after 2 retries. It seems like the only way we would keep retrying is if the batch reference is somehow nil: https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L334-L385
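
To spell out that suspicion, here is a toy reconstruction (not the actual neutrino workmanager code) where the retry budget lives on the batch: with a live batch the budget is enforced, while a nil batch would keep a job requeueing forever; the guard at the top of reschedule shows the kind of fix being discussed:

package main

import "fmt"

type batch struct {
    maxRetries int
}

type job struct {
    index   int
    retries int
}

// reschedule mimics the decision point: with a live batch the retry budget
// is enforced; if the batch has been evicted (nil), the old code path would
// keep requeueing, so the guard treats a missing batch as a terminal failure.
func reschedule(j *job, b *batch) bool {
    if b == nil {
        // Proposed guard: the batch timer fired and removed the batch
        // from currentBatches, so fail the query instead of retrying.
        return false
    }
    if j.retries >= b.maxRetries {
        return false
    }
    j.retries++
    return true
}

func main() {
    j := &job{index: 10}
    fmt.Println("retry with live batch:", reschedule(j, &batch{maxRetries: 2}))
    fmt.Println("retry with evicted batch:", reschedule(j, nil))
}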

I opened a PR which adds some extra logging that might be helpful in tracking down the issue; maybe you could run it too, very much appreciated.

https://github.com/lightningnetwork/lnd/pull/8267

I’m now also running curl -s http://127.0.0.1:9736/debug/pprof/goroutine?debug=2 | grep -e "GetBlock(\|WaitForDependants(" -B 1 once a minute; so far nothing has turned up.

I added logging for the blockcache stuff, maybe that helps once it starts going downhill again.

curl http://127.0.0.1:9736/debug/pprof/goroutine?debug=2 attached:

goroutine.txt

@guggero I think I’m already running 0.17.3 of sorts (I manually added all relevant commits to my messy code base).