lnd: [bug]: LND frequently loses sync to chain with pruned backend
Background
I run LND compiled from source on an Ubuntu VPS. I use Bitcoin Core as my backend, pruned to 100GB (prune=100000). I've noticed on various occasions that LND loses its sync to chain (`"synced_to_chain": false`) and does not recover by itself. A restart always fixes the issue promptly, although in some cases LND cannot be gracefully shut down. Several peers of mine running a very similar setup see the same issue. The issue seems to be related to pruning the node: the more aggressively the node is pruned, the more often LND loses sync to the chain.
Your environment
- lnd: 0.17.1-beta commit=lightning-terminal-v0.12.1-alpha-dirty
- OS: Linux 5.4.0-167-generic
- Backend: Bitcoin Core, version 240100
- prune=100000
- lnd.tor.active=true
- lnd.tor.skip-proxy-for-clearnet-targets=1
Steps to reproduce
Expected behaviour
I expect LND to stay synced to chain as long as the backend is synced to chain.
Actual behaviour
LND occasionally loses sync to chain and does not recover while Bitcoin Core is running and synced.
I have about 1.5 hours of logs and a Go profile attached below.
Ok I was able to track down the issue, will provide a fix shortly.
So the problem is here: when a block is not available on our local backend (because we already pruned it), we try to fetch it from other nodes here:
https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/bitcoind_conn.go#L474
We create a new block request here:
https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L504-L519
Here comes the first problem: the 2-second timeout (and 4 seconds for the second try) is too short. That is the root cause of the many timeouts we see, because the block is often delivered only a couple of seconds later.
When we hit two timeouts, we never delete the block from `d.blocksQueried`; that only happens when we receive the block in time: https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L650
So the problem arises when the same block is requested a second time (this might take a while to occur): we then think there is already a request in flight in the dispatcher, see here:
https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L544-L553
So we never launch a new request and the `GetBlock` call hangs forever, eventually blocking the ValidationBarrier. I think there are several things which can be fixed immediately; I will come up with a plan tomorrow.
EDIT: Huge thanks to @M1ch43lV for running the draft PR to extract the logs which led to the finding of this bug 🙏
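To illustrate the failure mode in isolation, here is a minimal Go sketch; the names (`dispatcher`, `blocksQueried`, `GetBlock`) are simplified stand-ins for the real types in pruned_block_dispatcher.go, not the actual btcwallet code. The point is that a timed-out request is never removed from the in-flight map, so a later request for the same block assumes a query is already running and just waits.

```go
// Minimal sketch of the failure mode described above, not the actual
// btcwallet code.
package main

import (
	"errors"
	"fmt"
	"time"
)

type dispatcher struct {
	// blocksQueried tracks in-flight requests: block hash -> waiters.
	blocksQueried map[string][]chan struct{}
}

// GetBlock registers interest in a block. A network query is only launched
// for the *first* request; later requests just wait on the existing entry.
// The real GetBlock has no caller-side timeout and would block forever; the
// timeout here only keeps the example terminating.
func (d *dispatcher) GetBlock(hash string, timeout time.Duration) error {
	ch := make(chan struct{}, 1)

	_, alreadyInFlight := d.blocksQueried[hash]
	d.blocksQueried[hash] = append(d.blocksQueried[hash], ch)

	if !alreadyInFlight {
		// Simulate a query whose peers never answer. Crucially, the entry
		// is NOT removed from blocksQueried on failure -- mirroring the
		// bug: deletion only happens on a successful delivery.
		go func() {}()
	}

	select {
	case <-ch:
		delete(d.blocksQueried, hash) // only cleaned up on success
		return nil
	case <-time.After(timeout):
		// BUG: we give up here but leave the stale entry behind.
		return errors.New("timeout fetching block")
	}
}

func main() {
	d := &dispatcher{blocksQueried: make(map[string][]chan struct{})}

	// First request times out and leaves a stale entry behind.
	fmt.Println("first request:", d.GetBlock("deadbeef", 100*time.Millisecond))

	// Second request for the same block sees the stale entry, assumes a
	// query is already in flight, and never triggers a new network request.
	fmt.Println("second request:", d.GetBlock("deadbeef", 100*time.Millisecond))
	fmt.Println("stale waiters for deadbeef:", len(d.blocksQueried["deadbeef"]))
}
```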
Moreover, while trying to reproduce the bug, I see that we are favouring inbound Tor connections over inbound clearnet connections (https://github.com/btcsuite/btcwallet/blob/master/chain/pruned_block_dispatcher.go#L281).
We poll and connect to peers using the `getpeerinfo` command and then use the `addr` field of its response. Now comes the problem: for inbound clearnet connections, bitcoind will list the outgoing (ephemeral) port of the connection:
Don’t be confused about the bind addr (I am running behind a vpn).
But as we can see, the `addr` field lists the outgoing port of the connection. When we try to connect to that address we will obviously fail, because it is not the listening port of the peer. So we will neglect every inbound clearnet peer and will only be able to connect to outbound peers or inbound Tor peers. That explains why the example here had only Tor peers during the retries of the block fetching. Normally bitcoind has only 8 outbound peers and, depending on the settings, many more inbound peers, so the probability that we end up with Tor-only nodes is very high.

So I think we should sort the peer connections by outbound/inbound, first try all the outbound peers, and only after that try inbound Tor peers. I am not sure whether we can fetch the listening port via a bitcoind RPC command; I searched but could not find anything.
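Something along these lines for the prioritisation, sketched with a simplified peer struct rather than the real btcd/btcwallet types (the `Inbound` flag corresponds to the `inbound` field of `getpeerinfo`; the onion/clearnet detection is intentionally naive):

```go
// Rough sketch: prefer outbound peers (whose getpeerinfo "addr" carries a
// real listening port), drop inbound clearnet peers, and only fall back to
// inbound Tor peers afterwards.
package main

import (
	"fmt"
	"sort"
	"strings"
)

type peerInfo struct {
	Addr    string // "addr" field from getpeerinfo
	Inbound bool   // "inbound" field from getpeerinfo
}

func isTor(addr string) bool {
	host := addr
	if i := strings.LastIndex(addr, ":"); i != -1 {
		host = addr[:i]
	}
	return strings.HasSuffix(host, ".onion")
}

// candidatePeers drops inbound clearnet peers (their "addr" carries the
// ephemeral outgoing port, so dialing it will fail) and orders the rest:
// outbound clearnet, then outbound Tor, then inbound Tor.
func candidatePeers(peers []peerInfo) []peerInfo {
	var out []peerInfo
	for _, p := range peers {
		if p.Inbound && !isTor(p.Addr) {
			continue // ephemeral port, not dialable
		}
		out = append(out, p)
	}
	rank := func(p peerInfo) int {
		switch {
		case !p.Inbound && !isTor(p.Addr):
			return 0
		case !p.Inbound:
			return 1
		default:
			return 2
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return rank(out[i]) < rank(out[j]) })
	return out
}

func main() {
	peers := []peerInfo{
		{Addr: "exampleonionaddressexample.onion:8333", Inbound: true},
		{Addr: "203.0.113.7:53124", Inbound: true}, // ephemeral outgoing port
		{Addr: "198.51.100.2:8333", Inbound: false},
	}
	for _, p := range candidatePeers(peers) {
		fmt.Printf("%+v\n", p)
	}
}
```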
I think `batch` is `nil` because the entry gets retried and added to `currentQueries`, but the batch timer is hit, which deletes the batch from `currentBatches`: https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L412-L426. I think this could happen if the node added the batch here https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L449-L455 (starting the batch timer) and had no peers, or peers that were slow to do the version-verack handshake.

The crux of the issue seems to be that if we have slow peers (e.g. Tor) or peers without the block, we don't attempt to rotate out the connections in favor of new ones. We do call `Punish`, but that only affects the ordering of work queue items and doesn't cause a disconnect. While https://github.com/lightningnetwork/lnd/pull/8151 fixed one specific case of a shutdown hang, it seems like there could be other places where `GetBlock` calls block (no pun intended lol), so it wasn't a complete fix.

While we're at it, we could probably also bump up the 2-second timeout, because Tor nodes have higher latency and we don't want to fail getting a block.
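A rough sketch of what rotating a slow peer out could look like; `peerTracker` and its threshold are hypothetical helpers for illustration, not part of the neutrino work manager:

```go
// Instead of only re-ranking a slow peer (what Punish does today), actually
// disconnect it after a number of consecutive timeouts so a fresh
// connection can take its place.
package main

import "fmt"

type peerTracker struct {
	consecutiveTimeouts map[string]int
	maxTimeouts         int
	disconnect          func(peer string) // e.g. drop the conn and dial a new peer
}

// onResult is called after every query attempt for a peer.
func (t *peerTracker) onResult(peer string, timedOut bool) {
	if !timedOut {
		t.consecutiveTimeouts[peer] = 0
		return
	}
	t.consecutiveTimeouts[peer]++
	if t.consecutiveTimeouts[peer] >= t.maxTimeouts {
		// Rotate the peer out instead of merely de-prioritising it.
		t.disconnect(peer)
		delete(t.consecutiveTimeouts, peer)
	}
}

func main() {
	t := &peerTracker{
		consecutiveTimeouts: make(map[string]int),
		maxTimeouts:         3,
		disconnect: func(peer string) {
			fmt.Println("rotating out slow peer:", peer)
		},
	}
	for i := 0; i < 3; i++ {
		t.onResult("slowpeer.onion:8333", true)
	}
}
```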
Analysing this a bit more, I think I found the problem:

The ChannelRouter initializes the Barrier with `4 * NumberOfCores` slots; for the tunnelsats node `nproc` = 4, so that's 16: https://github.com/lightningnetwork/lnd/blob/master/routing/router.go#L1145

The pprof dump from the time the problem happens counts exactly 16 stuck goroutines using up the 16 slots, hence we are somehow in a deadlock in terms of advancing the block height for the router:
The stuck goroutines: 2 are here:

and 14 are on `WaitForDependants`.

So I am not sure why the `GetBlock` call in the first cited goroutine hangs. This might be pruned-specific, but I think this behaviour can also happen without a pruned node in place? Wdyt @Crypt-iQ?
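For intuition, here is a toy Go illustration of the deadlock shape (not the real ValidationBarrier code): a semaphore sized at 4 * NumCPU whose slots are all held by goroutines that never return, so no further work can be admitted.

```go
// Toy illustration of the slot exhaustion: 4*NumCPU slots (16 on a 4-core
// box) held by goroutines stuck on a never-returning block fetch.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	slots := 4 * runtime.NumCPU()
	sem := make(chan struct{}, slots)

	// Each iteration acquires a slot and starts a job that hangs forever
	// on a stuck GetBlock call, so its slot is never released.
	for i := 0; i < slots; i++ {
		sem <- struct{}{}
		go func() { select {} }() // hung goroutine, slot never freed
	}

	fmt.Printf("all %d slots held by stuck goroutines\n", slots)

	// The next job can never acquire a slot: in the real code this is the
	// point where block processing stops advancing (the 2 stuck GetBlock
	// callers plus the 14 goroutines waiting in WaitForDependants).
	select {
	case sem <- struct{}{}:
		fmt.Println("unreachable: got a slot")
	default:
		fmt.Println("no slots free, new block processing is stuck")
	}
}
```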
I mentioned an approach to dynamically tune the timeout here, which shouldn't be too involved: https://github.com/lightninglabs/neutrino/issues/292
As a starting point, we can use the reported ping latency from the `getpeerinfo` call. Then we can apply an RTT multiplier on top of a processing multiplier. Each time we get a response, we use that to tune what the actual timeout should be (some sort of weighted moving average).

Perhaps what we also need is a callback for when the request times out and we're unable to get a response? That would then allow us to remove it from the set of in-flight requests.
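A minimal sketch of that idea; the smoothing factor, multiplier, and floor below are illustrative placeholders, not an existing neutrino API:

```go
// Seed the per-peer timeout from the ping time reported by getpeerinfo,
// then keep an exponentially weighted moving average of observed response
// times and derive the timeout from it with a multiplier for processing
// headroom.
package main

import (
	"fmt"
	"time"
)

type timeoutEstimator struct {
	ewmaRTT    time.Duration
	alpha      float64       // smoothing factor for the moving average
	multiplier float64       // headroom on top of the raw RTT
	minTimeout time.Duration // never go below this
}

func newTimeoutEstimator(initialPing time.Duration) *timeoutEstimator {
	return &timeoutEstimator{
		ewmaRTT:    initialPing,
		alpha:      0.2,
		multiplier: 4.0,
		minTimeout: 2 * time.Second,
	}
}

// observe folds a measured response time into the moving average.
func (e *timeoutEstimator) observe(rtt time.Duration) {
	e.ewmaRTT = time.Duration(e.alpha*float64(rtt) + (1-e.alpha)*float64(e.ewmaRTT))
}

// timeout is what the next block request for this peer should use.
func (e *timeoutEstimator) timeout() time.Duration {
	t := time.Duration(e.multiplier * float64(e.ewmaRTT))
	if t < e.minTimeout {
		t = e.minTimeout
	}
	return t
}

func main() {
	// Tor peer with ~800ms ping reported by getpeerinfo.
	est := newTimeoutEstimator(800 * time.Millisecond)
	fmt.Println("initial timeout:", est.timeout())

	// A couple of slow block deliveries push the timeout up instead of
	// causing repeated failures at a hard-coded 2 seconds.
	est.observe(3 * time.Second)
	est.observe(4 * time.Second)
	fmt.Println("adapted timeout:", est.timeout())
}
```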
IIUC, upon an error we'll get a send on the `errChan`, but currently we just return early, whereas we could update the set of in-flight requests there: https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/bitcoind_conn.go#L483-L490

The error should come from here once the max number of retries is exceeded: https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L341-L358
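Sketched out, the call-site change could look roughly like this; the types are simplified placeholders for the ones in bitcoind_conn.go, not the actual code:

```go
// When the query's error channel fires (e.g. after the work manager
// exhausts its retries), remove the block from the in-flight set instead of
// returning early, so a later GetBlock for the same hash launches a fresh
// request.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type prunedDispatcher struct {
	mu            sync.Mutex
	blocksQueried map[string][]chan struct{}
}

func (d *prunedDispatcher) awaitBlock(hash string, blockChan chan struct{},
	errChan chan error) error {

	select {
	case <-blockChan:
		return nil
	case err := <-errChan:
		// Proposed change: clean up the in-flight entry on failure so the
		// next request for this block is not silently deduplicated.
		d.mu.Lock()
		delete(d.blocksQueried, hash)
		d.mu.Unlock()
		return err
	}
}

func main() {
	d := &prunedDispatcher{blocksQueried: make(map[string][]chan struct{})}
	hash := "deadbeef"
	d.blocksQueried[hash] = []chan struct{}{make(chan struct{})}

	errChan := make(chan error, 1)
	errChan <- errors.New("query failed after max retries")

	fmt.Println("awaitBlock:", d.awaitBlock(hash, make(chan struct{}), errChan))
	_, stillInFlight := d.blocksQueried[hash]
	fmt.Println("still marked in flight:", stillInFlight)
}
```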
I think we can use the `getnodeaddresses` command, which fetches from bitcoind's address manager. Anything in the address manager should have the listening port rather than the randomly assigned TCP port.

EDIT: oops, I realize we already do this. But I think we can use it over the `getpeerinfo` call.
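For reference, a sketch of what querying `getnodeaddresses` looks like through the btcd rpcclient's `RawRequest` (RPC credentials are placeholders and error handling is minimal); the point is that these addresses carry the peer's advertised listening port:

```go
// Query bitcoind's address manager via getnodeaddresses and print the
// returned address/port pairs.
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/btcsuite/btcd/rpcclient"
)

type nodeAddress struct {
	Address string `json:"address"`
	Port    int    `json:"port"`
}

func main() {
	client, err := rpcclient.New(&rpcclient.ConnConfig{
		Host:         "127.0.0.1:8332",
		User:         "rpcuser",
		Pass:         "rpcpass",
		HTTPPostMode: true,
		DisableTLS:   true,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Shutdown()

	// Ask the address manager for a handful of candidate addresses.
	count, _ := json.Marshal(8)
	resp, err := client.RawRequest("getnodeaddresses", []json.RawMessage{count})
	if err != nil {
		log.Fatal(err)
	}

	var addrs []nodeAddress
	if err := json.Unmarshal(resp, &addrs); err != nil {
		log.Fatal(err)
	}
	for _, a := range addrs {
		// These carry the advertised listening port, so dialing them has a
		// realistic chance of succeeding, unlike an inbound peer's
		// ephemeral port from getpeerinfo.
		fmt.Printf("%s:%d\n", a.Address, a.Port)
	}
}
```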
This was also my hunch based on some other reports where neutrino nodes (connected over Tor) would seemingly never report being "synced to chain". Ultimately, we should remove that hard-coded 2-second timeout and instead have the timeout be dynamic, based on the ping time to the node (computing the full RTT in ideal conditions) and on any changes in latency throughout the lifetime of the connection (Tor can have a lot of latency jitter at times in my experience).
Another relevant route is modifying the default peer ranking algorithm altogether. As mentioned above, it only re-orders the set of peers, but all peers will still be dispatched the set of jobs more or less evenly. What we really want is a more dynamic ranking (taking into account background pinging) that will actually bias towards faster peers over time. Additionally, rather than sort, we should instead sample from the set of peers, which may result in some peers gaining more work than others, whereas right now the work will always be evenly distributed.
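A small sketch of the sampling idea, weighting peers by the inverse of their observed latency (the 1/latency weighting is just one possible choice, not anything that exists in neutrino today):

```go
// Sample peers with probability proportional to 1/latency instead of
// handing out work round-robin, so faster peers pick up more jobs over time.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type rankedPeer struct {
	addr    string
	latency time.Duration // e.g. a moving average of recent response times
}

// samplePeer draws one peer with probability proportional to 1/latency.
func samplePeer(peers []rankedPeer, rng *rand.Rand) rankedPeer {
	weights := make([]float64, len(peers))
	var total float64
	for i, p := range peers {
		weights[i] = 1.0 / p.latency.Seconds()
		total += weights[i]
	}
	r := rng.Float64() * total
	for i, w := range weights {
		if r < w {
			return peers[i]
		}
		r -= w
	}
	return peers[len(peers)-1]
}

func main() {
	peers := []rankedPeer{
		{addr: "fast-clearnet:8333", latency: 200 * time.Millisecond},
		{addr: "slow-tor.onion:8333", latency: 2 * time.Second},
	}
	rng := rand.New(rand.NewSource(1))

	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[samplePeer(peers, rng).addr]++
	}
	// The fast peer ends up with roughly 10x the work of the slow one.
	fmt.Println(counts)
}
```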
One other thing that would help with this situation is implementing proper work stealing. This would allow faster peers to automatically move work from a slower peer into their own queue.
Yeah, this is a bit confusing, but as soon as you run in pruned mode you use the neutrino codebase to fetch blocks from peers (https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L192-L208).

You would need to set the LNWL subsystem to debug (LNWL=debug) to see the new logs.
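For reference, that is a per-subsystem debug level in the lnd configuration, something like the following line in lnd.conf (prefix it with lnd. when the options are set through litd, as in the setup above):
debuglevel=info,LNWL=debug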
I just added the changes from #8267 and restarted my node (I’m not using neutrino, though?).
So I think we have some problems in the neutrino pkg. I increased the logging for the mentioned case and found a block which is rescheduled over 10 times, although the maximum retry count should not be greater than 2:
https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L345
So this explains why we have this stuck `GetBlock` request, but I am not sure why we are not exiting after 2 retries. It seems like the only way we would keep retrying is if the batch reference is somehow `nil`: https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L334-L385

I added some logs in a PR which might be helpful in tracking down the issue; maybe you could run it too, very much appreciated:
https://github.com/lightningnetwork/lnd/pull/8267
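A toy model of that suspicion: if the batch has already been removed from `currentBatches`, a nil lookup lets the job sail past the retry limit. The guard at the end shows one possible fix; none of these names are the actual workmanager.go code.

```go
// The retry bookkeeping lives on the batch, so once the batch is gone the
// existing check never triggers and the job keeps being rescheduled.
package main

import "fmt"

type batch struct {
	maxRetries int
}

func main() {
	const maxRetries = 2

	// The job's batch has already been deleted by the batch timeout, so
	// every lookup below returns nil.
	currentBatches := map[int]*batch{}
	jobBatchID := 7
	tries := 0

	for attempt := 0; attempt < 12; attempt++ {
		tries++
		b := currentBatches[jobBatchID]

		// Existing check: only enforced while the batch still exists.
		if b != nil && tries > b.maxRetries {
			fmt.Println("giving up after", tries, "tries")
			return
		}

		// Suggested guard: if the batch is gone, fail the job instead of
		// rescheduling it indefinitely.
		if b == nil && tries > maxRetries {
			fmt.Println("batch gone, failing job after", tries, "tries")
			return
		}

		fmt.Println("rescheduling, attempt", tries)
	}
}
```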
I’m now also running
curl -s http://127.0.0.1:9736/debug/pprof/goroutine?debug=2 | grep -e "GetBlock(\|WaitForDependants(" -B 1
once a minute, so far nothing turned up.I added logging for the
blockcache
stuff, maybe that helps once it starts going downhill again.curl http://127.0.0.1:9736/debug/pprof/goroutine?debug=2
attached:goroutine.txt
@guggero I think I’m already running 0.17.3 of sorts (I manually added all relevant commits to my messy code base).