lnd: [bug]: LND can't connect to peers/channels, sync to chain is lost
Background
On the RoboSats mainnet node, on start-up LND reaches both synced-to-chain and synced-to-graph and connects to a few peers/channels. Eventually it stalls on its way to connecting to all peers and channels; it might connect to only 2-30 peers/channels out of 140 total channels. The number of active channels then decays down to 0, effectively making the node dead. After some time, `synced_to_chain` goes false.
This has been the systematic behaviour every time LND has been restarted for the last 12 hours.
As RoboSats uses mainly hodl invoices, in a few hours many force closures will be triggered and recovery will be painful. Hopefully we can get the node back online first 😄
The only interesting thing I see in the logs is a bunch of these (when LND loses chain sync?), but I can grep for more if you think I am missing something.
2023-08-24 05:15:28.633 [ERR] BTCN: Can't accept connection: unable to accept connection from 127.0.0.1:45314: write tcp 127.0.0.1:9735->127.0.0.1:45314: write: broken pipe
2023-08-24 05:15:28.959 [ERR] BTCN: Can't accept connection: unable to accept connection from 127.0.0.1:45324: write tcp 127.0.0.1:9735->127.0.0.1:45324: write: broken pipe
2023-08-24 05:15:29.169 [ERR] BTCN: Can't accept connection: unable to accept connection from 127.0.0.1:45336: write tcp 127.0.0.1:9735->127.0.0.1:45336: write: broken pipe
There is one of these every few milliseconds.
I already fully rebooted the machine, to no avail. I also updated docker/docker-compose to the latest stable version.
Your environment
- version of `lnd`: v0.16.4, official Docker image
- which operating system (`uname -a` on *Nix): Dockerized on Linux 5.4.0-156-generic #173-Ubuntu SMP Tue Jul 11 07:25:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- version of `btcd`, `bitcoind`, or other backend: Bitcoin Core 24.0.1
- any other relevant environment details
Steps to reproduce
I will probably be unable to reproduce whatever might have happened to arrive at this state.
Expected behaviour
LND should start and connect to all peers/channels.
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Reactions: 2
- Comments: 53 (19 by maintainers)
Commits related to this issue
- contractcourt: modify the incoming contest resolver to use concurrent queue. In this commit, we modify the incoming contest resolver to use a concurrent queue. This is meant to ensure that the invoice... — committed to Roasbeef/lnd by Roasbeef 9 months ago
Thanks! Would be nice to get a confirmation that the issue is fixed by #8024. A faster filesystem/disks could make a multi-threading issue less likely to occur.
Since we changed the file system to a much faster one where LND is the only service, the hangs of `lncli addinvoice` have not happened again. It has been over a week without a crash of that kind (this is still on v0.17.0-rc2). However, we had 3 `lncli getinfo` hangs in the same week. We are not yet running the fix in #8024. I collected `goroutine?debug=2` while `lncli getinfo` was unresponsive; I could share it in case it is useful. Summarized my investigation and current conclusions here: https://github.com/lightningnetwork/lnd/issues/8023.
@Reckless-Satoshi thanks for that last trace! That was exactly what we needed to get to the bottom of this. I think we have the root cause, and are discussing the best route to resolve the issue (have a few candidates of varying diff size).
@Reckless-Satoshi worth noting that for the blocking/mutex profiles, you only want to enable them when you actually need a profile, then turn them off. Otherwise they add extra load, since unlike the other profiling options they keep collecting samples while enabled.
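To illustrate why that is: in Go these profiles are driven by a sampling rate that stays active until it is reset to zero. The sketch below shows the underlying runtime calls; it is a generic Go illustration, not lnd's actual wiring.

```go
// Generic sketch of why blocking/mutex profiling adds ongoing overhead: the
// runtime keeps sampling until the rate is set back to zero.
package main

import (
	"runtime"
	"time"
)

func main() {
	// Turn sampling on only while a profile is actually needed.
	runtime.SetBlockProfileRate(1)     // record every blocking event
	runtime.SetMutexProfileFraction(1) // record every mutex contention event

	// ... capture /debug/pprof/block and /debug/pprof/mutex during this window ...
	time.Sleep(30 * time.Second)

	// Turn sampling back off so the runtime stops collecting samples.
	runtime.SetBlockProfileRate(0)
	runtime.SetMutexProfileFraction(0)
}
```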
Thank you! By the way, GitHub also allows uploads of text files, so I’m re-posting the full, combined file here again: goroutinedump.txt
The full goroutine dump was very helpful! It looks like we have a deadlock in the HODL invoice system:
I’ll see if I can come up with a fix. If there’s any way for you to temporarily stop using HODL invoices then that should unclog the node. But I assume that’s a main part of your application… So hopefully we can get a patch out soon.
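To give a sense of what this looks like, here is a generic sketch (hypothetical lock names, not lnd's actual code) of the kind of lock-ordering deadlock a full goroutine dump exposes: both goroutines end up parked in `sync.Mutex.Lock`, which is the pattern to look for in `goroutine?debug=2` output.

```go
// Generic illustration of a lock-ordering deadlock: each goroutine holds one
// lock and blocks forever waiting for the other's.
package main

import (
	"sync"
	"time"
)

func main() {
	var invoiceMu, channelMu sync.Mutex // hypothetical lock names

	go func() {
		invoiceMu.Lock()
		defer invoiceMu.Unlock()
		time.Sleep(10 * time.Millisecond)
		channelMu.Lock() // blocks: the other goroutine holds channelMu
		defer channelMu.Unlock()
	}()

	go func() {
		channelMu.Lock()
		defer channelMu.Unlock()
		time.Sleep(10 * time.Millisecond)
		invoiceMu.Lock() // blocks: the other goroutine holds invoiceMu
		defer invoiceMu.Unlock()
	}()

	// By now both goroutines are stuck; a goroutine?debug=2 dump would show
	// both stacks blocked in sync.Mutex.Lock.
	time.Sleep(time.Second)
}
```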
Those stack traces were super helpful! We have a candidate fix here: https://github.com/lightningnetwork/lnd/issues/7928
Sent new logs with Debug verbosity. This time it seems something did fail before any peer was connected.
Edit: sent another run where it connected to 2 channels before chain sync went false.
hieblmi at lightning dot engineering
Hi, thanks for reporting the issue. Did this behavior first occur after you adjusted settings, or did it start without intervention? Would you be able to provide lnd logs that contain the start-up as well as the decay of active peers? You can also mail the logs if you don't want to attach them here.