lnd: [bug]: LND can't connect to peers/channels, sync to chain is lost
Background
On the RoboSats mainnet node, on start-up LND reaches both synced-to-chain and synced-to-graph and connects to a few peers/channels. Eventually it stalls on its way to connecting to all peers and channels; it might connect to only 2-30 peers/channels out of 140 total channels. The number of active channels then decays down to 0, effectively making the node dead. After some time, `synced_to_chain` goes false.
This has been the systematic behaviour every time LND has been restarted for the last 12 hours.
As RoboSats uses mainly hodl invoices, in a few hours many force closures will be triggered and recovery will be painful. Hopefully we can get the node back online first 😄
The only interesting thing I see in the logs is a bunch of these (when LND loses chain sync?), but I can grep for more if you think I am missing something.
2023-08-24 05:15:28.633 [ERR] BTCN: Can't accept connection: unable to accept connection from 127.0.0.1:45314: write tcp 127.0.0.1:9735->127.0.0.1:45314: write: broken pipe
2023-08-24 05:15:28.959 [ERR] BTCN: Can't accept connection: unable to accept connection from 127.0.0.1:45324: write tcp 127.0.0.1:9735->127.0.0.1:45324: write: broken pipe
2023-08-24 05:15:29.169 [ERR] BTCN: Can't accept connection: unable to accept connection from 127.0.0.1:45336: write tcp 127.0.0.1:9735->127.0.0.1:45336: write: broken pipe
There is one of these every few milliseconds.
I already fully rebooted the machine, to no avail. I also updated docker/docker-compose to the latest stable version.
Your environment
- version of `lnd`: v0.16.4, official Docker image
- which operating system (`uname -a` on *Nix): Dockerized on Linux 5.4.0-156-generic #173-Ubuntu SMP Tue Jul 11 07:25:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- version of `btcd`, `bitcoind`, or other backend: Bitcoin Core 24.0.1
- any other relevant environment details
Steps to reproduce
I will probably be unable to reproduce whatever might have happened to arrive at this state.
Expected behaviour
LND should start and connect to all peers/channels.
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Reactions: 2
- Comments: 53 (19 by maintainers)
Commits related to this issue
- contractcourt: modify the incoming contest resolver to use concurrent queue. In this commit, we modify the incoming contest resolver to use a concurrent queue. This is meant to ensure that the invoice... — committed to Roasbeef/lnd by Roasbeef 9 months ago
Thanks! Would be nice to get a confirmation that the issue is fixed by #8024. A faster filesystem/disks could make a multi-threading issue less likely to occur.
Since we changed the file system to a much faster one where LND is the only service, the hangs of `lncli addinvoice` have not happened again. It has been over a week without a crash of that kind (this is still on v0.17.0-rc2). However, we had 3 `lncli getinfo` hangs in the same week. We are not yet running the fix in #8024. I collected `goroutine?debug=2` while `lncli getinfo` was unresponsive; I could share it in case it is useful. Summarized my investigation and current conclusions here: https://github.com/lightningnetwork/lnd/issues/8023.
@Reckless-Satoshi thanks for that last trace! That was exactly what we needed to get to the bottom of this. I think we have the root cause, and are discussing the best route to resolve the issue (have a few candidates of varying diff size).
@Reckless-Satoshi worth noting that for the blocking/mutex profiles, you only want to enable them when you actually need a profile, then turn them off. Otherwise they add extra load, since unlike the other profiling options they keep collecting samples while enabled.
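To illustrate why that is: in Go these profiles are driven by a sampling rate that stays active until it is reset to zero. The sketch below shows the underlying runtime calls; it is a generic Go illustration, not lnd's actual wiring.

```go
// Generic sketch of why blocking/mutex profiling adds ongoing overhead: the
// runtime keeps sampling until the rate is set back to zero.
package main

import (
	"runtime"
	"time"
)

func main() {
	// Turn sampling on only while a profile is actually needed.
	runtime.SetBlockProfileRate(1)     // record every blocking event
	runtime.SetMutexProfileFraction(1) // record every mutex contention event

	// ... capture /debug/pprof/block and /debug/pprof/mutex during this window ...
	time.Sleep(30 * time.Second)

	// Turn sampling back off so the runtime stops collecting samples.
	runtime.SetBlockProfileRate(0)
	runtime.SetMutexProfileFraction(0)
}
```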
Thank you! By the way, GitHub also allows uploads of text files, so I’m re-posting the full, combined file here again: goroutinedump.txt
The full goroutine dump was very helpful! It looks like we have a deadlock in the HODL invoice system:
I’ll see if I can come up with a fix. If there’s any way for you to temporarily stop using HODL invoices then that should unclog the node. But I assume that’s a main part of your application… So hopefully we can get a patch out soon.
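To give a sense of what this looks like, here is a generic sketch (hypothetical lock names, not lnd's actual code) of the kind of lock-ordering deadlock a full goroutine dump exposes: both goroutines end up parked in `sync.Mutex.Lock`, which is the pattern to look for in `goroutine?debug=2` output.

```go
// Generic illustration of a lock-ordering deadlock: each goroutine holds one
// lock and blocks forever waiting for the other's.
package main

import (
	"sync"
	"time"
)

func main() {
	var invoiceMu, channelMu sync.Mutex // hypothetical lock names

	go func() {
		invoiceMu.Lock()
		defer invoiceMu.Unlock()
		time.Sleep(10 * time.Millisecond)
		channelMu.Lock() // blocks: the other goroutine holds channelMu
		defer channelMu.Unlock()
	}()

	go func() {
		channelMu.Lock()
		defer channelMu.Unlock()
		time.Sleep(10 * time.Millisecond)
		invoiceMu.Lock() // blocks: the other goroutine holds invoiceMu
		defer invoiceMu.Unlock()
	}()

	// By now both goroutines are stuck; a goroutine?debug=2 dump would show
	// both stacks blocked in sync.Mutex.Lock.
	time.Sleep(time.Second)
}
```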
Those stack traces were super helpful! We have a candidate fix here: https://github.com/lightningnetwork/lnd/issues/7928
Sent new logs with Debug verbosity. This time it seems something did fail before any peer was connected.
Edit: sent another run where it connected to 2 channels before chain sync went false.
hieblmi at lightning dot engineering
Hi, thanks for reporting the issue. Did this behavior first occur after you adjusted settings, or did it start without intervention? Would you be able to provide lnd logs that contain the start-up as well as the decay of active peers? You can also mail the logs if you don't want to attach them here.