lnd: [bug]: lnd doesn't seem to shut down gracefully on 0.17.0-beta.rc1
Background
I am running lnd 0.17.0-beta.rc1 plus Elle's fix that was merged recently. When trying to restart or shut down LND, my Docker container never stops. I have a 15-minute timeout in the compose file before Docker sends a kill signal. I just get the logs below; the timeout eventually hits and the container gets killed, so LND doesn't shut down gracefully.
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(878811056959193089) got: router shutting down
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(818216971034558465) got: router shutting down
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(878297584961191937) got: router shutting down
2023-08-26 20:26:08.469 [ERR] DISC: Update edge for short_chan_id(771274421704720385) got: router shutting down
2023-08-26 20:26:13.261 [ERR] DISC: Update edge for short_chan_id(865996249021546496) got: router shutting down
2023-08-26 20:26:13.374 [ERR] DISC: Update edge for short_chan_id(875772006717980672) got: router shutting down
Your environment
- version of lnd: 0.17.0-beta.rc1
- operating system: Ubuntu
- backend: bitcoind 24
About this issue
- Original URL
- State: open
- Created 10 months ago
- Comments: 47 (8 by maintainers)
Commits related to this issue
- peer: eliminate circular waiting by calling maybeSendNodeAnn async In this commit, we attempt to fix circular waiting scenario introduced inadvertently when [fixing a race condition scenario](https:/... — committed to Roasbeef/lnd by Roasbeef 10 months ago
- peer: eliminate circular waiting by calling maybeSendNodeAnn async (#7938) In this commit, we attempt to fix circular waiting scenario introduced inadvertently when [fixing a race condition scenari... — committed to lightningnetwork/lnd by Roasbeef 10 months ago
@Roasbeef Here’s another goroutine dump. I think the issue still persists.
godump.zip
I am, but I still think the bug exists
Just tested it, seems to be working well.
@guggero @Crypt-iQ To be clear, the original text of this issue
What I have described is the exact same issue described by the OP. I added a post to this issue here; the resource consumption and logging patterns I outlined were echoed shortly after by @mycelia1 here, followed by a goroutine dump that was reviewed by @guggero here.
I’m not talking seconds, I’m talking 1-2 hours.
The only difference between what I've described and the OP is that I am not forcing a SIGTERM after 15m. It takes about 1-2 hours to shut down a node for a layer-2 solution that demands ~100% uptime from its nodes. So between the stated purpose and risks of said software, the layer-2 problem it aims to solve, and the contents of this thread, it would be difficult to infer anything other than the following:
Blocking a restart while updating non-essential graph data, in a circumstance that can result in monetary loss, is not the intended behavior.
Also, thanks for highlighting that; my eyes completely missed the milestone update. I'll watch out for 0.17.3-beta.
Good catch. This doesn't prevent shutdown, but it is certainly not the right behavior. Apparently, that `Stop` function, when called from `channelAttendant`, just waits indefinitely.

Isn't that also strange? It seems circular: `channelAttendant` calls the `ChannelArbitrator`'s `Stop` method, which waits on the wait group that `channelAttendant` itself is part of.
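To make the circular-wait concern above concrete, here is a minimal, self-contained Go sketch (not lnd code; `arbitrator` and `attendant` are illustrative stand-ins for `ChannelArbitrator` and `channelAttendant`): a goroutine tracked by a wait group calls the component's `Stop`, and `Stop` waits on that same wait group, so neither side can make progress.

```go
// Illustrative only: arbitrator stands in for a component such as
// ChannelArbitrator, attendant for a long-lived goroutine like channelAttendant.
package main

import (
	"fmt"
	"sync"
	"time"
)

type arbitrator struct {
	wg   sync.WaitGroup
	quit chan struct{}
}

func (a *arbitrator) start() {
	a.quit = make(chan struct{})
	a.wg.Add(1)
	go a.attendant()
}

// attendant is tracked by a.wg, yet ends up calling a.stop() itself.
func (a *arbitrator) attendant() {
	defer a.wg.Done()
	// ... some event makes this goroutine decide to stop the component.
	a.stop() // blocks forever: wg.Wait() needs the Done() above to run first.
}

// stop signals shutdown and waits for all tracked goroutines, including
// attendant, to exit.
func (a *arbitrator) stop() {
	close(a.quit)
	a.wg.Wait()
}

func main() {
	a := &arbitrator{}
	a.start()
	time.Sleep(time.Second)
	fmt.Println("attendant is still stuck inside stop() -> wg.Wait()")
}
```

If that is what is happening here, the usual way out is to trigger the stop asynchronously (close a quit channel or spawn `Stop` in a separate goroutine) rather than waiting synchronously from inside the tracked goroutine.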
Noticed something in this log dump that is separate from my comment above: the `ChannelRouter` is trying to `Stop` and is waiting for the wait group counter to go to zero. However, there is an outstanding `GetBlock` call via `fetchFundingTx`. Since you're running a pruned node, it will wait here for the block: https://github.com/btcsuite/btcwallet/blob/9c135429d5b7c5c84b19eaffce9e6a32e2d9d5d0/chain/bitcoind_conn.go#L479. The logs also show that your node didn't have any peers at the time and was waiting here: https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L263. This prevents shutdown until your node can connect to a peer that sends over the block.
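For anyone following along, here is a hedged sketch (assumed structure only, not the actual `ChannelRouter`/`fetchFundingTx` code) of why a single outstanding blocking fetch keeps the wait group counter above zero, and how routing the result through a channel and selecting on a quit channel lets shutdown proceed:

```go
// Illustrative only: fetchBlock stands in for a call like GetBlock against a
// pruned backend with no peers, which can block for a very long time.
package main

import (
	"errors"
	"fmt"
	"time"
)

func fetchBlock() ([]byte, error) {
	time.Sleep(10 * time.Minute) // simulate waiting for a peer to serve the block
	return nil, nil
}

// fetchWithQuit runs the blocking call in a helper goroutine so the caller can
// bail out when quit is closed. The underlying fetch keeps running in the
// background; only the tracked caller stops waiting for it.
func fetchWithQuit(quit chan struct{}) ([]byte, error) {
	type result struct {
		block []byte
		err   error
	}
	resChan := make(chan result, 1)
	go func() {
		block, err := fetchBlock()
		resChan <- result{block, err}
	}()

	select {
	case res := <-resChan:
		return res.block, res.err
	case <-quit:
		return nil, errors.New("shutting down")
	}
}

func main() {
	quit := make(chan struct{})
	go func() {
		time.Sleep(time.Second)
		close(quit) // simulate Stop being called
	}()

	_, err := fetchWithQuit(quit)
	fmt.Println("fetch returned:", err) // returns promptly once quit closes
}
```

The point is only that the tracked goroutine can return once quit closes, so a `wg.Wait()` inside `Stop` would unblock even if the block never arrives.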
Going off of this:
The `ChannelRouter` is waiting for its goroutines to stop. There are two calls to `UpdateEdgePolicy` which are hanging; they are waiting for the `cacheMu` lock (this is obfuscated somewhat: the `cacheMu` mutex is the locker in the `chanScheduler`). All of the calls to `HasChannelEdge` and `MarkEdgeLive` are also waiting for the `cacheMu` lock. There is one call to `MarkEdgeLive` that is not waiting for the `cacheMu` lock.

This goroutine, which was trying to call `MarkEdgeLive` and write the freelist, was preempted (see `runtime.asyncPreempt2`). In this specific scenario, I don't think there's a deadlock, but I don't know why that database transaction isn't happening. We did update bbolt in 0.17, but I'm not sure if that's related. How big is your database, @niteshbalusu11?
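To illustrate the `cacheMu` contention pattern described above, here is a rough, self-contained sketch (the names `slowDBWrite` and the simplified `updateEdgePolicy` are illustrative, not lnd's): one goroutine holds the shared mutex during a slow write, so every other caller shows up in a goroutine dump parked in `sync.(*Mutex).Lock`.

```go
// Illustrative only: cacheMu mirrors the shared mutex seen in the dump; the
// callers and the slow write are simplified stand-ins.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

var cacheMu sync.Mutex

// slowDBWrite emulates a long-running database write performed while holding
// the shared lock.
func slowDBWrite() {
	cacheMu.Lock()
	defer cacheMu.Unlock()
	time.Sleep(2 * time.Second)
}

// updateEdgePolicy emulates any caller that also needs cacheMu.
func updateEdgePolicy(id int) {
	cacheMu.Lock()
	defer cacheMu.Unlock()
	fmt.Println("updated edge", id)
}

func main() {
	go slowDBWrite()
	time.Sleep(100 * time.Millisecond) // let the writer grab the lock first

	for i := 0; i < 3; i++ {
		go updateEdgePolicy(i)
	}
	time.Sleep(100 * time.Millisecond)

	// Dump all goroutine stacks, similar to the profiles attached in this
	// thread: the blocked callers appear parked in sync.(*Mutex).Lock.
	buf := make([]byte, 1<<16)
	n := runtime.Stack(buf, true)
	fmt.Printf("%s\n", buf[:n])

	time.Sleep(3 * time.Second) // let the remaining goroutines finish
}
```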
Hmm, okay, thanks.

So I see this goroutine waiting for a wait group.
Looks like the `readHeader` goroutine didn't shut down, though. It's stuck at `noiseConn.ReadNextHeader()`: https://github.com/lightningnetwork/lnd/blob/eae9dd07f0579cac94d71e8d8b4c7e3408ce85d0/peer/brontide.go#L1146. Which looks like it doesn't have a timeout set at all? But I could be misinterpreting things…
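On the missing-timeout point above: here is a small sketch (using `net.Pipe`, not lnd's brontide/noise connection) showing that a plain `Read` with no deadline blocks until the peer sends data, and that setting a read deadline, or closing the connection, is what unblocks such a reader during shutdown.

```go
// Illustrative only: uses net.Pipe rather than lnd's brontide connection.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	done := make(chan error, 1)
	go func() {
		// Plays the role of the readHeader goroutine: it blocks in Read
		// because the remote side never sends anything.
		buf := make([]byte, 18)
		_, err := server.Read(buf)
		done <- err
	}()

	// A read deadline makes the blocked Read return with a timeout error, so
	// the reading goroutine gets a chance to notice a quit signal.
	server.SetReadDeadline(time.Now().Add(500 * time.Millisecond))

	// Alternatively, a shutdown path can close the connection, which also
	// makes the pending Read return immediately with an error:
	// server.Close()

	fmt.Println("read returned:", <-done)
}
```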
Here are my goroutine logs captured during the repeated “Received ChannelEdgePolicy” sequence: goroutine profile total 429.txt

I compiled master.