lnd: [bug]: Lnd doesn't seem to gracefully shutdown on 0.17.0-beta.rc1
Background
I am running lnd 0.17.0-beta.rc1 plus the fix from Elle that was merged recently. When trying to restart or shut down LND, my docker container never stops. I have a 15-minute timeout in the compose file before docker sends a kill signal; I just get these logs until the timeout eventually hits and the container gets killed, so LND doesn't shut down gracefully.
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(878811056959193089) got: router shutting down
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(818216971034558465) got: router shutting down
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(878297584961191937) got: router shutting down
2023-08-26 20:26:08.469 [ERR] DISC: Update edge for short_chan_id(771274421704720385) got: router shutting down
2023-08-26 20:26:13.261 [ERR] DISC: Update edge for short_chan_id(865996249021546496) got: router shutting down
2023-08-26 20:26:13.374 [ERR] DISC: Update edge for short_chan_id(875772006717980672) got: router shutting down
Your environment
- version of lnd: 0.17.0-beta.rc1
- operating system: ubuntu
- version of bitcoind: 24
About this issue
- Original URL
- State: open
- Created 10 months ago
- Comments: 47 (8 by maintainers)
Commits related to this issue
- peer: eliminate circular waiting by calling maybeSendNodeAnn async In this commit, we attempt to fix circular waiting scenario introduced inadvertently when [fixing a race condition scenario](https:/... — committed to Roasbeef/lnd by Roasbeef 10 months ago
- peer: eliminate circular waiting by calling maybeSendNodeAnn async (#7938) In this commit, we attempt to fix circular waiting scenario introduced inadvertently when [fixing a race condition scenari... — committed to lightningnetwork/lnd by Roasbeef 10 months ago
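The two commits above eliminate the circular wait by calling maybeSendNodeAnn asynchronously instead of synchronously. A minimal sketch of that general pattern, using hypothetical names rather than lnd's actual code: a method that holds a lock must not synchronously call work that needs the same lock, so the work is handed to its own goroutine.

```go
// Hypothetical sketch: a Start method that used to call a helper
// synchronously while holding startMtx. Because the helper itself needs
// startMtx, the synchronous call would deadlock; running the helper in
// its own goroutine breaks the cycle.
package main

import (
	"fmt"
	"sync"
)

type peer struct {
	startMtx sync.Mutex
	wg       sync.WaitGroup
}

// maybeSendNodeAnn stands in for work that must not run while the
// caller still holds startMtx.
func (p *peer) maybeSendNodeAnn() {
	p.startMtx.Lock()
	defer p.startMtx.Unlock()
	fmt.Println("node announcement sent")
}

func (p *peer) Start() {
	p.startMtx.Lock()
	defer p.startMtx.Unlock()

	// Calling p.maybeSendNodeAnn() here directly would self-deadlock,
	// since we already hold startMtx. Dispatching it asynchronously
	// lets Start finish and release the lock first.
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		p.maybeSendNodeAnn()
	}()
}

func main() {
	p := &peer{}
	p.Start()
	p.wg.Wait()
}
```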
@Roasbeef Here’s another goroutine dump. I think the issue still persists.
godump.zip
I am, but I still think the bug exists
Just tested it, seems to be working well.
@guggero @Crypt-iQ To be clear, the original text of this issue
What I have described is the exact same issue described by the OP. I added a post to this issue here; the resource consumption and logging patterns I outlined were echoed shortly after by @mycelia1 here, and followed up with a goroutine dump that was reviewed by @guggero here.
I’m not talking seconds, I’m talking 1-2 hours.
The only difference between what I've described and the OP is that I am not forcing a SIGTERM after 15m. It takes about 1-2 hours to shut down a node for a layer-2 solution that demands ~100% uptime from its nodes. So between the stated purpose and risks of said software, the layer-2 problem it aims to solve, and the contents of this thread, it would be difficult to infer anything other than the following:
Blocking a restart while updating non-essential data in a graph, in a circumstance that can result in monetary loss, is not the intended behavior.
Also, thanks for highlighting the milestone update that my eyes completely missed; I'll watch out for 0.17.3-beta.
Good catch. This doesn't prevent shutdown, but it's certainly not the right behavior. Apparently, that Stop function, when called from channelAttendant, just waits indefinitely.
Isn't that also strange? It seems to be circular, as channelAttendant calls the ChannelArbitrator's Stop method, which is waiting on the wait group that channelAttendant itself is part of.
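For illustration, here is a minimal, self-contained sketch of that circular wait; the names are illustrative, not lnd's actual code.

```go
// A worker goroutine counted in a WaitGroup calls Stop, and Stop waits
// for that same WaitGroup, so neither can ever finish.
package main

import "sync"

type arbitrator struct {
	wg   sync.WaitGroup
	quit chan struct{}
}

// Stop asks all workers to exit and then waits for them.
func (a *arbitrator) Stop() {
	close(a.quit)
	// Blocks until every wg.Done() has run -- including the one owned
	// by the goroutine that is currently inside this very call.
	a.wg.Wait()
}

// channelAttendant is tracked by the same WaitGroup that Stop waits on.
func (a *arbitrator) channelAttendant() {
	defer a.wg.Done()

	// ... event loop ...

	// Calling Stop from inside the tracked goroutine creates the cycle:
	// wg.Wait() cannot return until this function returns, and this
	// function cannot return until wg.Wait() does.
	a.Stop()
}

func main() {
	a := &arbitrator{quit: make(chan struct{})}
	a.wg.Add(1)
	go a.channelAttendant()
	a.wg.Wait() // deadlocks; the Go runtime will report it
}
```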
Noticed something in this log dump that is separate from my comment above:
The ChannelRouter is trying to Stop and is waiting for the wait group counter to go to zero. However, there is an outstanding GetBlock call via fetchFundingTx:
Since you're running a pruned node, it will wait here for the block: https://github.com/btcsuite/btcwallet/blob/9c135429d5b7c5c84b19eaffce9e6a32e2d9d5d0/chain/bitcoind_conn.go#L479
The logs also show that your node didn't have any peers at the time and was waiting here: https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L263
This prevents shutdown until your node can connect to a peer that sends over the block.
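As a general illustration (not lnd's or btcwallet's actual API), the usual way to keep such a blocking fetch from pinning shutdown is to race the fetch against a quit channel:

```go
// Instead of blocking unconditionally on a fetch, the worker selects on
// a quit channel so a shutdown request can interrupt the wait. Names
// here are illustrative only.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errShuttingDown = errors.New("router shutting down")

type block struct{ height int }

// fetchBlock runs the (potentially very slow) fetch in its own goroutine
// and races the result against the quit channel.
func fetchBlock(height int, quit <-chan struct{}) (*block, error) {
	result := make(chan *block, 1)
	go func() {
		// Stand-in for waiting on a peer to deliver the block; on a
		// pruned node with no peers this could take arbitrarily long.
		time.Sleep(10 * time.Second)
		result <- &block{height: height}
	}()

	select {
	case b := <-result:
		return b, nil
	case <-quit:
		return nil, errShuttingDown
	}
}

func main() {
	quit := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(quit) // simulate Stop being called
	}()

	if _, err := fetchBlock(700000, quit); err != nil {
		fmt.Println("fetch aborted:", err)
	}
}
```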
Going off of this:
The ChannelRouter is waiting for its goroutines to stop. There are two calls to UpdateEdgePolicy which are hanging; they are waiting for the cacheMu lock (this is obfuscated somewhat - the cacheMu mutex is the locker in the chanScheduler). All of the calls to HasChannelEdge and MarkEdgeLive are also waiting for the cacheMu lock. There is one call to MarkEdgeLive that is not waiting for the cacheMu lock:
This goroutine that was trying to call MarkEdgeLive and write the freelist was preempted (see: runtime.asyncPreempt2). In this specific scenario, I don't think there's a deadlock, but I don't know why that database transaction isn't happening. We did update bbolt in 17, but I'm not sure if that's related. How big is your database, @niteshbalusu11?
Hmm, okay, thanks.
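For reference, a minimal sketch of the contention pattern described in the dump, with illustrative names: one goroutine holds the cache mutex during a slow database write while the other graph callers queue behind it.

```go
// One goroutine holds cacheMu while doing a slow database write, so
// every other caller queues on the same mutex until the holder returns.
package main

import (
	"fmt"
	"sync"
	"time"
)

type graphCache struct {
	cacheMu sync.Mutex
}

// markEdgeLive simulates the holder: it keeps cacheMu locked for the
// duration of a slow transaction.
func (g *graphCache) markEdgeLive() {
	g.cacheMu.Lock()
	defer g.cacheMu.Unlock()
	time.Sleep(500 * time.Millisecond) // slow freelist/db write
}

// updateEdgePolicy is one of the callers seen blocked in the dump.
func (g *graphCache) updateEdgePolicy(id int) {
	g.cacheMu.Lock()
	defer g.cacheMu.Unlock()
	fmt.Println("updated edge policy", id)
}

func main() {
	g := &graphCache{}
	var wg sync.WaitGroup

	wg.Add(1)
	go func() { defer wg.Done(); g.markEdgeLive() }()

	// These all pile up behind cacheMu until markEdgeLive finishes --
	// the same shape as the goroutines in the dump.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) { defer wg.Done(); g.updateEdgePolicy(id) }(i)
	}
	wg.Wait()
}
```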
So I see this goroutine waiting for a wait group
Looks like the readHeader goroutine didn't shut down though:
It's stuck at noiseConn.ReadNextHeader(): https://github.com/lightningnetwork/lnd/blob/eae9dd07f0579cac94d71e8d8b4c7e3408ce85d0/peer/brontide.go#L1146, which looks like it doesn't have a timeout set at all? But I could be misinterpreting things…
Here are my goroutine logs captured during the repeated “Received ChannelEdgePolicy” sequence. goroutine profile total 429.txt
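As a general sketch (not lnd's actual brontide code), a blocking header read can be made interruptible either by setting a read deadline before each Read or by closing the connection when the quit signal fires:

```go
// readHeaderLoop keeps reading fixed-size headers until quit is closed;
// closing the conn on quit unblocks any in-flight Read.
package main

import (
	"fmt"
	"net"
	"time"
)

func readHeaderLoop(conn net.Conn, quit <-chan struct{}) {
	go func() {
		<-quit
		conn.Close()
	}()

	header := make([]byte, 18) // example header size
	for {
		// A deadline bounds how long a single Read may block even if
		// the peer never sends anything.
		conn.SetReadDeadline(time.Now().Add(30 * time.Second))

		if _, err := conn.Read(header); err != nil {
			fmt.Println("read loop exiting:", err)
			return
		}
		// ... decrypt and process the header ...
	}
}

func main() {
	// Loopback pair just to make the sketch runnable.
	client, server := net.Pipe()
	defer client.Close()

	quit := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(quit)
	}()
	readHeaderLoop(server, quit)
}
```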
I compiled master.