lnd: [bug]: lnd doesn't seem to shut down gracefully on 0.17.0-beta.rc1
Background
I am running lnd 0.17.0-beta.rc1 plus Elle's fix that was merged recently. When trying to restart or shut down LND, my Docker container never stops. I have a 15-minute timeout in the compose file before Docker sends a kill signal. I just get the logs below; the timeout eventually hits and the container gets killed, so LND doesn't shut down gracefully.
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(878811056959193089) got: router shutting down
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(818216971034558465) got: router shutting down
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(878297584961191937) got: router shutting down
2023-08-26 20:26:08.469 [ERR] DISC: Update edge for short_chan_id(771274421704720385) got: router shutting down
2023-08-26 20:26:13.261 [ERR] DISC: Update edge for short_chan_id(865996249021546496) got: router shutting down
2023-08-26 20:26:13.374 [ERR] DISC: Update edge for short_chan_id(875772006717980672) got: router shutting down
Your environment
- version of lnd: 0.17.0-beta.rc1
- operating system: Ubuntu
- backend: bitcoind 24
About this issue
- Original URL
- State: open
- Created 10 months ago
- Comments: 47 (8 by maintainers)
Commits related to this issue
- peer: eliminate circular waiting by calling maybeSendNodeAnn async In this commit, we attempt to fix circular waiting scenario introduced inadvertently when [fixing a race condition scenario](https:/... — committed to Roasbeef/lnd by Roasbeef 10 months ago
- peer: eliminate circular waiting by calling maybeSendNodeAnn async (#7938) In this commit, we attempt to fix circular waiting scenario introduced inadvertently when [fixing a race condition scenari... — committed to lightningnetwork/lnd by Roasbeef 10 months ago
@Roasbeef Here’s another goroutine dump. I think the issue still persists.
godump.zip
I am, but I still think the bug exists
Just tested it, seems to be working well.
@guggero @Crypt-iQ To be clear, the original text of this issue
What I have described is the exact same issue described by the OP. I added a post to this issue here; the resource consumption and logging patterns I outlined were echoed shortly after by @mycelia1 here, followed by a goroutine dump that was reviewed by @guggero here.
I’m not talking seconds, I’m talking 1-2 hours.
The only difference between what I've described and the OP is that I am not forcing a SIGTERM after 15m. It takes about 1-2 hours to shut down a node for a layer-2 solution that demands ~100% uptime from its nodes. So between the stated purpose and risks of said software, the layer-2 problem it aims to solve, and the contents of this thread, it would be difficult to infer anything other than the following:
Blocking a restart while updating non-essential graph data, in a circumstance that can result in monetary loss, is not the intended behavior.
Also, thanks for highlighting that; my eyes completely missed the milestone update. I'll watch out for 0.17.3-beta.
Good catch. This doesn't prevent shutdown, but it is certainly not the right behavior. Apparently, that `Stop` function, when called from `channelAttendant`, just waits indefinitely.

Isn't that also strange? It seems circular: `channelAttendant` calls the `ChannelArbitrator`'s `Stop` method, which waits on the wait group that `channelAttendant` itself is part of.
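To make the circular-wait concern above concrete, here is a minimal, self-contained Go sketch (not lnd code; `arbitrator` and `attendant` are illustrative stand-ins for `ChannelArbitrator` and `channelAttendant`): a goroutine tracked by a wait group calls the component's `Stop`, and `Stop` waits on that same wait group, so neither side can make progress.

```go
// Illustrative only: arbitrator stands in for a component such as
// ChannelArbitrator, attendant for a long-lived goroutine like channelAttendant.
package main

import (
	"fmt"
	"sync"
	"time"
)

type arbitrator struct {
	wg   sync.WaitGroup
	quit chan struct{}
}

func (a *arbitrator) start() {
	a.quit = make(chan struct{})
	a.wg.Add(1)
	go a.attendant()
}

// attendant is tracked by a.wg, yet ends up calling a.stop() itself.
func (a *arbitrator) attendant() {
	defer a.wg.Done()
	// ... some event makes this goroutine decide to stop the component.
	a.stop() // blocks forever: wg.Wait() needs the Done() above to run first.
}

// stop signals shutdown and waits for all tracked goroutines, including
// attendant, to exit.
func (a *arbitrator) stop() {
	close(a.quit)
	a.wg.Wait()
}

func main() {
	a := &arbitrator{}
	a.start()
	time.Sleep(time.Second)
	fmt.Println("attendant is still stuck inside stop() -> wg.Wait()")
}
```

If that is what is happening here, the usual way out is to trigger the stop asynchronously (close a quit channel or spawn `Stop` in a separate goroutine) rather than waiting synchronously from inside the tracked goroutine.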
Noticed something in this log dump that is separate from my comment above: the `ChannelRouter` is trying to `Stop` and is waiting for the wait group counter to go to zero. However, there is an outstanding `GetBlock` call via `fetchFundingTx`. Since you're running a pruned node, it will wait here for the block: https://github.com/btcsuite/btcwallet/blob/9c135429d5b7c5c84b19eaffce9e6a32e2d9d5d0/chain/bitcoind_conn.go#L479. The logs also show that your node didn't have any peers at the time and was waiting here: https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L263. This prevents shutdown until your node can connect to a peer that sends over the block.
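For anyone following along, here is a hedged sketch (assumed structure only, not the actual `ChannelRouter`/`fetchFundingTx` code) of why a single outstanding blocking fetch keeps the wait group counter above zero, and how routing the result through a channel and selecting on a quit channel lets shutdown proceed:

```go
// Illustrative only: fetchBlock stands in for a call like GetBlock against a
// pruned backend with no peers, which can block for a very long time.
package main

import (
	"errors"
	"fmt"
	"time"
)

func fetchBlock() ([]byte, error) {
	time.Sleep(10 * time.Minute) // simulate waiting for a peer to serve the block
	return nil, nil
}

// fetchWithQuit runs the blocking call in a helper goroutine so the caller can
// bail out when quit is closed. The underlying fetch keeps running in the
// background; only the tracked caller stops waiting for it.
func fetchWithQuit(quit chan struct{}) ([]byte, error) {
	type result struct {
		block []byte
		err   error
	}
	resChan := make(chan result, 1)
	go func() {
		block, err := fetchBlock()
		resChan <- result{block, err}
	}()

	select {
	case res := <-resChan:
		return res.block, res.err
	case <-quit:
		return nil, errors.New("shutting down")
	}
}

func main() {
	quit := make(chan struct{})
	go func() {
		time.Sleep(time.Second)
		close(quit) // simulate Stop being called
	}()

	_, err := fetchWithQuit(quit)
	fmt.Println("fetch returned:", err) // returns promptly once quit closes
}
```

The point is only that the tracked goroutine can return once quit closes, so a `wg.Wait()` inside `Stop` would unblock even if the block never arrives.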
Going off of this:
The `ChannelRouter` is waiting for its goroutines to stop. There are two calls to `UpdateEdgePolicy` which are hanging; they are waiting for the `cacheMu` lock (this is obfuscated somewhat: the `cacheMu` mutex is the locker in the `chanScheduler`). All of the calls to `HasChannelEdge` and `MarkEdgeLive` are also waiting for the `cacheMu` lock. There is one call to `MarkEdgeLive` that is not waiting for the `cacheMu` lock.

This goroutine, which was trying to call `MarkEdgeLive` and write the freelist, was preempted (see `runtime.asyncPreempt2`). In this specific scenario, I don't think there's a deadlock, but I don't know why that database transaction isn't happening. We did update bbolt in 0.17, but I'm not sure if that's related. How big is your database, @niteshbalusu11?
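To illustrate the `cacheMu` contention pattern described above, here is a rough, self-contained sketch (the names `slowDBWrite` and the simplified `updateEdgePolicy` are illustrative, not lnd's): one goroutine holds the shared mutex during a slow write, so every other caller shows up in a goroutine dump parked in `sync.(*Mutex).Lock`.

```go
// Illustrative only: cacheMu mirrors the shared mutex seen in the dump; the
// callers and the slow write are simplified stand-ins.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

var cacheMu sync.Mutex

// slowDBWrite emulates a long-running database write performed while holding
// the shared lock.
func slowDBWrite() {
	cacheMu.Lock()
	defer cacheMu.Unlock()
	time.Sleep(2 * time.Second)
}

// updateEdgePolicy emulates any caller that also needs cacheMu.
func updateEdgePolicy(id int) {
	cacheMu.Lock()
	defer cacheMu.Unlock()
	fmt.Println("updated edge", id)
}

func main() {
	go slowDBWrite()
	time.Sleep(100 * time.Millisecond) // let the writer grab the lock first

	for i := 0; i < 3; i++ {
		go updateEdgePolicy(i)
	}
	time.Sleep(100 * time.Millisecond)

	// Dump all goroutine stacks, similar to the profiles attached in this
	// thread: the blocked callers appear parked in sync.(*Mutex).Lock.
	buf := make([]byte, 1<<16)
	n := runtime.Stack(buf, true)
	fmt.Printf("%s\n", buf[:n])

	time.Sleep(3 * time.Second) // let the remaining goroutines finish
}
```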
Hmm, okay, thanks.

So I see this goroutine waiting for a wait group.
Looks like the `readHeader` goroutine didn't shut down, though. It's stuck at `noiseConn.ReadNextHeader()`: https://github.com/lightningnetwork/lnd/blob/eae9dd07f0579cac94d71e8d8b4c7e3408ce85d0/peer/brontide.go#L1146. Which looks like it doesn't have a timeout set at all? But I could be misinterpreting things…
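On the missing-timeout point above: here is a small sketch (using `net.Pipe`, not lnd's brontide/noise connection) showing that a plain `Read` with no deadline blocks until the peer sends data, and that setting a read deadline, or closing the connection, is what unblocks such a reader during shutdown.

```go
// Illustrative only: uses net.Pipe rather than lnd's brontide connection.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	done := make(chan error, 1)
	go func() {
		// Plays the role of the readHeader goroutine: it blocks in Read
		// because the remote side never sends anything.
		buf := make([]byte, 18)
		_, err := server.Read(buf)
		done <- err
	}()

	// A read deadline makes the blocked Read return with a timeout error, so
	// the reading goroutine gets a chance to notice a quit signal.
	server.SetReadDeadline(time.Now().Add(500 * time.Millisecond))

	// Alternatively, a shutdown path can close the connection, which also
	// makes the pending Read return immediately with an error:
	// server.Close()

	fmt.Println("read returned:", <-done)
}
```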
Here are my goroutine logs captured during the repeated “Received ChannelEdgePolicy” sequence: goroutine profile total 429.txt

I compiled master.