lnd: [bug]: lncli getinfo and LND in general, getting stuck at COMMIT when using Postgres

There are no errors in the logs. Other commands like walletbalance and getnetworkinfo return (though they take a few seconds), but getinfo and listpeers (and maybe others) take forever to return. I am using 0.17 RC3 on a Raspberry Pi 4 8GB. The CPU is at 1% or less most of the time, with some spikes at 40-70% from either LND or Bitcoind.

Logs: https://pastebin.com/W10vaKgF


LND: lncli version 0.17.0-beta.rc3 commit=v0.17.0-beta.rc3-19-g9f4a8836d

OS: Linux raspberrypi 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux

Examples:

pi@raspberrypi:~ $ time ll walletbalance
{
    "total_balance": "33268",
    "confirmed_balance": "33268",
    "unconfirmed_balance": "0",
    "locked_balance": "0",
    "reserved_balance_anchor_chan": "0",
    "account_balance": {
        "default": {
            "confirmed_balance": "33268",
            "unconfirmed_balance": "0"
        }
    }
}

real	0m0.678s
user	0m0.102s
sys	0m0.073s

$ time ll getinfo
^C[lncli] rpc error: code = Canceled desc = context canceled

real	3m5.953s
user	0m0.087s
sys	0m0.095s

(I pressed Ctrl+C to make it stop)

$ time ll listchannels
^C[lncli] rpc error: code = Canceled desc = context canceled

real	3m28.391s
user	0m0.083s
sys	0m0.159s

(I pressed Ctrl+C to make it stop)

LND.CONF:

[Application Options]
debuglevel=info
maxpendingchannels=10
alias=*******************
color=#ffffff

rpclisten=0.0.0.0:10009
listen=0.0.0.0:9735
restlisten=0.0.0.0:8001

[Bitcoin]
bitcoin.active=1
bitcoin.testnet=0
bitcoin.mainnet=1
bitcoin.node=bitcoind

; The base fee in millisatoshi we will charge for forwarding payments on our
; channels.
bitcoin.basefee=5

; The fee rate used when forwarding payments on our channels. The total fee
; charged is basefee + (amount * feerate / 1000000), where amount is the
; forwarded amount.
bitcoin.feerate=1

[Bitcoind]
bitcoind.rpchost=bitcoind:8332
bitcoind.rpcuser=*****************************
bitcoind.rpcpass=***********************
bitcoind.zmqpubrawblock=tcp://bitcoind:28332
bitcoind.zmqpubrawtx=tcp://bitcoind:28333

; Fee estimate mode for bitcoind. It must be either "ECONOMICAL" or "CONSERVATIVE".
; If unset, the default value is "CONSERVATIVE".
bitcoind.estimatemode=CONSERVATIVE

[tor]
tor.active=true
tor.v3=true
tor.streamisolation=false
tor.skip-proxy-for-clearnet-targets=false
tor.socks=tor:9050
tor.control=tor:9051
tor.password=*********************
tor.targetipaddress=10.5.0.6

[wtclient]

; Activate Watchtower Client. To get more information or configure watchtowers
; run `lncli wtclient -h`.
wtclient.active=true

; Specify the fee rate with which justice transactions will be signed. This fee
; rate should be chosen as a maximum fee rate one is willing to pay in order to
; sweep funds if a breach occurs while being offline. The fee rate should be
; specified in sat/vbyte.
wtclient.sweep-fee-rate=30

[db]
db.backend=postgres

[postgres]
; Postgres connection string.
; Default:
;   db.postgres.dsn=
; Example:
db.postgres.dsn=postgres://postgres:********************************************@pgsql:5432/lnd?sslmode=disable

; Postgres connection timeout. Valid time units are {s, m, h}. Set to zero to
; disable.
db.postgres.timeout=3m

; Postgres maximum number of connections. Set to zero for unlimited. It is
; recommended to set a limit that is below the server connection limit.
; Otherwise errors may occur in lnd under high-load conditions.
; Default:
db.postgres.maxconnections=10

Bitcoind seems ok:

$ btc getblockchaininfo
{
  "chain": "main",
  "blocks": 808403,
  "headers": 808403,
  "bestblockhash": "0000000000000000000314d5ce9e52d2677c08d3dc8a617690a13897203f53fa",
  "difficulty": 54150142369480,
  "time": 1695107587,
  "mediantime": 1695104573,
  "verificationprogress": 0.9999999672514687,
  "initialblockdownload": false,
  "chainwork": "000000000000000000000000000000000000000056098a6c3da24d6620caa504",
  "size_on_disk": 581069847788,
  "pruned": false,
  "warnings": ""
}

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Reactions: 1
  • Comments: 67 (10 by maintainers)

Most upvoted comments

I changed the OS on my Raspberry Pi 4 from Raspbian 64-bit to Ubuntu Server 23 64-bit.

It seems Raspbian 64-bit uses the 32-bit Docker and armhf architecture when fetching and building apps, instead of arm64.

Changing to Ubuntu 23 64-bit solved this issue, and so far, LND is no longer clogged.

In sum, the Raspberry Pi 4 can't run LND stably in 32-bit mode.

I am starting to think that LND is just not compatible with Docker, no matter the bits.

No, the issue here is that you rely on Docker, which, because of the OS, was installed as 32-bit and therefore could only run 32-bit applications. RaspiBlitz has been using the 64-bit version of lnd for a long time, as RaspiBlitz doesn't use Docker to run things. There are other known issues with lnd running on 32-bit architectures, even before Postgres, and we've been recommending against 32-bit architectures for a while now, though maybe not explicitly enough.

IIUC, the OP is also running Docker in 32-bit mode, which may contribute to some thrashing that can slow things down. As mentioned above, if you're running everything on a single machine, without any replication at all, then sqlite is a better fit for your hardware configuration.

Zooming out: you're seeing Postgres hang on COMMIT. That stall isn't caused by lnd, but lnd needs to wait for the commit to finish before it can proceed. As is, we only have a single write commit going at any given time. https://github.com/lightningnetwork/lnd/pull/7992 will enable multiple write transactions to commit at a time.

You should also try increasing bitcoind.blockpollinginterval to something like bitcoind.blockpollinginterval=5m, which will reduce load.

But everything is pointing to some strange issue in the way LND uses Postgres. Sorry I can't be of better help for now.

lnd just exports everything as a key-value store, which we know isn't optimal. Future PRs will be able to take better advantage of server-side query filtering. At this point, what we know is happening is that Postgres is blocking on a long-running operation; if Postgres blocks, then lnd does too, due to the current application-level transaction locking.

If you constantly restart lnd, you are more likely to hit a slow getinfo, as lnd loads the mempool during startup.

@FeatureSpitter thanks for the goroutine dump, that’s very helpful indeed.

@Roasbeef I think we might have another mutex locking issue here, possibly amplified by the single write lock of Postgres.

Here’s the full dump: goroutinedump.txt

What looks suspicious:

--> 8 peers seem to be stuck here:

goroutine 714932 [semacquire, 61 minutes]:
sync.runtime_Semacquire(0x4232f90)
	runtime/sema.go:62 +0x3c
sync.(*WaitGroup).Wait(0x4232f88)
	sync/waitgroup.go:116 +0x68
github.com/lightningnetwork/lnd/peer.(*Brontide).WaitForDisconnect(0x4232d80, 0x5122e80)
	github.com/lightningnetwork/lnd/peer/brontide.go:1102 +0xd8
github.com/lightningnetwork/lnd.(*server).peerTerminationWatcher(0x4372b60, 0x4232d80, 0x5122e80)
	github.com/lightningnetwork/lnd/server.go:4017 +0x74
created by github.com/lightningnetwork/lnd.(*server).peerInitializer in goroutine 714931
	github.com/lightningnetwork/lnd/server.go:3973 +0x114

--> 19 goroutines seem to be stuck here:

goroutine 2264025 [sync.RWMutex.RLock, 30 minutes]:
sync.runtime_SemacquireRWMutexR(0x4372bcc, 0x0, 0x0)
	runtime/sema.go:82 +0x3c
sync.(*RWMutex).RLock(0x4372bc0)
	sync/rwmutex.go:71 +0x58
github.com/lightningnetwork/lnd.(*server).BroadcastMessage(0x4372b60, 0x0, {0x1fe16f8, 0x0, 0x0})
	github.com/lightningnetwork/lnd/server.go:3287 +0x2c
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).sendLocalBatch(0xcf32310, {0x54e4000, 0x0, 0x6e})
	github.com/lightningnetwork/lnd/discovery/gossiper.go:1286 +0x70
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).splitAndSendAnnBatch.func2()
	github.com/lightningnetwork/lnd/discovery/gossiper.go:1260 +0x198
created by github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).splitAndSendAnnBatch in goroutine 50911
	github.com/lightningnetwork/lnd/discovery/gossiper.go:1251 +0x200


--> Waiting for `p.startReady`:

goroutine 2262974 [select, 30 minutes]:
github.com/lightningnetwork/lnd/peer.(*Brontide).Disconnect(0xeeb6b40, {0x1575da0, 0xdb130c0})
	github.com/lightningnetwork/lnd/peer/brontide.go:1115 +0x94
github.com/lightningnetwork/lnd.(*server).removePeer(0x4372b60, 0xeeb6b40)
	github.com/lightningnetwork/lnd/server.go:4309 +0x100
github.com/lightningnetwork/lnd.(*server).InboundPeerConnected(0x4372b60, {0x1583294, 0x15632960})
	github.com/lightningnetwork/lnd/server.go:3586 +0x6f0
created by github.com/btcsuite/btcd/connmgr.(*ConnManager).listenHandler in goroutine 714406
	github.com/btcsuite/btcd@v0.23.5-0.20230905170901-80f5a0ffdf36/connmgr/connmanager.go:507 +0x250

Looking closer, I think this might actually be because goroutine 714108 is holding the unique Postgres write lock but is waiting for an answer:

goroutine 714108 [runnable]:
syscall.Syscall(0x4, 0x3b, 0x810e800, 0x43)
	syscall/syscall_linux.go:69 +0x8
syscall.write(0x3b, {0x810e800, 0x43, 0x400})
	syscall/zsyscall_linux_arm.go:949 +0x48
syscall.Write(...)
	syscall/syscall_unix.go:209
internal/poll.ignoringEINTRIO(...)
	internal/poll/fd_unix.go:736
internal/poll.(*FD).Write(0x11648780, {0x810e800, 0x43, 0x400})
	internal/poll/fd_unix.go:380 +0x304
net.(*netFD).Write(0x11648780, {0x810e800, 0x43, 0x400})
	net/fd_posix.go:96 +0x38
net.(*conn).Write(0x1eb5aa98, {0x810e800, 0x43, 0x400})
	net/net.go:191 +0x48
github.com/jackc/pgconn.(*PgConn).execExtendedSuffix(0x7bd0000, {0x810e800, 0x2d, 0x400}, 0x7bd0054)
	github.com/jackc/pgconn@v1.14.0/pgconn.go:1153 +0x138
github.com/jackc/pgconn.(*PgConn).ExecPrepared(0x7bd0000, {0x1580210, 0x12f3dea0}, {0x11a29240, 0xe}, {0x154a2840, 0x1, 0x4}, {0x115a11a0, 0x1, ...}, ...)
	github.com/jackc/pgconn@v1.14.0/pgconn.go:1108 +0x118
github.com/jackc/pgx/v4.(*Conn).Query(0x114e8580, {0x1580210, 0x12f3dea0}, {0x64218c0, 0x56}, {0x479fa70, 0x2, 0x2})
	github.com/jackc/pgx/v4@v4.18.1/conn.go:655 +0xc38
github.com/jackc/pgx/v4/stdlib.(*Conn).QueryContext(0xb19bcc0, {0x1580210, 0x12f3dea0}, {0x64218c0, 0x56}, {0x21152870, 0x1, 0x1})
	github.com/jackc/pgx/v4@v4.18.1/stdlib/sql.go:436 +0x330
database/sql.ctxDriverQuery({0x1580210, 0x12f3dea0}, {0xef3c0cb8, 0xb19bcc0}, {0x0, 0x0}, {0x64218c0, 0x56}, {0x21152870, 0x1, ...})
	database/sql/ctxutil.go:48 +0x94
database/sql.(*DB).queryDC.func1()
	database/sql/sql.go:1748 +0x178
database/sql.withLock({0x157ce68, 0x7d725a0}, 0x6db5a1c)
	database/sql/sql.go:3502 +0x84
database/sql.(*DB).queryDC(0x42b0bd0, {0x1580210, 0x12f3dea0}, {0x15801d0, 0x82d30b0}, 0x7d725a0, 0xaec03e0, {0x64218c0, 0x56}, {0x6db5b40, ...})
	database/sql/sql.go:1743 +0x174
database/sql.(*Tx).QueryContext(0xdba5bc0, {0x1580210, 0x12f3dea0}, {0x64218c0, 0x56}, {0x6db5b40, 0x1, 0x1})
	database/sql/sql.go:2497 +0xbc
database/sql.(*Tx).QueryRowContext(...)
	database/sql/sql.go:2515
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*readWriteTx).QueryRow(0x15ace3a8, {0x64218c0, 0x56}, {0x6db5b40, 0x1, 0x1})
	github.com/lightningnetwork/lnd/kvdb@v1.4.4/sqlbase/readwrite_tx.go:180 +0x60
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*readWriteCursor).Next(0x6db5b80)
	github.com/lightningnetwork/lnd/kvdb@v1.4.4/sqlbase/readwrite_cursor.go:93 +0x1f8
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*readWriteBucket).ForEach(0x5130c10, 0x1fe64c60)
	github.com/lightningnetwork/lnd/kvdb@v1.4.4/sqlbase/readwrite_bucket.go:62 +0x7c
github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).pruneGraphNodes(0x448bf80, {0x158597c, 0x5130b20}, {0x158597c, 0x5130c10})
	github.com/lightningnetwork/lnd/channeldb/graph.go:1458 +0x194
github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).PruneGraphNodes.func1({0x15844f4, 0x15ace3a8})
	github.com/lightningnetwork/lnd/channeldb/graph.go:1410 +0x11c
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*db).executeTransaction.func2()
	github.com/lightningnetwork/lnd/kvdb@v1.4.4/sqlbase/db.go:311 +0x30
github.com/lightningnetwork/lnd/kvdb/sqlbase.catchPanic(0x6db5cb0)
	github.com/lightningnetwork/lnd/kvdb@v1.4.4/sqlbase/db.go:196 +0x68
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*db).executeTransaction(0x439a5c0, 0x14ce83e0, 0x13dcd34, 0x0)
	github.com/lightningnetwork/lnd/kvdb@v1.4.4/sqlbase/db.go:311 +0x144
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*db).Update(0x439a5c0, 0x14ce83e0, 0x13dcd34)
	github.com/lightningnetwork/lnd/kvdb@v1.4.4/sqlbase/db.go:227 +0x34
github.com/lightningnetwork/lnd/kvdb.Update(...)
	github.com/lightningnetwork/lnd/kvdb@v1.4.4/interface.go:16
github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).PruneGraphNodes(0x448bf80)
	github.com/lightningnetwork/lnd/channeldb/graph.go:1396 +0x84
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).pruneZombieChans(0x5622840)
	github.com/lightningnetwork/lnd/routing/router.go:990 +0x76c
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler(0x5622840)
	github.com/lightningnetwork/lnd/routing/router.go:1232 +0xbc4
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).Start in goroutine 1
	github.com/lightningnetwork/lnd/routing/router.go:689 +0x944

So I'm not sure whether things being locked on the server's main mutex is a problem in itself, or whether it only becomes one because there can only be a single DB writer.

@FeatureSpitter can you please do the following:

  • Check the Postgres log: do you see anything suspicious that would explain why a single query takes so long?
  • Try reducing the db.postgres.timeout=3m value to something smaller, e.g. 1m.
  • Try increasing the db.postgres.maxconnections=10 value (but first check the value actually configured in Postgres; its max_connections must be at least as large as the number you use in lnd).
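For the last two points, the lnd.conf changes might look like this (example values only; on the Postgres side, `SHOW max_connections;` reveals the server's limit, which must be at least as large as lnd's setting):

```
[postgres]
db.postgres.timeout=1m
; Keep this below the max_connections configured in Postgres itself:
db.postgres.maxconnections=20
```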