lnd: [bug]: lncli getinfo and LND in general, getting stuck at COMMIT when using Postgres
There are no errors in the logs, other commands like walletbalance and getnetworkinfo return (they take a few seconds tho), but getinfo and listpeers and maybe others take forever to return… no error in the logs. I am using 0.17 RC3 on raspberry pi 4 8GB. The CPU is at 1% or less most of the time, with some spikes at 40-70% either from LND or Bitcoind…
Logs: https://pastebin.com/W10vaKgF
LND: lncli version 0.17.0-beta.rc3 commit=v0.17.0-beta.rc3-19-g9f4a8836d
OS: Linux raspberrypi 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux
Examples:
pi@raspberrypi:~ $ time ll walletbalance
{
"total_balance": "33268",
"confirmed_balance": "33268",
"unconfirmed_balance": "0",
"locked_balance": "0",
"reserved_balance_anchor_chan": "0",
"account_balance": {
"default": {
"confirmed_balance": "33268",
"unconfirmed_balance": "0"
}
}
}
real 0m0.678s
user 0m0.102s
sys 0m0.073s
$ time ll getinfo
^C[lncli] rpc error: code = Canceled desc = context canceled
real 3m5.953s
user 0m0.087s
sys 0m0.095s
(I ctrl+c to make it stop)
$ time ll listchannels
^C[lncli] rpc error: code = Canceled desc = context canceled
real 3m28.391s
user 0m0.083s
sys 0m0.159s
(I ctrl+c to make it stop)
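The timings above were collected by hand with Ctrl+C. A minimal sketch of the same measurement with a hard timeout is shown here; `run_with_timeout` is a hypothetical helper, not part of lncli, and the commented usage assumes `lncli` is on the PATH and reachable with its default flags.

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd, giving up after timeout_s seconds instead of waiting for Ctrl+C."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        timed_out, returncode, stdout = False, proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising TimeoutExpired.
        timed_out, returncode, stdout = True, None, ""
    return {
        "timed_out": timed_out,
        "returncode": returncode,
        "elapsed_s": time.monotonic() - start,
        "stdout": stdout,
    }


# Hypothetical usage against a local node (assumes lncli is on PATH):
# for name in ("walletbalance", "getinfo", "listchannels"):
#     result = run_with_timeout(["lncli", name], timeout_s=30)
#     print(name, result["timed_out"], round(result["elapsed_s"], 1))
```

Recording the elapsed time per command makes it easier to compare runs before and after a config change.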
LND.CONF:
[Application Options]
debuglevel=info
maxpendingchannels=10
alias=*******************
color=#ffffff
rpclisten=0.0.0.0:10009
listen=0.0.0.0:9735
restlisten=0.0.0.0:8001
[Bitcoin]
bitcoin.active=1
bitcoin.testnet=0
bitcoin.mainnet=1
bitcoin.node=bitcoind
; The base fee in millisatoshi we will charge for forwarding payments on our
; channels.
bitcoin.basefee=5
; The fee rate used when forwarding payments on our channels. The total fee
; charged is basefee + (amount * feerate / 1000000), where amount is the
; forwarded amount.
bitcoin.feerate=1
[Bitcoind]
bitcoind.rpchost=bitcoind:8332
bitcoind.rpcuser=*****************************
bitcoind.rpcpass=***********************
bitcoind.zmqpubrawblock=tcp://bitcoind:28332
bitcoind.zmqpubrawtx=tcp://bitcoind:28333
; Fee estimate mode for bitcoind. It must be either "ECONOMICAL" or "CONSERVATIVE".
; If unset, the default value is "CONSERVATIVE".
bitcoind.estimatemode=CONSERVATIVE
[tor]
tor.active=true
tor.v3=true
tor.streamisolation=false
tor.skip-proxy-for-clearnet-targets=false
tor.socks=tor:9050
tor.control=tor:9051
tor.password=*********************
tor.targetipaddress=10.5.0.6
[wtclient]
; Activate Watchtower Client. To get more information or configure watchtowers
; run `lncli wtclient -h`.
wtclient.active=true
; Specify the fee rate with which justice transactions will be signed. This fee
; rate should be chosen as a maximum fee rate one is willing to pay in order to
; sweep funds if a breach occurs while being offline. The fee rate should be
; specified in sat/vbyte.
wtclient.sweep-fee-rate=30
[db]
db.backend=postgres
[postgres]
; Postgres connection string.
; Default:
; db.postgres.dsn=
; Example:
db.postgres.dsn=postgres://postgres:********************************************@pgsql:5432/lnd?sslmode=disable
; Postgres connection timeout. Valid time units are {s, m, h}. Set to zero to
; disable.
db.postgres.timeout=3m
; Postgres maximum number of connections. Set to zero for unlimited. It is
; recommended to set a limit that is below the server connection limit.
; Otherwise errors may occur in lnd under high-load conditions.
; Default:
db.postgres.maxconnections=10
Bitcoind seems ok:
$ btc getblockchaininfo
{
"chain": "main",
"blocks": 808403,
"headers": 808403,
"bestblockhash": "0000000000000000000314d5ce9e52d2677c08d3dc8a617690a13897203f53fa",
"difficulty": 54150142369480,
"time": 1695107587,
"mediantime": 1695104573,
"verificationprogress": 0.9999999672514687,
"initialblockdownload": false,
"chainwork": "000000000000000000000000000000000000000056098a6c3da24d6620caa504",
"size_on_disk": 581069847788,
"pruned": false,
"warnings": ""
}
About this issue
- State: closed
- Created 9 months ago
- Reactions: 1
- Comments: 67 (10 by maintainers)
I changed the OS on my Raspi 4 from Raspbian 64bit to Ubuntu Server 23 64bit.
It seems Raspbian 64bit uses the 32bit docker and armhf architecture when fetching and building apps, instead of arm64.
Changing to Ubuntu 23 64bit solved this issue, and so far, LND is no longer clogged.
So in sum, Raspberry 4 can’t run LND in 32 bits with stability.
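One way to spot the mismatch described above (a 64-bit kernel with 32-bit binaries) is to compare the machine name the kernel reports with the pointer width of a running process. This is only a sketch: the helper names are hypothetical, the list of 64-bit machine names is an assumption, and the value returned by `platform.machine()` inside a container can depend on how the container runtime is set up.

```python
import platform
import struct


def pointer_bits():
    """Width of a pointer in the running Python interpreter (32 or 64)."""
    return struct.calcsize("P") * 8


def is_32bit_userland_on_64bit_kernel(machine, bits):
    """True when the kernel reports a 64-bit machine but the process is 32-bit."""
    # Assumed set of 64-bit machine names; extend as needed.
    return machine in ("aarch64", "arm64", "x86_64", "amd64") and bits == 32


if __name__ == "__main__":
    machine = platform.machine()
    bits = pointer_bits()
    print(f"machine={machine} pointer_bits={bits}")
    if is_32bit_userland_on_64bit_kernel(machine, bits):
        print("64-bit kernel with a 32-bit userland")
```

Running it both on the host and inside the container shows whether the two agree.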
I am starting to think that LND is just not compatible with Docker, no matter the bits.
No, the issue here is that you rely on Docker, which, because of the OS, was only installed as 32bit and therefore could only run 32bit applications. RaspiBlitz has been using the 64bit version of lnd for a long time, as RaspiBlitz doesn’t use Docker to run things. There are other known issues with lnd running on 32bit architectures, even before Postgres. And we’ve been recommending to not use 32bit architectures anymore. Though maybe not explicitly enough.
IIUC, the OP is also running Docker in 32 bit mode as well, which may contribute to some thrashing that can slow things down. As mentioned above, if you’re running everything on a single machine, without any replication at all, then sqlite is a better fit for your hardware configuration.
Zooming out: you’re seeing postgres hang on commit, this isn’t related to lnd, as it needs to wait for a commit to finish before it can proceed. As is, we only have a single write commit going at any given time. https://github.com/lightningnetwork/lnd/pull/7992 will enable multiple write transactions to commit at a time.
You should also attempt to increase bitcoind.blockpollinginterval to something like bitcoind.blockpollinginterval=5m which will reduce load.
lnd just exports everything as a key-value store, which we know isn’t optimal. Future PRs will be able to take better advantage of server side query filtering. At this point, what we know is happening is that postgres is blocking on a long running operation; if postgres blocks, then lnd does too, due to the current application level transaction locking.
If you constantly restart lnd, you are more likely to hit a slow getinfo as it will load the mempool during startup.
@FeatureSpitter thanks for the goroutine dump, that’s very helpful indeed.
@Roasbeef I think we might have another mutex locking issue here, possibly amplified by the single write lock of Postgres.
Here’s the full dump: goroutinedump.txt
What looks suspicious:
Looking closer, I think this might actually be because goroutine 714108 is holding the unique Postgres write lock but is waiting for an answer:
So not sure if things being locked on the server’s main mutex is a problem in itself or only really possible if there is only a single DB writer possible.
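The effect of a single write slot can be illustrated with a toy model. This is not lnd's code: the lock, the function names and the events are all hypothetical stand-ins. It only shows that while one writer holds the only slot and waits for an answer, any other writer blocks until the slot is released.

```python
import threading

# Stand-in for the single write transaction slot described in the thread.
write_lock = threading.Lock()


def slow_commit(holding, answer):
    """Take the write slot and keep it until the database 'answers'."""
    with write_lock:
        holding.set()   # signal that the slot is now taken
        answer.wait()   # simulates waiting on a COMMIT that has not returned


def try_write(timeout_s):
    """Return True if the write slot could be taken within timeout_s seconds."""
    acquired = write_lock.acquire(timeout=timeout_s)
    if acquired:
        write_lock.release()
    return acquired


def demo():
    holding, answer = threading.Event(), threading.Event()
    writer = threading.Thread(target=slow_commit, args=(holding, answer))
    writer.start()
    holding.wait()                 # the slow commit now owns the slot
    blocked = not try_write(0.1)   # an unrelated writer cannot proceed
    answer.set()                   # the database finally answers
    writer.join()
    free_again = try_write(0.1)    # the slot is available once more
    return blocked, free_again
```

In this model `demo()` reports the second writer as blocked while the first is waiting, and free to proceed once the first one finishes.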
@FeatureSpitter can you please do the following:
- Lower the db.postgres.timeout=3m value to something smaller, e.g. 1m
- Increase the db.postgres.maxconnections=10 value (but check the actually configured value in Postgres first, that must be at least as large as the number you’re using in lnd)
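The check in the second item can be sketched as a small script. The helper names are hypothetical, and the server-side limit is assumed to have been read separately, for example with `SHOW max_connections;` in psql. The parser below is a simplification that skips section headers and lines starting with `;` or `#`, and it treats an unset value like 0, which is an assumption rather than lnd's documented default.

```python
def read_lnd_options(conf_text):
    """Collect key=value pairs from lnd.conf text, skipping comments and sections."""
    options = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line[0] in ";#[":
            continue
        key, sep, value = line.partition("=")
        if sep:
            options[key.strip()] = value.strip()
    return options


def pool_fits_server(conf_text, server_max_connections):
    """True if lnd's configured pool size is bounded and within the server limit."""
    options = read_lnd_options(conf_text)
    wanted = int(options.get("db.postgres.maxconnections", "0"))
    # lnd.conf describes 0 as "unlimited", which can never be guaranteed to fit.
    return 0 < wanted <= server_max_connections


# Hypothetical usage:
# with open("lnd.conf") as fh:
#     print(pool_fits_server(fh.read(), server_max_connections=100))
```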