lnd: [bug]: Force closes for commitment not revoked, probable SQLite management issue

Background

This is only the latest occurrence of the issue, but the only one for which the peer sent a log snippet. I am constantly online; lnd runs on a beefy 64 GB ECC Xeon server with an OpenZFS mirrored partition across 6 SSDs, with payments-expiration-grace-period=110000h in lnd.conf. My rebalance script, using rebalance-lnd as backend, is firing away, limited to one outgoing HTLC per channel (it won't even start if there are other pending outgoing HTLCs on the same channel). The peer then publishes a unilateral close on me.

For what it's worth, this is my second node, and it basically works the same as my previous one. lnd.conf is the same except for the alias and color; I added max-channel-fee-allocation=1 as suggested by Ziggie. The only other differences from my previous node are the SQLite backend instead of postgres, cltv set to 210 instead of 144, and MUCH beefier hardware versus an old laptop. Another difference is that I never got FCs like these on the previous node. Notably, I had disabled rebalances altogether on the old node during the April and May mempool madness and never resumed them, but up until March I was rebalancing like crazy there as well.

Relevant logs from peer:

[ERR] HSWC: ChannelLink(2c60f2eea9e6283ae45ce963f0848f272f0cbb47d2c6360ffc8ce8f5ba49a6c8:0): failing link: ChannelPoint(2c60f2eea9e6283ae45ce963f0848f272f0cbb47d2c6360ffc8ce8f5ba49a6c8:0): unable to accept new commitment: not enough HTLC signatures with error: invalid commitment
[ERR] HSWC: ChannelLink(2c60f2eea9e6283ae45ce963f0848f272f0cbb47d2c6360ffc8ce8f5ba49a6c8:0): link failed, exiting htlcManager
[INF] HSWC: ChannelLink(2c60f2eea9e6283ae45ce963f0848f272f0cbb47d2c6360ffc8ce8f5ba49a6c8:0): exited
[INF] HSWC: ChannelLink(2c60f2eea9e6283ae45ce963f0848f272f0cbb47d2c6360ffc8ce8f5ba49a6c8:0): stopping
[INF] HSWC: Removing channel link with ChannelID(c8a649baf5e88cfc0f36c6d247bb0c2f278f84f063e95ce43a28e6a9eef2602c)
[WRN] PEER: Peer(039e05e271f537cfa1c060d2364b960b85bd509ac89bae524e4a01948a07b3e8d1): Force closing link(799966:2401:0)
[INF] CNCT: Attempting to force close ChannelPoint(2c60f2eea9e6283ae45ce963f0848f272f0cbb47d2c6360ffc8ce8f5ba49a6c8:0)
[INF] NANN: Announcing channel(2c60f2eea9e6283ae45ce963f0848f272f0cbb47d2c6360ffc8ce8f5ba49a6c8:0) disabled [requested]
[INF] CNCT: ChannelArbitrator(2c60f2eea9e6283ae45ce963f0848f272f0cbb47d2c6360ffc8ce8f5ba49a6c8:0): force closing chan

FC tx: db847d85d362835c6ade2eb5f9d10f909f702e481d721b4e6432e75e9c482566

The 9_504-sat output should correspond to a 9_500-sat outgoing rebalance HTLC started about 40 minutes before the FC, which probably carried a 4-sat routing fee; at least that's what I find in my rebalance script log.

Your environment

  • version of lnd: 0.16.4
  • operating system (uname -a on *Nix): 5.10.0-23-amd64 #1 SMP Debian 5.10.179-2 (2023-07-14) x86_64 GNU/Linux
  • version of btcd, bitcoind, or other backend: bitcoind
  • any other relevant environment details: SQLite backend

Steps to reproduce

Dunno… have my node, rebalance?

Expected behaviour

Not getting force closed by the peer.

Actual behaviour

Getting force closed by the peer.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 16 (1 by maintainers)

Most upvoted comments

Re SQLITE_BUSY, the current logic just sets a value but never actually re-executes the transaction before reporting the error back to the caller. The new SQL scaffolding we have prepped for the upcoming SQL invoice schema introduction properly attempts to retry once we get a busy error: https://github.com/lightningnetwork/lnd/blob/bf5aab9d5246daede3f499af880a5d7be6c6ee88/sqldb/interfaces.go#L219-L231. We should unify the abstractions here so we can have all the serialization error retries in a single place.

Is the SQLite implementation experimental?

Usage of the sqlite backend requires a build tag. There also intentionally isn't a first-class migration script one can run. bbolt (even with all its issues) should be considered the most stable DB backend. We plan to take strides toward making the SQL backends first-class citizens through the end of this year.
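For reference, enabling the sqlite backend at build time looks roughly like this. The `kvdb_sqlite` tag name and the `[db]` settings are assumptions based on lnd's other kvdb backends; verify them against the Makefile and sample lnd.conf of your release:

```shell
# Build lnd with the sqlite kvdb backend compiled in (tag name assumed).
make install tags="kvdb_sqlite"

# Then select the backend in lnd.conf:
#   [db]
#   db.backend=sqlite
```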

So his setup fails here: https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/link.go#L1969-L1973.

DB handling aside, I think the root issue here is that any db failure (w.r.t. operations) should trigger a tear down of the link, similar to stalled commitments. If the logic halted there and didn't proceed (right now it'll continue with that unclean state), then I think the issue here would've been averted altogether.