lnd: [bug]: commitment transaction dips peer below chan reserve: internal error causing eclair to FC
Background
For an unknown reason (possibly a race condition) my LND instance determined that a downstream HTLC commitment would have caused a channel peer's balance to go negative, and below our channel reserve for that channel.
That caused my LND to send my peer an `internal error`. @DerEwige is running Eclair, which immediately goes on-chain when it receives an `internal error`.
Two things to validate:
- The negative balance might happen quite often, for instance when more than one HTLC is opened at the same time. Why does this cause an `internal error` and not just a failed HTLC?
- Could we use a different, more verbose error message, to avoid other implementations interpreting this as a major issue and going on-chain?
Your environment
- version of `lnd`: lnd version 0.16.2-beta commit=v0.16.2-beta
- which operating system (`uname -a` on *Nix): Linux debian-nuc 5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux
- version of `btcd`, `bitcoind`, or other backend: Bitcoin Core version v24.0.1
- any other relevant environment details: LND latest release AMD64 binary
Steps to reproduce
Log excerpt (full log grep of channel-id and pubkey attached)
2023-05-01 01:48:09.421 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 239283580 mSAT is larger than 234681168 mSAT
2023-05-01 01:48:17.098 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 766964171 mSAT is larger than 746420614 mSAT
2023-05-01 01:49:35.142 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 74920943 mSAT is larger than 11413159 mSAT
2023-05-01 01:53:07.050 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 63151008 mSAT is larger than 11413159 mSAT
2023-05-01 01:59:08.683 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 21053936 mSAT is larger than 11413159 mSAT
2023-05-01 01:59:50.614 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 207614045 mSAT is larger than 11413159 mSAT
2023-05-01 02:02:31.406 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 50011450 mSAT is larger than 11413159 mSAT
2023-05-01 02:02:47.147 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 50011450 mSAT is larger than 11413159 mSAT
2023-05-01 02:03:09.108 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 20556850 mSAT is larger than 11413159 mSAT
2023-05-01 02:04:24.172 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 11863529 mSAT is larger than 11413159 mSAT
2023-05-01 02:06:55.592 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 73006424 mSAT is larger than 11413159 mSAT
2023-05-01 02:08:36.525 [ERR] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): failing link: unable to update commitment: commitment transaction dips peer below chan reserve: our balance below chan reserve with error: internal error
2023-05-01 02:08:36.525 [ERR] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): link failed, exiting htlcManager
2023-05-01 02:08:36.525 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): exited
2023-05-01 02:08:36.525 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): stopping
2023-05-01 02:08:55.257 [INF] PEER: Peer(023631624e30ef7bcb2887e600da8e59608a093718bc40d35b7a57145a0f3db9af): Loading ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0), isPending=false
2023-05-01 02:08:55.257 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): starting
2023-05-01 02:08:55.257 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0)
2023-05-01 02:08:55.257 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): HTLC manager started, bandwidth=0 mSAT
2023-05-01 02:08:55.257 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): attempting to re-synchronize
2023-05-01 02:11:54.862 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): stopping
2023-05-01 02:11:54.862 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): error when syncing channel states: link shutting down
2023-05-01 02:11:54.862 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): exited
2023-05-01 02:12:04.983 [INF] PEER: Peer(023631624e30ef7bcb2887e600da8e59608a093718bc40d35b7a57145a0f3db9af): Loading ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0), isPending=false
2023-05-01 02:12:04.983 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): starting
2023-05-01 02:12:04.983 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0)
2023-05-01 02:12:04.983 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): HTLC manager started, bandwidth=0 mSAT
2023-05-01 02:12:04.984 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): attempting to re-synchronize
2023-05-01 02:15:03.445 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): stopping
2023-05-01 02:15:03.445 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): error when syncing channel states: link shutting down
[....]
2023-05-01 02:40:20.520 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0)
2023-05-01 02:40:20.520 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): attempting to re-synchronize
2023-05-01 02:41:14.987 [INF] NTFN: Dispatching confirmed spend notification for outpoint=b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0, script=0 b0b3c81deb80ca8d88999cee148c5b2b5f1abef226d4d434139bdde0679f903e at current height=787714: 3daf8760142f17a9a5156be37a6da706deafe088c61e1b0275d313be666cb67b[0] spending b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0 at height=787714
2023-05-01 02:41:14.996 [INF] CNCT: Unilateral close of ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0) detected
2023-05-01 02:41:14.997 [INF] CNCT: ChannelArbitrator(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): remote party has closed channel out on-chain
2023-05-01 02:41:28.618 [INF] CNCT: ChannelArbitrator(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): still awaiting contract resolution
2023-05-01 02:41:28.737 [INF] CNCT: *contractcourt.commitSweepResolver(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): Sweeping with witness type: CommitmentToRemoteConfirmed
2023-05-01 02:41:28.737 [INF] CNCT: *contractcourt.commitSweepResolver(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): sweeping commit output
2023-05-01 02:43:14.539 [INF] CNCT: *contractcourt.commitSweepResolver(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): local commitment output fully resolved by sweep tx: 6cf35d29f4170599173ed44cbd0bb5e8280699049447528c51405c906f416192
2023-05-01 02:43:14.578 [INF] CNCT: ChannelArbitrator(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): a contract has been fully resolved!
2023-05-01 02:43:14.578 [INF] CNCT: ChannelArbitrator(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): still awaiting contract resolution
2023-05-01 02:59:20.520 [ERR] PEER: Peer(023631624e30ef7bcb2887e600da8e59608a093718bc40d35b7a57145a0f3db9af): Channel(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0) request enabling failed due to inactive link
Expected behaviour
Proposal above; the details are better evaluated by folks who are more accustomed to cross-implementation error relays.
- let the HTLC fail via the usual process
- don't send `internal error`
Actual behaviour
Sending an `internal error`, causing Eclair to force-close
speedupln-fc.log
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 40 (11 by maintainers)
Commits related to this issue
- Ignore lnd's internal errors It seems like lnd sends this error whenever something wrong happens on their side, regardless of whether the channel actually needs to be closed. We ignore it to avoid pa... — committed to ACINQ/eclair by t-bast a year ago
- Ignore lnd's internal errors (#2659) It seems like lnd sends this error whenever something wrong happens on their side, regardless of whether the channel actually needs to be closed. We ignore it t... — committed to ACINQ/eclair by t-bast a year ago
Just to add a little bit more color:
- `AddHTLC`: this will call `validateCommitmentSanity` for the remote commitment with their updates on our commitment transaction (`lc.localCommitChain.tail().theirMessageIndex`) and all of our updates (`lc.localUpdateLog.logIndex`). It'll also call `validateCommitmentSanity` for the local commit with all of their updates (`lc.remoteUpdateLog.logIndex`) and all of our updates (`lc.localUpdateLog.logIndex`).
- `ReceiveHTLC`: this will call `validateCommitmentSanity` for the local commit with all the remote updates (`lc.remoteUpdateLog.logIndex`) and all of our updates that are on their lowest unrevoked commitment (we call this the tail – `lc.remoteCommitChain.tail().ourMessageIndex`). This doesn't trigger the reserve check because both HTLCs aren't included, as they haven't yet revoked the tail commit.

There are probably a bunch of things we could do here – we could be more strict in `ReceiveHTLC` and include more HTLCs in the call to `validateCommitmentSanity` (e.g. check all of our updates, all of their updates), but that could be too strict. I think the best thing here is to use the fee-buffer and not try to worry about involved update accounting. It won't work for every case – maybe there's a large fee spike and a concurrent update, or maybe somebody sends a lot of HTLCs in a single update round. But I think it should be good enough. I don't know that it makes sense to change the error message, because with LND-LND people might have to force close anyway since the channel gets stuck in a reestablish loop.

I have been able to reproduce the flow of messages that leads to the two channel states shared by @DerEwige. Both of them have exactly the same pattern, so I'll only detail what happened to the first one (with `channel_id` `efc328abfaace666d49abf6d4eaea0589c0552feef9f34a5c81e74252b62e5b7`).

First of all, an HTLC `eclair -> lnd` is cross-signed without any issues:

At that point, both commitments are synchronized and contain only that HTLC.
Then `eclair` and `lnd` try adding an HTLC at the same time:

Here are the details of the commitments when `eclair` receives `lnd`'s `error` message:

You can check that in the files shared by @DerEwige.
The important thing to note is that there is nothing that makes either peer dip into their reserve. The difference between `lnd`'s current commitment and the next one (which seems to trigger the force-close) is an additional HTLC going from `lnd` to `eclair`. So it isn't triggered by something `eclair` sent; `lnd` decided to add that HTLC (and it doesn't make anyone dip into their reserve). It seems to me that this `internal error` sent by lnd shouldn't be treated as an `error`, as there's nothing wrong with the state of the commitments or the messages exchanged. @roasbeef can you (or someone else from `lnd`) confirm this? I'm very tempted to explicitly parse `internal error` messages and ignore them in `eclair` (like core-lightning does), as such bugs are very costly for our users who end up paying for a force-close.

@morehouse hehe I came here to link that very issue. AFAICT, it's an edge case where we send an HTLC (as the initiator) and can pay for that HTLC, but then the remote party sends another HTLC concurrently (they haven't seen ours yet); once that HTLC is ACK'd and a sig is ready to go out, we fail, as adding it would dip us below the reserve.
Sure, you can find that code here: https://github.com/ACINQ/eclair/blob/3547f87f664c5e956a6d13af530a3d1cb6fc1052/eclair-core/src/main/scala/fr/acinq/eclair/channel/Commitments.scala#L435
This function is called when trying to send an outgoing HTLC: if that HTLC violates some limit (such as this additional buffer), we simply don’t send it (and try another channel with the same peer, or if none work, propagate the failure to the corresponding upstream channel).
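For readers less familiar with that part of eclair, here is a minimal Go sketch of the same idea under assumed names (none of these types or fields exist in lnd or eclair): before relaying, the sender checks that the outgoing HTLC, the commitment fee and an extra buffer still leave it above the reserve, and if not it simply doesn't use this channel instead of producing a channel-level error.

```go
// Rough sketch of the sender-side behaviour described above, with
// hypothetical names (this is not eclair's or lnd's actual code).
package main

import (
	"errors"
	"fmt"
)

// Amounts are in millisatoshis.
type channelState struct {
	localBalanceMsat int64 // our spendable balance on the next commitment
	reserveMsat      int64 // channel reserve we must never dip below
	commitFeeMsat    int64 // commitment fee (paid by us if we funded the channel)
	feeBufferMsat    int64 // extra headroom for fee spikes / concurrent HTLCs
	weAreFunder      bool
}

var errInsufficientFunds = errors.New("sending would dip below channel reserve")

// canSendHTLC reports whether we can add an outgoing HTLC of amtMsat without
// violating our reserve once the commitment fee and buffer are accounted for.
// On failure the caller just doesn't send on this channel (and e.g. fails the
// upstream HTLC back) rather than emitting a channel-level error.
func canSendHTLC(s channelState, amtMsat int64) error {
	remaining := s.localBalanceMsat - amtMsat
	if s.weAreFunder {
		remaining -= s.commitFeeMsat + s.feeBufferMsat
	}
	if remaining < s.reserveMsat {
		return errInsufficientFunds
	}
	return nil
}

func main() {
	s := channelState{
		localBalanceMsat: 50_000_000,
		reserveMsat:      30_000_000,
		commitFeeMsat:    4_767_500,
		feeBufferMsat:    9_535_000, // e.g. 2x the commitment fee
		weAreFunder:      true,
	}
	if err := canSendHTLC(s, 10_000_000); err != nil {
		fmt.Println("do not relay on this channel:", err)
	}
}
```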
Yes, but then this would be the third case.

The case we've covered in `lnd`:
- `update_add_htlc`
- `update_add_htlc`

which would fail because of the `validateAddHtlc` check.

The case we should cover in `lnd`:
- `update_add_htlc`
- `update_add_htlc`

We should add a method `validateReceiveHtlc`, similar to `validateAddHtlc`, and use it in `ReceiveHTLC` (see the sketch below).

The case that cannot be fixed atm:
- `update_add_htlc`
- `commitment_signed` <- commits this HTLC on Alice's remote chain
- `update_add_htlc`
- `commitment_signed` <- commits this HTLC to Bob's remote chain

There are a few mitigations for this case, e.g. `IdealCommitFeeRate` used here to return the min relay fee for anchor channels and let the fee bumping be handled by CPFP.

@t-bast we had an initial workaround which helped to spur that spec PR here. In practice, it looks like it wasn't enough to fully address this edge case.
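Sketching the `validateReceiveHtlc` idea mentioned above: a heavily simplified, hypothetical version (it only compares balances, whereas lnd's real `validateCommitmentSanity` walks the update logs and both commitment chains) could look something like this.

```go
// Hypothetical sketch (not lnd's actual code) of a receive-side reserve
// check in the spirit of the proposed validateReceiveHtlc: before ACKing an
// incoming HTLC, verify that neither side would dip below its reserve once
// the HTLC and the resulting commitment fee are accounted for.
package htlcsketch

import "fmt"

type balanceCheck struct {
	ourBalanceMsat   int64 // our balance after all pending updates
	theirBalanceMsat int64 // their balance after all pending updates
	ourReserveMsat   int64
	theirReserveMsat int64
	commitFeeMsat    int64 // fee on the updated commitment transaction
	weAreInitiator   bool  // the initiator pays the commitment fee
}

func validateReceiveHtlc(c balanceCheck, incomingAmtMsat int64) error {
	// The incoming HTLC is deducted from the remote balance; the commitment
	// fee is deducted from whichever side opened the channel.
	them := c.theirBalanceMsat - incomingAmtMsat
	us := c.ourBalanceMsat
	if c.weAreInitiator {
		us -= c.commitFeeMsat
	} else {
		them -= c.commitFeeMsat
	}
	if us < c.ourReserveMsat || them < c.theirReserveMsat {
		// Fail only this HTLC (update_fail_htlc) instead of failing the
		// link with an internal error, which peers may treat as fatal.
		return fmt.Errorf("incoming HTLC of %d msat would dip a peer "+
			"below its channel reserve", incomingAmtMsat)
	}
	return nil
}
```

In the concurrent-add scenario from this issue, a check of this kind would let the node fail just the offending HTLC instead of failing the whole link.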
Indeed LND is missing the fee spike buffer (see https://github.com/lightningnetwork/lnd/issues/7721). We should implement that.
The fee spike buffer helps, but IIUC it doesn’t completely prevent concurrent HTLC problems. But even if an HTLC race occurs, why do we need to error and force close? If the next commitment would violate the channel reserve, can’t we just disconnect to drop the proposed updates? Then on reconnect we can do exponential backoff to prevent the race from reoccurring indefinitely.
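Purely as an illustration of that disconnect-and-back-off idea (hypothetical helper, not lnd's actual peer or connection management), something along these lines:

```go
// Illustration only: after dropping the connection to discard unsigned,
// proposed updates, retry the connection with jittered exponential backoff
// so both peers don't immediately race again. Not lnd's actual code.
package peersketch

import (
	"math/rand"
	"time"
)

func reconnectWithBackoff(connect func() error) {
	backoff := 5 * time.Second
	const maxBackoff = 10 * time.Minute

	for {
		if err := connect(); err == nil {
			return
		}
		// Jitter keeps the two peers from retrying in lockstep.
		jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
		time.Sleep(backoff + jitter)

		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```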
Basically Eclair (I think also CLN) has a fee buffer of 2x the current commitment fee. (The additional-HTLC consideration will be consumed by the HTLC that will be sent out for the current payment request, so I do not count it here; we already do this.) Every new incoming HTLC adds approx. 20% in fees, so the buffer of 200% (2x the current commitment fee) buys us 10 HTLC slots for concurrent HTLCs. This buffer increases when HTLCs are on the channel (the commitment tx is bigger), which acts as a nice HTLC limiter when our balance is very low. I like this approach as well and would implement the additional fee buffer, let me know what you guys think 👻
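As a rough sanity check of those ratios — assuming a pre-anchor base commitment of 724 weight units, 173 weight units per HTLC output and a feerate of 2500 sat/kweight (these parameters are my assumptions, not numbers from this thread):

```go
// Back-of-the-envelope check of the 2x-commitment-fee buffer sizing.
// Assumed parameters: 724 WU base commitment (pre-anchor), 173 WU per HTLC
// output, 2500 sat/kweight feerate. Results are approximate.
package main

import "fmt"

func main() {
	const (
		feeRateSatPerKW  = 2500.0
		baseCommitWeight = 724.0
		htlcOutputWeight = 173.0
	)

	commitFee := baseCommitWeight * feeRateSatPerKW / 1000 // 1810 sats
	feeBuffer := 2 * commitFee                             // 3620 sats
	perHTLC := htlcOutputWeight * feeRateSatPerKW / 1000   // 432.5 sats

	fmt.Printf("each extra HTLC adds ~%.0f%% of the base commitment fee\n",
		100*perHTLC/commitFee) // ~24%
	fmt.Printf("a 2x buffer covers roughly %.0f concurrent HTLCs\n",
		feeBuffer/perHTLC) // ~8
}
```

The exact figures shift with the commitment weight (anchors, HTLCs already present), but they land in the same ballpark as the ~20% / ~10-slot estimate above.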
This would basically implement this: https://github.com/lightning/bolts/pull/740 Especially:
Eclair would indeed have complained at a later step and sent an `error`.

Sorry, that was the wrong "additional reserve", I meant this one: https://github.com/lightning/bolts/pull/740
As a potential fix I would add a safety net for the peer who opened the channel: let's say we always keep an additional buffer of 10 * IncomingHTLC_Weight (173) * CFee (2500 sat/kweight) = 4325 sats (in this example with a CFee of 2500 sat/kweight), so that we would not fall below the reserve if the peer added 10 incoming HTLCs in one revocation window.

Or we could include the `remote_update_log` also for unsigned updates by the remote in our estimation when we would like to add an HTLC to the remote chain? Currently unsigned updates by the remote are in our remote_log bucket as well, but we only include incoming HTLCs in our commitment which our peer signed for (and we acked them). But we could leave space for them?

@t-bast I was verifying your numbers and figured out that you kinda neglected the commitment fees the lnd node has to pay:
cap is 3 mio: 34_367_703 + 2_756_274_986 + 117_941_049 + 80_040_806 + 11_375_456 = 3_000_000_000 msats
additional commitment fees: (3*173 + 1124) * 2500 sats/kweight = 4107.5 sats (in fees) + 660 sats (2 anchor values) = 4767.5 sats
Taking these costs into account, the local balance falls below the reserve:
to_local: 34_367_703 msat - 4_767_500 msat = 29_600_203 msat (which is below the reserve)
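Writing that out as a small check — the 30,000 sat reserve is my assumption (the usual 1% of the 3,000,000 sat capacity); the other figures are taken from the comment above:

```go
// Reproducing the fee calculation above with the numbers from this thread.
// Assumption (not stated explicitly above): the channel reserve is the usual
// 1% of the 3_000_000 sat capacity, i.e. 30_000 sats.
package main

import "fmt"

func main() {
	const (
		toLocalMsat  = 34_367_703    // lnd's to_local before commitment costs
		feeRateSatKW = 2500          // sat/kweight
		commitWeight = 3*173 + 1124  // 3 HTLC outputs + anchor commitment base
		anchorsMsat  = 660 * 1000    // two 330 sat anchor outputs
		reserveMsat  = 30_000 * 1000 // assumed 1% reserve
	)

	// weight * (sat/kweight) happens to equal the fee in msat directly.
	commitFeeMsat := commitWeight * feeRateSatKW // 4_107_500 msat
	totalCostMsat := commitFeeMsat + anchorsMsat // 4_767_500 msat
	remaining := toLocalMsat - totalCostMsat     // 29_600_203 msat

	fmt.Printf("commit fee: %d msat, anchors: %d msat\n", commitFeeMsat, anchorsMsat)
	fmt.Printf("to_local after costs: %d msat (reserve: %d msat, below: %v)\n",
		remaining, reserveMsat, remaining < reserveMsat)
}
```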
I have an interesting case possibly related to this issue.
Me: LND 0.16.4-beta
My peer: LND 0.16.4-beta (Umbrel)
Both nodes are online and OK; the connection is established without any problems.
My log:
My peer log:
All of the liquidity in the channel is on my side, remote balance = 60,218 sats.
I’m ready to provide any necessary information.
Can either of us do something to resolve this problem? Should my peer use `max-channel-fee-allocation=1` to resolve it temporarily?

UPD: `max-channel-fee-allocation=1` did not help; the channel was force-closed due to an expired HTLC.