lnd: [bug]: commitment transaction dips peer below chan reserve: internal error causing eclair to FC

Background

For an unknown reason (possibly a race condition), my LND instance determined that a downstream HTLC commitment would have caused a channel balance to go negative and dip below our channel reserve for that channel. That caused my LND to send my peer an internal error. @DerEwige is running Eclair, which immediately goes on-chain when it receives an internal error.

Two things to validate:

  • the negative balance might happen quite often, for instance when more than one HTLC is opened at the same time. Why would this cause an internal error and not just a failed HTLC?
  • could we use a different, more verbose error message, so that other implementations don’t interpret this as a major issue and go on-chain?

Your environment

  • version of lnd: lnd version 0.16.2-beta commit=v0.16.2-beta
  • which operating system (uname -a on *Nix): Linux debian-nuc 5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux
  • version of btcd, bitcoind, or other backend: Bitcoin Core version v24.0.1
  • any other relevant environment details: LND latest release AMD64 binary

Steps to reproduce

Log excerpt (full log grep of channel-id and pubkey attached)

2023-05-01 01:48:09.421 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 239283580 mSAT is larger than 234681168 mSAT
2023-05-01 01:48:17.098 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 766964171 mSAT is larger than 746420614 mSAT
2023-05-01 01:49:35.142 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 74920943 mSAT is larger than 11413159 mSAT
2023-05-01 01:53:07.050 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 63151008 mSAT is larger than 11413159 mSAT
2023-05-01 01:59:08.683 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 21053936 mSAT is larger than 11413159 mSAT
2023-05-01 01:59:50.614 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 207614045 mSAT is larger than 11413159 mSAT
2023-05-01 02:02:31.406 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 50011450 mSAT is larger than 11413159 mSAT
2023-05-01 02:02:47.147 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 50011450 mSAT is larger than 11413159 mSAT
2023-05-01 02:03:09.108 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 20556850 mSAT is larger than 11413159 mSAT
2023-05-01 02:04:24.172 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 11863529 mSAT is larger than 11413159 mSAT
2023-05-01 02:06:55.592 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): insufficient bandwidth to route htlc: 73006424 mSAT is larger than 11413159 mSAT
2023-05-01 02:08:36.525 [ERR] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): failing link: unable to update commitment: commitment transaction dips peer below chan reserve: our balance below chan reserve with error: internal error
2023-05-01 02:08:36.525 [ERR] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): link failed, exiting htlcManager
2023-05-01 02:08:36.525 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): exited
2023-05-01 02:08:36.525 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): stopping
2023-05-01 02:08:55.257 [INF] PEER: Peer(023631624e30ef7bcb2887e600da8e59608a093718bc40d35b7a57145a0f3db9af): Loading ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0), isPending=false
2023-05-01 02:08:55.257 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): starting
2023-05-01 02:08:55.257 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0)
2023-05-01 02:08:55.257 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): HTLC manager started, bandwidth=0 mSAT
2023-05-01 02:08:55.257 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): attempting to re-synchronize
2023-05-01 02:11:54.862 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): stopping
2023-05-01 02:11:54.862 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): error when syncing channel states: link shutting down
2023-05-01 02:11:54.862 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): exited
2023-05-01 02:12:04.983 [INF] PEER: Peer(023631624e30ef7bcb2887e600da8e59608a093718bc40d35b7a57145a0f3db9af): Loading ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0), isPending=false
2023-05-01 02:12:04.983 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): starting
2023-05-01 02:12:04.983 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0)
2023-05-01 02:12:04.983 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): HTLC manager started, bandwidth=0 mSAT
2023-05-01 02:12:04.984 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): attempting to re-synchronize
2023-05-01 02:15:03.445 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): stopping
2023-05-01 02:15:03.445 [WRN] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): error when syncing channel states: link shutting down
[....]
2023-05-01 02:40:20.520 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0)
2023-05-01 02:40:20.520 [INF] HSWC: ChannelLink(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): attempting to re-synchronize
2023-05-01 02:41:14.987 [INF] NTFN: Dispatching confirmed spend notification for outpoint=b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0, script=0 b0b3c81deb80ca8d88999cee148c5b2b5f1abef226d4d434139bdde0679f903e at current height=787714: 3daf8760142f17a9a5156be37a6da706deafe088c61e1b0275d313be666cb67b[0] spending b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0 at height=787714
2023-05-01 02:41:14.996 [INF] CNCT: Unilateral close of ChannelPoint(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0) detected
2023-05-01 02:41:14.997 [INF] CNCT: ChannelArbitrator(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): remote party has closed channel out on-chain
2023-05-01 02:41:28.618 [INF] CNCT: ChannelArbitrator(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): still awaiting contract resolution
2023-05-01 02:41:28.737 [INF] CNCT: *contractcourt.commitSweepResolver(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): Sweeping with witness type: CommitmentToRemoteConfirmed
2023-05-01 02:41:28.737 [INF] CNCT: *contractcourt.commitSweepResolver(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): sweeping commit output
2023-05-01 02:43:14.539 [INF] CNCT: *contractcourt.commitSweepResolver(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): local commitment output fully resolved by sweep tx: 6cf35d29f4170599173ed44cbd0bb5e8280699049447528c51405c906f416192
2023-05-01 02:43:14.578 [INF] CNCT: ChannelArbitrator(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): a contract has been fully resolved!
2023-05-01 02:43:14.578 [INF] CNCT: ChannelArbitrator(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0): still awaiting contract resolution
2023-05-01 02:59:20.520 [ERR] PEER: Peer(023631624e30ef7bcb2887e600da8e59608a093718bc40d35b7a57145a0f3db9af): Channel(b7e5622b25741ec8a5349feffe52059c58a0ae4e6dbf9ad466e6acfaab28c3ef:0) request enabling failed due to inactive link

Expected behaviour

See the proposal above; the details are better evaluated by folks who are more accustomed to cross-implementation error relays.

  • let the HTLC fail via the usual process
  • don’t send an internal error

Actual behaviour

Sending an internal error, causing Eclair to force-close. Log attached: speedupln-fc.log

Most upvoted comments

Just to add a little bit more color:

  1. We send an HTLC to the peer, this calls AddHTLC. This will call validateCommitmentSanity for the remote commitment with their updates on our commitment transaction (lc.localCommitChain.tail().theirMessageIndex) and all of our updates (lc.localUpdateLog.logIndex). It’ll also call validateCommitmentSanity for the local commit with all of their updates (lc.remoteUpdateLog.logIndex) and all of our updates (lc.localUpdateLog.logIndex)
  2. We sign the commit and send it over
  3. We receive an HTLC from the peer, calling ReceiveHTLC. This will call validateCommitmentSanity for the local commit with all the remote updates (lc.remoteUpdateLog.logIndex) and all of our updates that are on their lowest unrevoked commitment (we call this the tail – lc.remoteCommitChain.tail().ourMessageIndex). This doesn’t trigger the reserve check because both HTLCs aren’t included, as they haven’t yet revoked the tail commit
  4. They send a commit sig
  5. We send a revoke
  6. They send a revoke and a sig
  7. When we receive the revoke, we’ll attempt to sign a new commitment. The issue, though, is that we’ll now include both HTLCs and fail the reserve check

There are probably a bunch of things we could do here – we could be more strict in ReceiveHTLC and include more HTLCs in the call to validateCommitmentSanity (e.g. check all of our updates, all of their updates), but that could be too strict. I think the best thing here is to use the fee-buffer and not try to worry about involved update accounting. It won’t work for every case – maybe there’s a large fee spike and a concurrent update, or maybe somebody sends a lot of HTLCs in a single update round. But I think it should be good enough. I don’t know that it makes sense to change the error message, because with LND-LND people might have to force close anyway since the channel gets stuck in a reestablish loop.
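
For illustration, here is a minimal standalone sketch of that failure mode (this is not lnd’s actual validateCommitmentSanity; the helper names are invented, and the balances, feerate, and weights are taken from the incident analyzed in the comments below). Each add passes the reserve check on its own, but once both concurrent HTLCs are locked into the next commitment, the fee-paying side dips below its reserve:

  package main

  import "fmt"

  // Simplified model only: project the channel opener's balance on the next
  // commitment and compare it against the channel reserve.
  const (
      feeratePerKw = 2_500      // sat/kw
      htlcWeight   = 173        // weight added per HTLC output
      baseWeight   = 1_124      // anchor commitment weight without HTLCs
      anchorsMsat  = 660_000    // two 330 sat anchor outputs
      reserveMsat  = 30_000_000 // 30_000 sat channel reserve
  )

  // commitFeeMsat is what the opener pays for a commitment with numHTLCs
  // HTLC outputs (weight * sat/kw happens to equal the fee in msat).
  func commitFeeMsat(numHTLCs int) int64 {
      return int64(baseWeight+numHTLCs*htlcWeight)*feeratePerKw + anchorsMsat
  }

  // belowReserve projects the opener's balance after subtracting its own
  // pending outgoing HTLC and the commitment fee.
  func belowReserve(balanceMsat, outgoingMsat int64, numHTLCs int) bool {
      return balanceMsat-outgoingMsat-commitFeeMsat(numHTLCs) < reserveMsat
  }

  func main() {
      const (
          balance = 45_743_159 // lnd's to_local before the concurrent adds (msat)
          ourHTLC = 11_375_456 // lnd's own outgoing HTLC (msat)
      )
      // Our add on top of the one pre-existing incoming HTLC: still fine.
      fmt.Println(belowReserve(balance, ourHTLC, 2)) // false (30_032_703 msat left)
      // The peer's concurrent incoming HTLC locked in as well: below reserve.
      fmt.Println(belowReserve(balance, ourHTLC, 3)) // true (29_600_203 msat left)
  }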

I have been able to reproduce the flow of messages that leads to the two channel states shared by @DerEwige. Both of them have exactly the same pattern, so I’ll only detail what happened to the first one (with channel_id efc328abfaace666d49abf6d4eaea0589c0552feef9f34a5c81e74252b62e5b7).

First of all, an HTLC eclair -> lnd is cross-signed without any issues:

  Eclair                      LND
    |                          |
    | update_add_htlc          | htlc_id = 1_477_859
    |------------------------->|
    | commit_sig               |
    |------------------------->|
    |           revoke_and_ack |
    |<-------------------------|
    |               commit_sig |
    |<-------------------------|
    | revoke_and_ack           |
    |------------------------->|

At that point, both commitments are synchronized and contain only that HTLC.

Then eclair and lnd try adding an HTLC at the same time:

  Eclair                      LND
    |                          |
    | update_add_htlc          | htlc_id = 1_477_861
    |------------------------->|
    |          update_add_htlc | htlc_id = 120_896
    |<-------------------------|
    | commit_sig    commit_sig |
    |-----------+  +-----------|
    |            \/            |
    |            /\            |
    |<----------+  +---------->|
    |           revoke_and_ack |
    |<-------------------------|
    | revoke_and_ack           |
    |------------------------->|
    | commit_sig               |
    |------------------------->|
    |                    error |
    |<-------------------------|

Here are the details of the commitments when eclair receives lnd’s error message:

  • commitment feerate: 2500 sat/kw
  • channel reserve: 30_000 sat (same on both sides)
  • eclair’s commitment:
    • commitment number: 3_194_658
    • to_local: 2_836_315_792 msat
    • to_remote: 34_367_703 msat
    • HTLC OUT: id=1_477_859 amount=117_941_049 msat
    • HTLC IN: id=120_896 amount=11_375_456 msat
  • lnd’s commitment:
    • commitment number: 3_195_511
    • to_local: 45_743_159 msat
    • to_remote: 2_756_274_986 msat
    • HTLC IN: id=1_477_859 amount=117_941_049 msat
    • HTLC IN: id=1_477_861 amount=80_040_806 msat
  • lnd’s next commitment (signed by eclair, previous not revoked yet):
    • commitment number: 3_195_512
    • to_local: 34_367_703 msat
    • to_remote: 2_756_274_986 msat
    • HTLC IN: id=1_477_859 amount=117_941_049 msat
    • HTLC IN: id=1_477_861 amount=80_040_806 msat
    • HTLC OUT: id=120_896 amount=11_375_456 msat

You can check that in the files shared by @DerEwige.

The important thing to note is that there is nothing that makes either peer dip into their reserve. The difference between lnd’s current commitment and the next one (which seems to trigger the force-close) is an additional HTLC going from lnd to eclair. So it isn’t triggered by something eclair sent; lnd decided to add that HTLC (and it doesn’t make anyone dip into their reserve). It seems to me that this internal error sent by lnd shouldn’t be treated as an error, as there’s nothing wrong with the state of the commitments or the messages exchanged. @roasbeef can you (or someone else from lnd) confirm this? I’m very tempted to explicitly parse internal error messages and ignore them in eclair (like core-lightning does), as such bugs are very costly for our users who end up paying for a force-close.

@morehouse hehe I came here to link that very issue. AFAICT, it’s an edge case where we send an HTLC (as the initiator) and can pay for that HTLC, but then the remote party sends another HTLC concurrently (they haven’t seen ours yet); once that HTLC is ACK’d and a sig is ready to go out, we fail, as adding it would dip us below the reserve.

Sure, you can find that code here: https://github.com/ACINQ/eclair/blob/3547f87f664c5e956a6d13af530a3d1cb6fc1052/eclair-core/src/main/scala/fr/acinq/eclair/channel/Commitments.scala#L435

This function is called when trying to send an outgoing HTLC: if that HTLC violates some limit (such as this additional buffer), we simply don’t send it (and try another channel with the same peer, or if none work, propagate the failure to the corresponding upstream channel).

With the current protocol, where each node sends out a CommitSig almost immediately (~50 ms) after sending out the UpdateAddHTLC, we would also make the channel unusable because we need a revocation from the peer. So I think the fee buffer is our best shot at this problem?

Yes, but then this would be the third case.

The case we’ve covered in lnd,

  1. Alice sends update_add_htlc
  2. Bob receives Alice’s HTLC
  3. Bob sends update_add_htlc, which would fail because of the validateAddHtlc check.
  4. Alice would fail too due to the same check.

The case we should cover in lnd,

  1. Alice sends update_add_htlc
  2. Bob sends update_add_htlc
  3. Alice receives Bob’s HTLC, and should fail here
  4. Bob receives Alice’s HTLC, and should fail here

We should add a method validateReceiveHtlc similar to validateAddHtlc and use it in ReceiveHTLC.

The case cannot be fixed atm,

  1. Alice sends update_add_htlc
  2. Alice sends commitment_signed <- commits this HTLC on Alice’s remote chain
  3. Bob sends update_add_htlc
  4. Bob sends commitment_signed <- commits this HTLC to Bob’s remote chain
  5. Alice receives Bob’s HTLC and commit sig, but fails it due to dipping below channel reserve
  6. Bob receives Alice’s HTLC and commit sig, but fails it due to dipping below channel reserve

There are a few mitigations for this case,

  1. implement #7721 using @ziggie’s suggestion
  2. change IdealCommitFeeRate used here to return min relay fee for anchor channels and let the fee bumping be handled by CPFP.

@t-bast we had an initial workaround which helped to spur that spec PR here. In practice, it looks like it wasn’t enough to fully address this edge case.

Indeed LND is missing the fee spike buffer (see https://github.com/lightningnetwork/lnd/issues/7721). We should implement that.

The fee spike buffer helps, but IIUC it doesn’t completely prevent concurrent HTLC problems. But even if an HTLC race occurs, why do we need to error and force close? If the next commitment would violate the channel reserve, can’t we just disconnect to drop the proposed updates? Then on reconnect we can do exponential backoff to prevent the race from reoccurring indefinitely.
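
A minimal sketch of the disconnect-and-retry idea (purely illustrative, not lnd behavior): instead of sending an error, the node could drop the connection so both sides discard the unsigned updates, and back off exponentially before reconnecting so the same race is unlikely to repeat:

  package main

  import (
      "fmt"
      "time"
  )

  // Illustrative backoff schedule for reconnect attempts after dropping a
  // connection to discard conflicting unsigned updates. Not lnd code.
  func main() {
      delay := 1 * time.Second
      const maxDelay = 64 * time.Second
      for attempt := 1; attempt <= 5; attempt++ {
          fmt.Printf("reconnect attempt %d after %v\n", attempt, delay)
          if delay < maxDelay {
              delay *= 2
          }
      }
  }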

Sure, you can find that code here: https://github.com/ACINQ/eclair/blob/3547f87f664c5e956a6d13af530a3d1cb6fc1052/eclair-core/src/main/scala/fr/acinq/eclair/channel/Commitments.scala#L435

Basically Eclair (I think also CLN) has a fee buffer of 2x the current commitment fee. (The additional-HTLC consideration is already consumed by the HTLC that will be sent out for the current payment request, so I do not count it here; we already do this.) Every new incoming HTLC adds approx. 20% in fees, so the buffer of 200% (2x the current commitment fee) buys us 10 HTLC slots for concurrent HTLCs. The buffer grows when HTLCs are already on the channel (the commitment tx is bigger), which acts as a nice HTLC limiter when our balance is very low. I like this approach as well and would implement the additional fee buffer; let me know what you guys think 👻

This would basically implement https://github.com/lightning/bolts/pull/740, especially:

The node _responsible_ for paying the Bitcoin fee should maintain a "fee
spike buffer" on top of its reserve to accommodate a future fee increase.
Without this buffer, the node _responsible_ for paying the Bitcoin fee may
reach a state where it is unable to send or receive any non-dust HTLC while
maintaining its channel reserve (because of the increased weight of the
commitment transaction), resulting in a degraded channel. See [#728](https://github.com/lightningnetwork/lightning-rfc/issues/728)
for more details.

What I am wondering though: why is eclair signing this breach of the contract? Eclair must also know that the remote commitment dips below the reserve; do you actually log this, even as a warning?

Eclair would indeed have complained at a later step and sent an error.

Hmm, we have a dust exposure check in place, but that only counts dust HTLCs in the code base.

Sorry that was the wrong “additional reserve”, I meant this one: https://github.com/lightning/bolts/pull/740

As a potential fix I would add a safety net for the peer who opened the channel: let’s say we always keep an additional buffer of 10 * IncomingHTLC_Weight (173) * CFee (2500 sat/kweight) = 4325 sats (in this example with a CFee of 2500), so that we would not fall below the reserve if the peer added 10 incoming HTLCs in one revocation window.
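
As a quick standalone check of that proposed buffer size (a sketch only, using the numbers from the comment above, not an lnd implementation):

  package main

  import "fmt"

  // Safety buffer proposal: keep room above the reserve for 10 additional
  // incoming HTLC outputs at the current commitment feerate.
  func main() {
      const (
          htlcWeight   = 173   // weight units per incoming HTLC output
          feeratePerKw = 2_500 // sat/kw
          slots        = 10    // concurrent incoming HTLCs to leave room for
      )
      bufferSat := slots * htlcWeight * feeratePerKw / 1000
      fmt.Printf("extra buffer above reserve: %d sat\n", bufferSat) // 4325 sat
  }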

Or we could also include unsigned updates from the remote (from the remote_update_log) in our estimation when we would like to add an HTLC to the remote chain? Currently unsigned updates by the remote are in our remote_log bucket as well, but we only include incoming HTLCs in our commitment which our peer signed for (and we acked). But we could leave space for them?

@t-bast I was verifying your numbers and figured out that you neglected the commitment fees the lnd node has to pay:

lnd’s next commitment (signed by eclair, previous not revoked yet):

  • commitment number: 3_195_512
  • to_local: 34_367_703 msat
  • to_remote: 2_756_274_986 msat
  • HTLC IN: id=1_477_859 amount=117_941_049 msat
  • HTLC IN: id=1_477_861 amount=80_040_806 msat
  • HTLC OUT: id=120_896 amount=11_375_456 msat

The cap is 3 million sats: 34_367_703 + 2_756_274_986 + 117_941_049 + 80_040_806 + 11_375_456 = 3_000_000_000 msat

additional commitment fees: (3*173 + 1124) * 2500 sat/kweight = 4107.5 sats (in fees) + 660 sats (2 anchor outputs) = 4767.5 sats

Taking these costs into account, the local balance falls below the reserve:

to_local: 34_367_703 msat - 4_767_500 msat = 29_600_203 msat (which is below the reserve)
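
The arithmetic checks out; here is a standalone snippet reproducing it (the constants are the commitment details shared earlier in this thread, not values pulled from lnd):

  package main

  import "fmt"

  // Reproduces the fee calculation above: three HTLC outputs plus the base
  // commitment weight at 2500 sat/kw, plus two anchor outputs, deducted from
  // lnd's to_local on the next commitment.
  func main() {
      const (
          feeratePerKw = 2_500         // sat/kw
          weight       = 3*173 + 1_124 // HTLC outputs + base commitment weight
          anchorsMsat  = 660_000       // two 330 sat anchors
          toLocalMsat  = 34_367_703    // lnd's to_local on the next commitment
          reserveMsat  = 30_000_000    // 30_000 sat channel reserve
      )
      feeMsat := weight*feeratePerKw + anchorsMsat // weight * sat/kw equals msat: 4_767_500
      remaining := toLocalMsat - feeMsat           // 29_600_203 msat
      fmt.Printf("to_local after fees: %d msat, below reserve: %v\n",
          remaining, remaining < reserveMsat)
  }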

I have an interesting case possibly related to this issue.

Me: LND 0.16.4-beta
My peer: LND 0.16.4-beta (Umbrel)

Both nodes are online and OK. The connection is established without any problems.

My log:

[ERR] HSWC: ChannelLink(xxx:1): failing link: ChannelPoint(xxx:1): received error from peer: chan_id=xxx, err=internal error with error: remote error
[ERR] HSWC: ChannelLink(xxx:1): link failed, exiting htlcManager

My peer log:

[ERR] HSWC: ChannelLink(xxx:1): failing link: unable to update commitment: commitment transaction dips peer below chan reserve: our balance below chan reserve with error: internal error
[ERR] HSWC: ChannelLink(xxx:1): link failed, exiting htlcManager

All of the liquidity in the channel is on my side, remote balance = 60,218 sats.

I’m ready to provide any necessary information.

Can any of us do something to resolve this problem? Should my peer use max-channel-fee-allocation=1 to temporarily work around it?

UPD: max-channel-fee-allocation=1 did not help; the channel was force-closed due to an expired HTLC.