lnd: HTLCs not getting failed upstream after they time out downstream

Background

I see quite a lot of HTLCs timing out instead of failing on mainnet, which is quite annoying because it leads to channel closures. This could be normal if the counterparty were down, but in this particular case I was pretty sure the peer was up and operational, so I tried to find out what happened.

Your environment

  • version of lnd: v0.5.0-beta

Steps to reproduce

Scenario is a payment being relayed through eclair and then through lnd:

… -----> eclair ----- htlcA -----> lnd ----- htlcB -----> …

htlcA lnd_channel_id=544814:1271:1 expiry=545073 htlc_id=7

htlcB lnd_channel_id=540823:2707:1 expiry=545069 htlc_id=15

From the point of view of eclair, nothing happens until block #545073 gets mined, at which point htlcA has timed out, causing eclair to unilaterally close the channel.

I was able to get lnd’s logs (log.zip, thanks @AndrewSamokhvalov); here are what I think are the relevant parts:

"October 9th 2018, 16:39:57.044",DBG,PEER,"Received UpdateAddHTLC(chan_id=b0d69f3c998bfabfc32702e55fbe0ff3e2551d91ca234c692e0fc1b0a1a8745e, id=7, amt=135635112 mSAT, expiry=545073, hash=7202da369b8dae66cb9ac6b0dec53bfe7e254efe53ebfb0d3ef6b6f1ba90d4bc) from 34.239.230.56:9735"
...
"October 9th 2018, 16:39:58.660",DBG,PEER,"Sending UpdateAddHTLC(chan_id=8c8481ae7e2a76af1691599057f2e4bafa7cad382f68eb73a9a1650ee410344d, id=15, amt=135634976 mSAT, expiry=545069, hash=7202da369b8dae66cb9ac6b0dec53bfe7e254efe53ebfb0d3ef6b6f1ba90d4bc) to 47.184.129.94:43178"
...
"October 9th 2018, 16:39:59.664",TRC,HSWC,"ChannelLink(540823:2707:1) revocation window exhausted, unable to send: 1, dangling_opens=([]channeldb.CircuitKey) (len=1 cap=1) {
 (channeldb.CircuitKey) (Chan ID=544814:1271:1, HTLC ID=7)
}
, dangling_closes([]channeldb.CircuitKey) {
}
(the above line is repeated ~100k times)
...
"October 9th 2018, 18:25:28.665",DBG,HSWC,"ChannelLink(540823:2707:1) removing Add packet (Chan ID=544814:1271:1, HTLC ID=7) from mailbox"

I think what happened is that the next peer in the route became unresponsive, and htlcB never got signed. I’m actually not sure whether lnd sent a commit_sig for htlcB or not. It seems like lnd was waiting for a revoke_and_ack that never arrived, and after a while lnd just removed htlcB (I'm not sure what exactly triggered this). So lnd didn’t watch htlcB’s expiry and didn’t fail htlcA between blocks 545069 and 545073, which caused htlcA to eventually time out.
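The window in which htlcA could still have been failed follows directly from the two expiries quoted above. A minimal sketch of that arithmetic in Python; the function name is hypothetical and is not part of lnd or eclair:

```python
def upstream_fail_window(incoming_expiry: int, outgoing_expiry: int) -> int:
    """Blocks between the outgoing HTLC's expiry and the incoming HTLC's expiry."""
    if outgoing_expiry > incoming_expiry:
        # A forwarding node should never offer an HTLC that outlives the one it received.
        raise ValueError("outgoing HTLC must not expire after the incoming HTLC")
    return incoming_expiry - outgoing_expiry


# Expiries taken from the report: htlcA is incoming, htlcB is outgoing.
HTLC_A_EXPIRY = 545073
HTLC_B_EXPIRY = 545069

window = upstream_fail_window(HTLC_A_EXPIRY, HTLC_B_EXPIRY)
print(window)  # 4
```

In this case that leaves 4 blocks in which the downstream HTLC could have been resolved and htlcA failed before eclair closed the channel.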

Expected behaviour

  • If lnd sent a commit_sig containing htlcB, it should have watched the blockchain, closed the downstream channel after block 545069 and failed htlcA.

  • If lnd didn’t send a commit_sig, then it should have fast-failed htlcA.

In any case, since the issue is between lnd and the next node, it shouldn’t have caused the upstream channel to close.
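The two cases above can be written down as a small decision function. This is only an illustrative sketch of the expected behaviour, not lnd's actual logic: the names `Action` and `expected_action` are hypothetical, and "after block 545069" is assumed to mean that the chain tip has reached the outgoing HTLC's expiry height.

```python
from enum import Enum


class Action(Enum):
    FAST_FAIL_UPSTREAM = "fail htlcA immediately"
    WAIT = "keep waiting for the downstream peer"
    CLOSE_DOWNSTREAM_THEN_FAIL_UPSTREAM = "close the downstream channel, then fail htlcA"


def expected_action(commit_sig_sent: bool, block_height: int, outgoing_expiry: int) -> Action:
    """Expected handling of the incoming HTLC, per the two cases in the report."""
    if not commit_sig_sent:
        # htlcB was never committed to, so nothing downstream can claim it.
        return Action.FAST_FAIL_UPSTREAM
    if block_height >= outgoing_expiry:
        # Assumption: the expiry counts as reached once the tip is at that height.
        return Action.CLOSE_DOWNSTREAM_THEN_FAIL_UPSTREAM
    return Action.WAIT


# htlcB expires at 545069 in the report.
print(expected_action(commit_sig_sent=True, block_height=545069, outgoing_expiry=545069).name)
```

Neither branch ends with the upstream channel being closed, which is the point of the report.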

Actual behaviour

lnd ignored htlcA, and htlcA eventually timed out.

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 17 (14 by maintainers)

Most upvoted comments

Thanks for the detailed report! We have a fix up that ensures we won’t let the incoming link time out unnecessarily.

Thanks, I just tested with my repro steps and it works nicely.

I’m not sure what the status or priority of this is, but it is currently the main cause of channel failures on our node, and has been for quite some time (e.g. in February it has happened 26 times so far).

That’s my point actually: in an A->B->C setting, if there is a problem between B and C (whatever the problem), it should never lead to the A-B channel getting closed. That’s what cltv_expiry_delta is for.

If the B-C link is in an indeterminate state (for whatever reason), and this persists for more than the delta, then the channel will indeed be closed. In this case, we can’t cancel that HTLC on the outgoing link, as it isn’t yet fully locked in and is in a partial limbo state (they have a state but we don’t, or the other way around).

If you close B-C, then once your or your peer’s commitment tx is confirmed, it won’t be in an indeterminate state anymore (the HTLC either will or will not be in the commit tx) and you will be able to fail or fulfill it in A-B. All of that happens within cltv_expiry_delta.
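As a rough sketch of that argument: once a commitment tx for B-C has confirmed, the state of the outgoing HTLC is known and the upstream decision becomes mechanical. The function and its return labels below are hypothetical illustrations, not lnd code, and the "HTLC is in the commit tx but no preimage is known" branch assumes B reclaims the HTLC on-chain after its expiry before failing upstream.

```python
from typing import Optional


def resolve_upstream(htlc_in_confirmed_commit: bool, preimage: Optional[bytes]) -> str:
    """Decide what to do with the A-B HTLC once a B-C commitment tx has confirmed."""
    if not htlc_in_confirmed_commit:
        # The outgoing HTLC never made it on-chain, so C can never claim it.
        return "fail"
    if preimage is not None:
        # C revealed the preimage, so B can settle with A.
        return "fulfill"
    # Assumption: B waits for the outgoing expiry, times the HTLC out on-chain,
    # and only then fails it upstream.
    return "fail_after_timeout"


print(resolve_upstream(htlc_in_confirmed_commit=False, preimage=None))  # fail
```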