lnd: [bug]: garbage collect `batch-replay` bucket in `sphinxreplay.db`
The sphinxreplay.db on my node is larger than my channel.db (2.2 GByte vs. 1.8 GByte), which seems odd. Please add a feature that reduces the size of sphinxreplay.db.
(And, while we’re at it, please document what the file is actually used for: #6584)
- lnd v0.15.4-beta
- bitcoind v23
Commits related to this issue
- htlcswitch: remove unused batch-replay bucket fixes #7107 — committed to C-Otto/lnd by C-Otto 2 years ago
- htlcswitch: add garbage collection for batch-replay bucket fixes #7107 — committed to C-Otto/lnd by C-Otto 2 years ago
Background on replays and how LND handles them:
- What is a replay and why does it matter?
- When do replays no longer matter in lnd? Once the HTLC's CLTV has expired, at which point the hash is removed from the sharedHashes bucket.

Lifetime of a batch:
- A batch is created in DecayedLog.PutBatch. This happens when DecodeHopIterators is called in processRemoteAdds. This may or may not be the first time the function has run.
- If DecodeHopIterators is called and the batch already exists, the set of replays is fetched and returned. The main purpose of the batch is so that if the link runs DecodeHopIterators again, the set of replays for that forwarding package / batch can be retrieved. This allows the link to fail these HTLCs backwards.
- A batch can be removed once we know DecodeHopIterators will never run for the batch ID again.

Forwarding packages are removed when:
- they reach state FwdStateCompleted in resolveFwdPkgs. This might not occur depending on whether other links are up and able to receive the add/settle/fail.
- the fwdPkgGarbager collects them if they are in state FwdStateCompleted. This might not occur depending on whether other links are up and able to receive the add/settle/fail.
- the channel is closed and Packager.Wipe() is called.

The issue can be solved with a migration and with new non-migration code:
- remove a channel's batches when the channel is closed in CloseChannel.

One thing to note about the above approach is that it's possible that the batches linger until the channel is closed due to forwarding packages not getting removed. However, there is nothing we can do about this because the link may call DecodeHopIterators again and the batch must exist or else things will break.

Yeah, you're right. That's a lot of data. And I think I found the bug. We write to the batch-replay bucket here, but we apparently never delete from it. That needs garbage collection as well. Though I'm not familiar with it enough to say whether that's easy to fix or not.

You're right, it needs to be persisted.
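To make the idea concrete, here is a minimal sketch (not lnd's actual code) of what a garbage-collection pass over the batch-replay bucket could look like, assuming the DB is opened with bbolt and assuming a hypothetical isStale predicate that tells us DecodeHopIterators will never run again for a given batch ID:

```go
package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

// batchReplayBucket is the bucket name discussed in this issue.
var batchReplayBucket = []byte("batch-replay")

// gcBatchReplay deletes every batch entry for which we know
// DecodeHopIterators will never run again. isStale is a hypothetical
// predicate; in lnd it would have to be derived from the
// forwarding-package / channel state.
func gcBatchReplay(db *bolt.DB, isStale func(batchID []byte) bool) error {
	return db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket(batchReplayBucket)
		if b == nil {
			return nil // nothing to clean up
		}

		// Collect keys first; mutating the bucket while iterating
		// with ForEach is not safe in bbolt.
		var stale [][]byte
		err := b.ForEach(func(k, _ []byte) error {
			if isStale(k) {
				keyCopy := append([]byte(nil), k...)
				stale = append(stale, keyCopy)
			}
			return nil
		})
		if err != nil {
			return err
		}

		for _, k := range stale {
			if err := b.Delete(k); err != nil {
				return err
			}
		}
		return nil
	})
}

func main() {
	// Operate on a copy of the file, never on a live node's DB.
	db, err := bolt.Open("sphinxreplay.db.copy", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Example predicate that treats every batch as stale; only here
	// to make the sketch runnable.
	if err := gcBatchReplay(db, func([]byte) bool { return true }); err != nil {
		log.Fatal(err)
	}
}
```

The hard part is the isStale predicate: as the background comment above explains, a batch may only be deleted once the link can never call DecodeHopIterators for that batch ID again.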
If you look at this commit, tests are indeed there. tx.ProcessOnionPacket calls PutBatch. Sure, you can argue there should be a lower-level test there, but we're talking about a commit that's 5 years old. If you wish, you can certainly add a lower-level test here yourself (just grasping at straws here, tbh).

Looks like the issue is that when you revoke, you don't always have new HTLCs to forward; however, this is called unconditionally: https://github.com/lightningnetwork/lnd/blob/f6ae63fd7b53e366f50a96216fb056dd8fb8f4a7/htlcswitch/link.go#L1990-L2049
So that should either just exit early in the func, or only be called if the number of adds is non-zero. We can then add an assertion (that we never try to write an empty batch) to catch regressions in the future.
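A rough, self-contained sketch of those two suggestions (early exit when there are no adds, plus an assertion that an empty batch is never persisted). All names and types below are made up for illustration and do not match lnd's real APIs:

```go
package main

import (
	"errors"
	"fmt"
)

type htlcAdd struct{ id uint64 }

type batch struct {
	id      uint64
	entries []htlcAdd
}

// decodeHopIterators stands in for the call that is currently made
// unconditionally from link.go.
func decodeHopIterators(batchID uint64, adds []htlcAdd) error {
	return putBatch(&batch{id: batchID, entries: adds})
}

// processRemoteAdds exits early when there is nothing to forward, so an
// empty batch is never written to the batch-replay bucket.
func processRemoteAdds(batchID uint64, adds []htlcAdd) error {
	if len(adds) == 0 {
		return nil
	}
	return decodeHopIterators(batchID, adds)
}

// putBatch carries the proposed assertion: persisting an empty batch is
// treated as a caller bug so regressions are caught early.
func putBatch(b *batch) error {
	if len(b.entries) == 0 {
		return errors.New("refusing to persist empty batch")
	}
	// ... persist the batch and its replay set here ...
	return nil
}

func main() {
	fmt.Println(processRemoteAdds(1, nil))                // <nil>: early exit, nothing written
	fmt.Println(processRemoteAdds(2, []htlcAdd{{id: 7}})) // <nil>: batch persisted
	fmt.Println(putBatch(&batch{id: 3}))                  // error: assertion fires
}
```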
I would not recommend that, as I'm not 100% sure of the security implications. I guess you could stop accepting new HTLCs for the number of blocks that corresponds to your max channel's CLTV delta value, and then you'd have some confidence that a replay would fail because of wrong CLTV values. But again, I would not recommend just deleting the file.
OK, I think I got it. If we call PutBatch with two packets a, b twice (for whatever reason) where only b is a replay, for both invocations we need to return only b as a replay. Without the bucket we'd check a and b individually and, for the second invocation, would return both as replays, as now also a was already seen (in the first invocation).

Yes, I saw your PR. But I don't think we can delete the bucket; we need it to make sure a batch of packets wasn't replayed before.
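In other words, the batch-replay bucket makes PutBatch idempotent per batch ID. A small hypothetical in-memory model of that behaviour (not lnd code):

```go
package main

import "fmt"

// replayLog models the behaviour described above: the replay set for a
// batch is computed once and then stored, so a second PutBatch with the
// same batch ID returns the same set instead of re-checking each packet
// against the shared-hash log.
type replayLog struct {
	seenHashes map[string]bool     // stand-in for the shared-hash bucket
	batches    map[uint64][]string // stand-in for the batch-replay bucket
}

func newReplayLog() *replayLog {
	return &replayLog{
		seenHashes: make(map[string]bool),
		batches:    make(map[uint64][]string),
	}
}

// putBatch returns the packets in the batch that are replays.
func (r *replayLog) putBatch(id uint64, packets []string) []string {
	// If this batch was processed before, return the stored result.
	if replays, ok := r.batches[id]; ok {
		return replays
	}

	var replays []string
	for _, p := range packets {
		if r.seenHashes[p] {
			replays = append(replays, p)
		}
		r.seenHashes[p] = true
	}
	r.batches[id] = replays
	return replays
}

func main() {
	rl := newReplayLog()
	rl.seenHashes["b"] = true // b was already seen in an earlier batch

	fmt.Println(rl.putBatch(1, []string{"a", "b"})) // [b]
	fmt.Println(rl.putBatch(1, []string{"a", "b"})) // [b] again, not [a b]
}
```

Without the batches map, the second invocation would also report a as a replay, which is exactly the problem described in the comment above.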
I have 612,489 pairs in batch-replay and 31,389 pairs in shared-hash.

I'm going to change the title to indicate this is a bug.
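For anyone who wants to reproduce such counts on their own node, something along these lines should work with bbolt (run it against a copy of sphinxreplay.db, never the live file; the file name below is an assumption):

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// Counts the key/value pairs in each top-level bucket of a copied
// sphinxreplay.db, similar to the numbers quoted above.
func main() {
	db, err := bolt.Open("sphinxreplay.db.copy", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
			fmt.Printf("%s: %d pairs\n", name, b.Stats().KeyN)
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```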
The Sphinx replay DB is lnd's implementation of the ReplayLog defined in the sphinx package. It keeps track of onion packets to make sure they aren't used in a replay attack. Each packet is stored with its CLTV and is garbage collected as soon as the CLTV has expired. That means the DB is subject to a lot of reads and writes and therefore profits from regular compaction. Can you try compacting it and see whether it still stays as large as it currently is?
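If I recall correctly, lnd can compact its bolt-backed databases on startup via the db.bolt.auto-compact options; please verify the option names against the documentation of your lnd version, and whether this covers sphinxreplay.db in your setup. An lnd.conf sketch:

```
; in lnd.conf (option names as of lnd 0.15.x, please verify):
db.bolt.auto-compact=true
; only compact a DB file if it wasn't compacted within the last 7 days
db.bolt.auto-compact-min-age=168h
```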
Maybe another option (after compaction) would be to take a look at it with boltbrowser (shut down lnd first, create a copy of the DB file, start lnd again, inspect the copy) and check whether there are expired hashes in there? Or maybe there's some data stored in there, in addition to the hashes themselves, that isn't properly garbage collected? I'll take a look at my DB as well to see if I can identify anything.