lnd: Postgres replication discrepancy after filesystem failure, choose correct dump to restore
Background
My btrfs RAID1 filesystem (with DB replication to a third SSD) is failing. The node has been shut down after disabling forwards and making sure there were no pending HTLCs; I am waiting for SSD replacements to arrive. The dumps of the main Postgres instance and of the replica are slightly different: which one should I restore on the new node?
This is the output of diff:
diff -u /replica/sqldump/replica.dump /replica/sqldump/main.dump
--- /replica/sqldump/replica.dump
+++ /replica/sqldump/main.dump
@@ -982092,7 +982092,7 @@
-- Name: channeldb_kv_id_seq; Type: SEQUENCE SET; Schema: public; Owner: bitcoin
--
-SELECT pg_catalog.setval('public.channeldb_kv_id_seq', 15758297, true);
+SELECT pg_catalog.setval('public.channeldb_kv_id_seq', 15758280, true);
--
@@ -982106,7 +982106,7 @@
-- Name: macaroondb_kv_id_seq; Type: SEQUENCE SET; Schema: public; Owner: bitcoin
--
-SELECT pg_catalog.setval('public.macaroondb_kv_id_seq', 33, true);
+SELECT pg_catalog.setval('public.macaroondb_kv_id_seq', 3, true);
--
@@ -982127,7 +982127,7 @@
-- Name: walletdb_kv_id_seq; Type: SEQUENCE SET; Schema: public; Owner: bitcoin
--
-SELECT pg_catalog.setval('public.walletdb_kv_id_seq', 20948, true);
+SELECT pg_catalog.setval('public.walletdb_kv_id_seq', 20919, true);
In the above, the DB replica appears to be ahead of the main instance with regard to the sequence counters. To be noted: the main instance resided on the failing btrfs RAID1 filesystem, while the replicating instance, mounted on /replica, resides on a third ext4 SSD that shows no filesystem failures.
When restoring the node on a new btrfs RAID, should I import the dump from the main instance, or from the replica instance (with higher counters, hence apparently "more updated")?
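For context, this is roughly how the dumps were produced and how a restore would look; a minimal sketch, assuming plain-format pg_dump (which the SQL in the diff implies), a database named lnd, user bitcoin, and the main/replica instances listening on ports 5432/5433 (all of these are assumptions):

# Dump both instances in plain SQL format.
pg_dump -U bitcoin -p 5432 -d lnd -f /replica/sqldump/main.dump
pg_dump -U bitcoin -p 5433 -d lnd -f /replica/sqldump/replica.dump

# Compare the two dumps.
diff -u /replica/sqldump/replica.dump /replica/sqldump/main.dump

# Restore the chosen dump into a freshly created database on the new node.
createdb -U bitcoin lnd
psql -U bitcoin -d lnd -f /replica/sqldump/main.dump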
Your environment
- version of lnd: 14.2
- which operating system (uname -a on *Nix): raspios 64bit
- version of btcd, bitcoind, or other backend: 22
- any other relevant environment details: //
Steps to reproduce
I wish you not to reproduce this; it will give you bad headaches.
Expected behaviour
Both dumps are identical
Actual behaviour
Slight differences between the two dumps.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 16 (8 by maintainers)
To complete this thread, which I will be closing at the same time: my node was restored today from the main PostgreSQL instance's dump, and no force closures happened; I now have synchronous replication in place.
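For reference, synchronous replication is enabled on the primary with settings along these lines; a minimal sketch of postgresql.conf, where the standby name replica1 is an assumption and must match the application_name the standby sends in its primary_conninfo:

# postgresql.conf on the primary (standby name "replica1" is hypothetical):
synchronous_standby_names = 'replica1'
synchronous_commit = on    # commits wait until the standby has flushed the WAL

# On the standby, primary_conninfo must carry the matching name, e.g.:
# primary_conninfo = 'host=primary port=5432 user=replicator application_name=replica1'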
I saw that the difference in your sequence numbers is indeed < 32. If that is because of pre-allocation, the difference should be inconsequential and it shouldn't matter which copy you restore. But I am no expert on postgres internals.
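To illustrate one form of pre-allocation that produces exactly this symptom, here is a minimal SQL sketch (the sequence name and cache size are hypothetical, not lnd's actual schema):

-- A sequence with CACHE 32 hands each session a block of 32 values;
-- unused cached values are discarded when the session ends or crashes,
-- so two otherwise identical databases can report last values that
-- differ by up to 31.
CREATE SEQUENCE demo_seq CACHE 32;
SELECT nextval('demo_seq');  -- returns 1; this session now caches 1..32
-- A new session's first nextval returns 33, skipping the unused 2..32.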
I will take care of that; it's a one-liner in a doc that I already PRed once, after all 😃
No, if your node force-closes with an old state, your counterparty has the revocation key for the time-locked funds and can take your balance as well.
lnd will do its best not to force-close if it detects (i.e., is informed by the remote node) that there is a channel state mismatch. But you shouldn't rely on that alone; things can still go wrong.