lnd: wtclient: "tower not found" on startup leads to a shut down of LND.
Background
I added a watchtower some time ago and this tower is not reachable anymore. As soon as I start LND with wtclient.active=true I run into the following issue:
lnd | 2021-12-19 10:58:42.656 [ERR] LTND: Shutting down because error in main method: unable to create server: tower not found
lnd | 2021-12-19 10:58:42.656 [ERR] LTND: error stopping tor controller: invalid arguments: unexpected code
lnd | unable to create server: tower not found
lnd | 2021-12-19 10:58:42.679 [INF] LTND: Shutdown complete lnd
lnd | 2021-12-19 10:58:43.394 [INF] LTND: Version: 0.14.1-beta commit=v0.14.1-beta, build=production, logging=default, debuglevel=info
If I disable the wtclient I can’t add new watchtowers or remove the current one. So I’m stuck in this situation and cannot activate the wtclient. Is there anything I can do in this situation?
Your environment
lnd: v0.14.1-beta
OS: Linux 1fb9819b8e0b 5.4.0-91-generic #102-Ubuntu
bitcoind: * version of btcd, bitcoind, or other backend
Steps to reproduce
- Add a watchtower via wtclient.
- Turn off the watchtower
- Restart LND
Expected behaviour
Actual behaviour
LND cannot start because it no longer finds the tower. There is no way to add a new tower or remove the existing one, because lnd shuts down right after starting up.
About this issue
- State: closed
- Created 3 years ago
- Comments: 22 (2 by maintainers)
Hey Y’all! I just want to give some context about the PR I have opened for this issue: what exactly it does and how it should affect the users experiencing this issue. The issue is twofold in my mind: 1) we need to fix the bug that caused this state, and 2) we need to help users who are currently in this state to recover.
The issue
The issue is that users are unable to start up their LND nodes due to an error ("tower not found") being thrown during the construction of the watchtower client. On start-up of the client, all the previously created sessions are first loaded into memory. Each session references the tower that it was created with. The constructor uses the list of sessions to extract a list of towers that we have used in the past. This list is then used to query the DB for the towers. If no tower with the given ID (i.e., the one referenced by the session) is found in the DB, then this error is thrown.
To me, this issue screams race condition. The only scenario that I could think of that would cause this issue is that somewhere, we are removing a tower just before inserting the first session for that tower. The code forbids us from removing a tower if there are any existing sessions for that tower in the db, which is why I say that this could only happen for the tower's first session.
What the PR does
ClientSession with its Tower. As the code currently stands, on creation of a new ClientSession, there is no check that the Tower that the session refers to actually exists in the db. This is fixed in the PR.
Other thoughts
While I hope that part 1 of the PR prevents this issue from happening again, I think there might still be some other race condition happening somewhere. I think this because users who ran into this issue have said that they never tried to remove the tower in the past, which would debunk my theory that this happens when a user tries to remove a tower just as the first session with that tower is being created.
Yaaaaaaaaayyy!!! Great team work 🚀
Thanks for the ping, excELLEnce ! I can confirm wtclient is working again. It has been a pleasure to be part of this “bug report”
Hi @reivanen - the RC is out if you wanna give it a go 😃
Thanks for your exemplary fast work, I’m looking forward to testing the next rc!
@reivanen - the PR with the fix has been merged and will be included in the next rc
Managed to reproduce & to put together a fix. PR incoming. Thanks again @reivanen!
I had a problem of being unable to remove a watchtower, so I updated to 0.16-rc2 to fix it (https://github.com/lightningnetwork/lnd/pull/6972). Instead of only being unable to remove the watchtower, I am now unable to start lnd at all, with the error from this bug report.
A session is created with a tower in order to send a certain number of updates (I think the default is 1024 updates per session).
Because we need to know which sessions we can continue using to send more updates. We also need to keep track of which updates we have successfully backed up.
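As a rough illustration of why that bookkeeping matters, here is a small Go sketch. The Session type and its fields are hypothetical and heavily simplified, and the 1024 capacity is just the guess from the comment above, not a confirmed default.

```go
package main

import "fmt"

// Session is a hypothetical, simplified model of a watchtower client session.
type Session struct {
	ID         string
	MaxUpdates uint16 // how many updates the session may carry
	SeqNum     uint16 // sequence number of the last update sent
	LastAcked  uint16 // highest sequence number the tower has acknowledged
}

// Exhausted reports whether the session has no capacity left.
func (s *Session) Exhausted() bool { return s.SeqNum >= s.MaxUpdates }

// NextSeqNum reserves the next update slot, or returns false if the session is full.
func (s *Session) NextSeqNum() (uint16, bool) {
	if s.Exhausted() {
		return 0, false
	}
	s.SeqNum++
	return s.SeqNum, true
}

// Ack records that the tower has backed up everything up to seq.
func (s *Session) Ack(seq uint16) {
	if seq > s.LastAcked && seq <= s.SeqNum {
		s.LastAcked = seq
	}
}

// Unacked is the number of sent updates not yet confirmed by the tower.
func (s *Session) Unacked() uint16 { return s.SeqNum - s.LastAcked }

// usable filters out sessions that cannot accept more updates.
func usable(sessions []*Session) []*Session {
	var out []*Session
	for _, s := range sessions {
		if !s.Exhausted() {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	full := &Session{ID: "full", MaxUpdates: 1024, SeqNum: 1024, LastAcked: 1024}
	open := &Session{ID: "open", MaxUpdates: 1024}

	if seq, ok := open.NextSeqNum(); ok {
		open.Ack(seq)
	}
	for _, s := range usable([]*Session{full, open}) {
		fmt.Printf("%s: sent=%d acked=%d unacked=%d\n", s.ID, s.SeqNum, s.LastAcked, s.Unacked())
	}
}
```

Because every session keeps pointing at the tower it was negotiated with, the client has to be able to resolve that tower again on restart, which is where the "tower not found" error comes from when the tower record is gone.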
I solved the problem with a workaround: rename the .lnd/data/graph/mainnet/wtclient.db file to .lnd/data/graph/mainnet/wtclient.db.old. Once that was done, I started the lnd service and it worked.
The log should give more information to help diagnose the error, and there should be a command or option to force startup even if the watchtower is not responding.
Thank you so much for explaining your process so well @ellemouton
Same here, I just restarted my node. Tower has been added through CLI. Problem occurs after a restart.
I can confirm this workaround by @decentralizedb