lnd: wtclient: "tower not found" on startup leads to a shut down of LND.

Background

I added a watchtower some time ago and this tower is not reachable anymore. As soon as I start LND with wtclient.active=true I run into the following issue:

lnd | 2021-12-19 10:58:42.656 [ERR] LTND: Shutting down because error in main method: unable to create server: tower not found
lnd | 2021-12-19 10:58:42.656 [ERR] LTND: error stopping tor controller: invalid arguments: unexpected code 
lnd | unable to create server: tower not found 
lnd | 2021-12-19 10:58:42.679 [INF] LTND: Shutdown complete lnd 
lnd | 2021-12-19 10:58:43.394 [INF] LTND: Version: 0.14.1-beta commit=v0.14.1-beta, build=production, logging=default, debuglevel=info

If I disable the wtclient I can’t add new watchtowers or remove the current one. So I’m stuck in this situation and cannot activate the wtclient. Is there anything I can do in this situation?

Your environment

lnd: v0.14.1-beta
OS: Linux 1fb9819b8e0b 5.4.0-91-generic #102-Ubuntu
bitcoind: * version of btcd, bitcoind, or other backend

Steps to reproduce

  1. Add a watchtower via wtclient.
  2. Turn off the watchtower
  3. Restart LND

Expected behaviour

Actual behaviour

LND cannot start because it no longer finds the tower. There is no way to add a new tower or remove the existing one, because lnd shuts down right after starting up.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 22 (2 by maintainers)

Most upvoted comments

Hey y’all! I just want to give some context about the PR I have opened for this issue: what exactly it does and how it should affect the users experiencing this issue. The issue is twofold in my mind, because we need to 1) fix the bug that caused this state and 2) help users who are currently in this state to recover.

The issue

The issue is that users are unable to start up their LND nodes due to an error ("tower not found") being thrown during the construction of the watchtower client. On start-up, the client first loads all the previously created sessions into memory. Each session references the tower that it was created with. The constructor uses the list of sessions to extract a list of the towers that we have used in the past. This list is then used to query the DB for the towers. If no tower with the given ID (i.e., the one referenced by the session) is found in the DB, then this error is thrown.
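To make the failure mode concrete, here is a minimal Python model of the start-up order described above. It is not lnd's Go code: the function name, the dictionary layout, and the example tower address are all made up for illustration. Only the sessions-first lookup order and the "tower not found" error come from the description.

```python
class TowerNotFoundError(Exception):
    """Stand-in for lnd's "tower not found" error (hypothetical class)."""


def load_towers_from_sessions(sessions, towers_by_id):
    """Sessions-first start-up: collect tower IDs from sessions, then look each one up."""
    tower_ids = []
    for session in sessions:
        if session["tower_id"] not in tower_ids:
            tower_ids.append(session["tower_id"])

    towers = []
    for tower_id in tower_ids:
        if tower_id not in towers_by_id:
            # A session points at a tower that has no record: start-up aborts.
            raise TowerNotFoundError(f"tower not found: id={tower_id}")
        towers.append(towers_by_id[tower_id])
    return towers


# Hypothetical DB contents: session 2 references tower 7, which has no record.
sessions = [{"id": 1, "tower_id": 3}, {"id": 2, "tower_id": 7}]
towers_by_id = {3: {"id": 3, "addr": "tower-a.example:9911"}}

try:
    load_towers_from_sessions(sessions, towers_by_id)
    startup_error = None
except TowerNotFoundError as err:
    startup_error = str(err)
```

In this model a single orphaned session is enough to abort the whole start-up, which matches the behaviour reported in this issue.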

To me, this issue screams race condition. The only scenario that I could think of that would cause this issue is that somewhere, we are removing a tower just before inserting the first session for that tower. The code forbids us from removing a tower if there are any existing sessions for that tower in the DB, which is why I say that this could only happen for the tower's first session.
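The suspected interleaving can be replayed sequentially in a toy model. This is only a sketch of the theory above, not lnd's implementation: the `ClientDB` class and its method names are hypothetical, and the "race" is simulated by calling the steps in the unlucky order.

```python
class ClientDB:
    """Toy in-memory stand-in for the watchtower client database (hypothetical)."""

    def __init__(self):
        self.towers = {}
        self.sessions = {}

    def add_tower(self, tower_id):
        self.towers[tower_id] = {"id": tower_id}

    def remove_tower(self, tower_id):
        # Removal is refused while any session still references the tower.
        if any(s["tower_id"] == tower_id for s in self.sessions.values()):
            raise ValueError("cannot remove tower with existing sessions")
        self.towers.pop(tower_id, None)

    def create_session_unchecked(self, session_id, tower_id):
        # No check that the tower still exists: this is the suspected gap.
        self.sessions[session_id] = {"id": session_id, "tower_id": tower_id}

    def orphaned_sessions(self):
        return [sid for sid, s in self.sessions.items()
                if s["tower_id"] not in self.towers]


db = ClientDB()
db.add_tower(7)
chosen_tower = 7                              # session negotiation picks tower 7
db.remove_tower(chosen_tower)                 # allowed: no sessions exist yet
db.create_session_unchecked(1, chosen_tower)  # first session lands afterwards
orphans = db.orphaned_sessions()              # session 1 now has no tower
```

Once a session exists, `remove_tower` refuses to run, so in this model only a tower's first session can end up orphaned.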

What the PR does

  1. The first thing the PR does is to more closely couple the ClientSession with its Tower. As the code currently stands, on creation of a new ClientSession, there is no check that the Tower that the session refers to actually exists in the db. This is fixed in the PR.
  2. The second thing that the PR does is that instead of the constructor first reading all the sessions from the DB and then using those to gather a list of towers, it now instead loads all the towers first and then, for each tower, fetches the sessions we have with that tower. This should help users who are in this “can’t start due to tower-not-found” state to start up again.
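Both changes can be sketched in the same toy style. This is an illustrative Python model under stated assumptions, not the PR's Go code: `create_session`, `load_towers_then_sessions`, and the dictionary-based DB are hypothetical. The assumption that an orphaned session is simply never visited follows from the towers-first order described in point 2.

```python
def create_session(db, session_id, tower_id):
    """Change 1: refuse to create a session for a tower that is not in the DB."""
    if tower_id not in db["towers"]:
        raise KeyError(f"tower {tower_id} not found, session not created")
    db["sessions"][session_id] = {"id": session_id, "tower_id": tower_id}


def load_towers_then_sessions(db):
    """Change 2: iterate over towers first, then fetch each tower's sessions."""
    loaded = {}
    for tower_id in db["towers"]:
        loaded[tower_id] = [
            s for s in db["sessions"].values() if s["tower_id"] == tower_id
        ]
    return loaded


# Hypothetical DB already in the broken state: session 2 references missing tower 7.
db = {
    "towers": {3: {"id": 3}},
    "sessions": {
        1: {"id": 1, "tower_id": 3},
        2: {"id": 2, "tower_id": 7},
    },
}

loaded = load_towers_then_sessions(db)  # start-up no longer trips over session 2

try:
    create_session(db, 3, 7)            # a new orphan can no longer be created
    rejected = False
except KeyError:
    rejected = True
```

Because the loop is driven by the tower list, a session whose tower is missing is never looked up, so start-up proceeds with the towers that do exist.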

Other thoughts

While I hope that part 1 of the PR prevents this issue from happening again, I think there might still be a possibility of some other race condition happening somewhere. I think this because users who ran into this issue have said that they never tried to remove the tower in the past - which would debunk my theory that this issue happens when a user tries to remove a tower just as the first session with that tower is being created.

Yaaaaaaaaayyy!!! Great team work 🚀

Hi @reivanen - the RC is out if you wanna give it a go 😃

Thanks for the ping, excELLEnce ! I can confirm wtclient is working again. It has been a pleasure to be part of this “bug report”

Thanks for your exemplary fast work, I’m looking forward to testing the next RC!

@reivanen - the PR with the fix has been merged and will be included in the next rc

Managed to reproduce & to put together a fix. PR incoming. Thanks again @reivanen!

I had a problem of being unable to remove a watchtower, so I updated to 0.16-rc2 to get the fix (https://github.com/lightningnetwork/lnd/pull/6972). Instead of only being unable to remove the watchtower, I am now unable to start lnd, with the error from this bug report.

2023-03-08 08:24:41.466 [ERR] LTND: Shutting down because error in main method: unable to create server: tower not found
2023-03-08 08:24:41.470 [INF] LTND: Shutdown complete

unable to create server: tower not found

Is session a persistent value, which remains the same as long as a client keeps backing up to a Tower? Or every time an update is posted to the Tower a new session value is created? If the latter, your approach to load the towers first, appears sound to me.

A session is created with a tower in order to send a certain number of updates (I think the default is 1024 updates per session).

It would be good to understand session <> tower relationship and why is there a need to maintain session values in DB?

Because we need to know which sessions we can continue using to send more updates. We also need to keep track of which updates we have successfully backed up.
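A rough sketch of why that state has to be persisted, assuming a session is a fixed budget of sequence-numbered updates. The `Session` class and its field names are hypothetical, and the 1024 default is only the figure guessed at above.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical model of a client session with a tower."""
    tower_id: int
    max_updates: int = 1024          # figure mentioned above, not verified
    next_seq: int = 1                # sequence number of the next update
    acked: set = field(default_factory=set)

    def has_capacity(self):
        """True while the session can still carry more updates."""
        return self.next_seq <= self.max_updates

    def queue_update(self):
        if not self.has_capacity():
            raise RuntimeError("session exhausted, a new session is needed")
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def ack(self, seq):
        """Record that the tower confirmed the update with this number."""
        self.acked.add(seq)

    def unacked(self):
        """Updates that were sent but not yet confirmed as backed up."""
        return [s for s in range(1, self.next_seq) if s not in self.acked]


session = Session(tower_id=3, max_updates=2)
first = session.queue_update()
second = session.queue_update()
session.ack(first)
```

After a restart, the client would need exactly this information: whether the session still has capacity, and which updates still have to be confirmed.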

I am also having this problem, and the lnd service doesn’t start. Some help would be appreciated: it is serious that the node service cannot start just because lnd is unable to find the WT.

Any help? Could I disable the WT service in the config file, and would that help? I don’t want to do it in case it causes a bigger problem.

lnd v0.14.2-beta
OS: Linux Ubuntu 20.04.4 LTS
bitcoin core 22.0.0

trace error:
[ERR] LTND: Shutting down because error in main method: unable to create server: tower not found
2022-04-19 23:39:48.721 [INF] TORC: Stopping tor controller
2022-04-19 23:39:48.723 [DBG] TORC: removing serviceID: from tor controller
2022-04-19 23:39:48.723 [DBG] TORC: sendCommand:DEL_ONION got err:unexpected code, reply:Bad arguments to DEL_ONION: Need at least 1 argument(s)
2022-04-19 23:39:48.724 [ERR] TORC: DEL_ONION got error: invalid arguments: unexpected code
2022-04-19 23:39:48.724 [ERR] LTND: error stopping tor controller: invalid arguments: unexpected code
2022-04-19 23:39:48.731 [INF] LTND: Shutdown complete

I solved the problem with a workaround: rename the .lnd/data/graph/mainnet/wtclient.db file to .lnd/data/graph/mainnet/wtclient.db.old. Once that was done, I started the lnd service and it worked.
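For anyone who prefers to script the rename, here is a small sketch of the same workaround. The helper name is made up, the path layout is the one quoted above, and it assumes lnd is stopped first. Presumably lnd then starts with a fresh, empty client database, so the previously registered towers and sessions would no longer be known to the client; keep the .old file.

```python
from pathlib import Path


def set_aside_wtclient_db(lnd_dir: Path) -> Path:
    """Rename wtclient.db to wtclient.db.old without overwriting anything."""
    db_path = lnd_dir / "data" / "graph" / "mainnet" / "wtclient.db"
    backup_path = db_path.with_name(db_path.name + ".old")

    if not db_path.exists():
        raise FileNotFoundError(f"no client DB at {db_path}")
    if backup_path.exists():
        # Refuse to clobber an earlier backup.
        raise FileExistsError(f"backup already exists at {backup_path}")

    db_path.rename(backup_path)
    return backup_path


# Example call (run only while lnd is stopped):
# set_aside_wtclient_db(Path.home() / ".lnd")
```

The rename keeps the old file around, so it can be restored or inspected later.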

There should be some more information to be able to determine the error, and some command or option to force the start even if the WT is not responding.

Thank you so much for explaining your process so well @ellemouton

Same here, I just restarted my node. Tower has been added through CLI. Problem occurs after a restart.

I can confirm this workaround by @decentralizedb