lnd: Health check: chain backend failed after 3 calls. How to avoid the shutdown?

Background

My LND node keeps shutting down. It seems to be triggered by a short network failure that prevents the node from reaching bitcoind.

Your environment

  • lnd version 0.11.99-beta commit=v0.11.0-beta-255-g10a84f2c75c9de958c9dd07de830f0546a43d05a
  • which operating system: Debian GNU/Linux 10 4.19.0-8-amd64
  • version of bitcoind: Bitcoin Core version v0.20.1
  • lnd and bitcoind are on different servers

Steps to reproduce

Restart LND. Repros every 1 to 2 days.

Expected behaviour

Just keep running.

Actual behaviour

2020-10-03 02:18:26.258 [INF] NTFN: New block: height=651006, sha=000000000000000000059e4ba6945311b11cf30cea4389ca932ebd0c89f5186c
2020-10-03 02:18:26.258 [INF] UTXN: Attempting to graduate height=651006: num_kids=0, num_babies=0
2020-10-03 02:18:58.221 [INF] CRTR: Processed channels=0 updates=554 nodes=1 in last 59.999815896s

2020-10-03 02:19:24.910 [CRT] SRVR: Health check: chain backend failed after 3 calls  <------------

2020-10-03 02:19:24.910 [INF] SRVR: Sending request for shutdown

Question

How to avoid the shutdown?

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 30 (12 by maintainers)

Most upvoted comments

There seems to be a problem with the health check, some false positives we are investigating. For now, you can disable the health check with --healthcheck.chainbackend.attempts=0.

Bitcoin Core acquires a lock (cs_main) at the start of every interesting RPC call (getbestblockhash as an example). The RPC call “uptime” does not acquire this lock, and it is extremely fast on my machine (without any hiccups).
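To compare the two kinds of call on your own node, a rough sketch along these lines could time `uptime` against `getbestblockhash` over bitcoind's JSON-RPC interface. The helper names (`build_rpc_request`, `timed_call`), the URL, and the credentials are hypothetical placeholders; only the two RPC method names come from the comment above. This is not the code lnd runs.

```python
import base64
import json
import time
import urllib.request


def build_rpc_request(url, user, password, method, params=None):
    """Build a JSON-RPC POST request for bitcoind (hypothetical helper)."""
    payload = json.dumps(
        {"jsonrpc": "1.0", "id": "probe", "method": method, "params": params or []}
    ).encode()
    # bitcoind's RPC server uses HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def timed_call(request, timeout, send=urllib.request.urlopen):
    """Send the request and return (elapsed_seconds, decoded JSON reply)."""
    start = time.monotonic()
    with send(request, timeout=timeout) as response:
        reply = json.loads(response.read())
    return time.monotonic() - start, reply
```

For example, `timed_call(build_rpc_request("http://127.0.0.1:8332/", "rpcuser", "rpcpassword", "uptime"), timeout=30)` returns how long the lock-free call took. Repeating it with `"getbestblockhash"` while the node is busy should show whether the lock is what delays the reply.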

Thanks for the info @C-Otto! We suspected something like this would be the case, but didn’t actually go look in core itself. I’m pretty surprised that it blocks for minutes though! I’ll switch over to a call that doesn’t need the lock (just need one which we have for neutrino/btcd as well, but not critical that it’s the same endpoint imo). Will get this in for 0.12.

I forgot to run make install, and have just verified that “uptime” is used with the current version. Thanks for the quick help 😃

It’s reasonable because bitcoind cannot be reached, and it seems plausible to me: in other kinds of software, I’ve seen client timeouts when there is an I/O issue on the client host.

Sure, but being unreachable for 2 minutes? The code here is also pretty simple: send the request, then wait for it to come back. I don’t think the healthchecks themselves are causing high I/O as it’s just an RPC call, and lnd makes several of these to bitcoind on a normal basis for routing operation.
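The "send the request, then wait for it to come back" loop, repeated up to the configured number of attempts, can be sketched roughly as below. This is a simplified model and not lnd's actual implementation: `run_health_check`, the `check` callable, and the `backoff` pause between attempts are assumptions made for illustration. The one behaviour taken from the thread is that `attempts=0` disables the check.

```python
import time


def run_health_check(check, attempts, timeout, backoff=0.0, sleep=time.sleep):
    """Return True if the backend answered (or the check is disabled).

    `check(timeout)` is a hypothetical callable that raises an exception
    when the backend does not answer within `timeout` seconds.
    """
    if attempts == 0:
        # Mirrors attempts=0 from the thread: the check never runs.
        return True

    for attempt in range(1, attempts + 1):
        try:
            check(timeout)
            return True  # one successful reply is enough
        except Exception:
            # Failed or timed out; pause before retrying unless this was
            # the final attempt.
            if attempt < attempts:
                sleep(backoff)

    # Every attempt failed; the caller would now request a shutdown.
    return False
```

In this model a single slow reply is harmless. A shutdown only follows when every attempt in a row fails, which is why raising either the attempt count or the timeout widens the outage the node can ride out.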

Is there any reason not to have the default at 120 seconds?

@alevchuk no reason not to bump the default. It does seem bizarre to me that the call is taking 10 seconds to complete, so we’re still investigating to make sure nothing is wrong on our side. If we can’t figure anything out, will likely bump the default and see how that goes.

I increased the healthcheck.chainbackend.attempts to 10 and healthcheck.chainbackend.timeout to 30s and will keep an eye on it to see if it happens again
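As a rough guide to what these two settings buy, the longest outage the check can absorb before giving up can be estimated with a simple model. The formula and the `worst_case_seconds` helper are assumptions for illustration, not taken from lnd. The model treats every attempt as running to its full timeout and allows an optional pause between attempts.

```python
def worst_case_seconds(attempts, timeout_s, backoff_s=0):
    """Estimate how long the backend may stay silent before all attempts fail.

    Simplified model: each attempt waits the full timeout, and a pause of
    backoff_s (an assumed parameter) separates consecutive attempts.
    """
    if attempts == 0:
        return None  # check disabled, so no shutdown is ever triggered
    return attempts * timeout_s + (attempts - 1) * backoff_s


# Settings from the comment above: 10 attempts, 30-second timeout.
print(worst_case_seconds(10, 30))
```

Under this model the settings above tolerate about 300 seconds of unresponsiveness, ignoring any pause between attempts.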

Thanks for the report @sendbitcoin! I’d strongly recommend running lnd with debuglevel=HLCK=debug so that we can see the reason your check is failing. The logging was only bumped to info level in 35a2dbc so you may not see the reason otherwise.

I’m getting this issue. lnd has shut down twice already because of this. The backend is Bitcoin Core 0.18.1 on the same computer.

2020-10-08 08:09:51.192 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-10-08 08:09:51.192 [INF] SRVR: Sending request for shutdown
2020-10-08 08:09:51.193 [INF] LTND: Received shutdown request.
2020-10-08 08:09:51.193 [INF] LTND: Shutting down…
2020-10-08 08:09:51.193 [INF] LTND: Gracefully shutting down.