lnd: [bug]: `unable to read message from peer: EOF`, disconnections and routing stops

Background

After some time, my node starts having channels to several peers deactivated, one after the other, only to be reactivated within a few minutes at most. At the same time it becomes unable to route HTLCs: I see no more forwards being notified by the bos bot, and one of my peers who tracks these things reports a spike in pending outgoing HTLCs from their node to mine whenever this happens, which slowly resolve themselves by failing. Restarting lnd solves the issue until the next time it happens.

I couldn’t form a solid hypothesis about why this happens, but here are all the details I can provide, in case you have ideas of your own. I run the sqlite backend and increased the timeout to 10m to avoid SQLITE_BUSY errors. I don’t remember this error happening before, and I am 90% sure it started after I began connecting to more peers than just my direct channel peers, to get gossip updates from the network faster. (This was before I knew about active and passive sync peers, so the many peers I was connecting to were all passive; I later caught up and increased my active sync peers value, but none of this seems to have had any influence on the issue.) Besides the problem appearing after I increased the number of peers I connect to, I noticed that the more peers I have, the sooner it happens. Using persistent connections or not doesn’t appear to change anything. I attached a log for one node, picked among the ones my node detected disconnections from this last time; I had increased the PEER log level to debug and zgrep’d the logs for its pubkey. I have since restored the info log level for everything.
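For context, the relevant settings look roughly like this, shown as startup flags for brevity. This is a sketch from memory: db.backend is how the sqlite backend is selected, but the exact names of the sqlite timeout option and the active sync peers option (db.sqlite.busytimeout and numgraphsyncpeers below), and the value 5, should be verified against lnd --help before relying on them.

  # sketch only; option names quoted from memory, verify with lnd --help
  lnd --db.backend=sqlite \
      --db.sqlite.busytimeout=10m \
      --numgraphsyncpeers=5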

I have disabled, for the time being, my script that connects to extra peers, so that I can report what happens over the coming days.

rocket.log

Your environment

  • version of lnd: 0.16.4
  • operating system (uname -a on *Nix): 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux
  • version of btcd, bitcoind, or other backend: bitcoind 24.1
  • other relevant environment details: sqlite backend, 12-core Xeon with 64 GB of ECC RAM and a six-SSD ZFS mirror pool

Steps to reproduce

Have the sqlite backend (no idea if it is necessary), have an active routing node with forty-something channels, and connect to many extra peers (above 300 makes the mishap happen faster) with lncli connect <pubkey>@<address>:<port>; a sketch of such a connect loop is below.
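For illustration, a minimal sketch of the kind of connect loop involved, assuming a hypothetical peers.txt containing one pubkey@address:port entry per line:

  # peers.txt is a hypothetical list of pubkey@address:port entries, one per line
  while read -r peer; do
      # lncli connect errors out if we are already connected; ignore and continue
      lncli connect "$peer" || true
      sleep 1
  done < peers.txt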

Expected behaviour

lnd continues operating normally, managing forwards like a champ

Actual behaviour

channels are disconnected at random and HTLCs are not processed

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Comments: 16 (2 by maintainers)

Most upvoted comments

“0.17=unsafe to use yet”.

This is an overreaction IMO. There is nothing inherently “unsafe” about the 0.17 release.

I am sorry if I caused any hard feelings; that’s just my MO with new lnd versions, and I now know that several plebs are running it. I will consider 0.17.1 later, though.

It would be great if you could encourage other operators to report those problems here as well, if they were not reported yet. Could you give some goroutine dumps as described here? These files would be helpful:

curl http://localhost:PORT/debug/pprof/goroutine?debug=0 > goroutine_0.prof
curl http://localhost:PORT/debug/pprof/goroutine?debug=2 > goroutine_2.prof

I remember people talking about ungraceful shutdowns in 0.17, and possibly other things, in the plebnet groups, but I cannot pinpoint anything precise; I was just left with the impression that “0.17 = unsafe to use yet”.

Regarding the profiling, I have the port enabled already, but I guess it would be useful only when the issue presents itself again, right?
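For anyone else who wants to capture the same data, a minimal sketch: the dumps are only meaningful while the node is in the stuck state, so enable the profiling port once via the profile option in lnd.conf, restart, and run the curl commands above when the issue reappears (9736 is an arbitrary example port):

  # once, in lnd.conf (then restart lnd):
  #   profile=9736
  # then, while the node is misbehaving:
  curl http://localhost:9736/debug/pprof/goroutine?debug=0 > goroutine_0.prof
  curl http://localhost:9736/debug/pprof/goroutine?debug=2 > goroutine_2.prof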
