nebula: 🐛 BUG: Issues with multiple fixed IP addresses for lighthouse

What version of nebula are you using?

1.7.2

What operating system are you using?

Linux

Describe the Bug

I’m having issues with hosts connecting when they have multiple IPs set for their lighthouse/relay. Most things seem to work fine, but NAT-to-NAT connections across two networks fail when one of my lighthouses has multiple fixed IP addresses:

Just for reference, here’s what the connection looks like:

rsyslog daemon -> NAT -> internet -> NAT -> rsyslog server

The relevant part of my config looks like this:

static_host_map:
  172.0.0.2:
  - xxx.example.com:4242
  - xxx.xxx.245.206:4242
  - xxx.xxx.181.204:4242
  172.0.0.3:
  - xxx.xxx.118.198:4242

The rsyslog daemon on 172.0.3.109 shows this:

xxx.xxx level=info msg="Attempt to relay through hosts" localIndex=2357375276 relays="[172.0.0.2 172.0.0.3 172.0.0.2 172.0.0.3]" remoteIndex=0 vpnIp=172.0.2.116
xxx.xxx level=info msg="Send handshake via relay" localIndex=2357375276 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.2.116
xxx.xxx level=info msg="Send handshake via relay" localIndex=2357375276 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.2.116
xxx.xxx level=info msg="Send handshake via relay" localIndex=2357375276 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.2.116
xxx.xxx level=info msg="Send handshake via relay" localIndex=2357375276 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.2.116
xxx.xxx level=info msg="Handshake timed out" durationNs=3037758038 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2357375276 localIndex=2357375276 remoteIndex=0 udpAddrs="[xxx.xxx.142.18:53561 10.0.0.12:53561]" vpnIp=172.0.2.116

While the rsyslog server host 172.0.2.116 shows this:

xxx.xxx level=info msg="Attempt to relay through hosts" relayIps="[172.0.0.2 172.0.0.3 172.0.0.2 172.0.0.3]" vpnIp=172.0.3.109
xxx.xxx level=info msg="Re-send CreateRelay request" relay=172.0.0.2 vpnIp=172.0.3.109
xxx.xxx level=info msg="Re-send CreateRelay request" relay=172.0.0.3 vpnIp=172.0.3.109
xxx.xxx level=info msg="Re-send CreateRelay request" relay=172.0.0.2 vpnIp=172.0.3.109
xxx.xxx level=info msg="Re-send CreateRelay request" relay=172.0.0.3 vpnIp=172.0.3.109
xxx.xxx level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2890926437 udpAddrs="[xxx.xxx.151.80:52928 xxx.xxx.151.80:65320 192.168.1.214:55504]" vpnIp=172.0.3.109

When I remove the xxx.example.com:4242 and xxx.xxx.181.204:4242 lines from both hosts’ static_host_map entries, the traffic flows.
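
In other words, with the map trimmed to a single fixed address per lighthouse, everything works:

static_host_map:
  172.0.0.2:
  - xxx.xxx.245.206:4242
  172.0.0.3:
  - xxx.xxx.118.198:4242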

Logs from affected hosts

see above

Config files from affected hosts

see above

Most upvoted comments

Sorry for the confusion.

It sounds like you are able to form a handshake, even with this config in place, but that you expect the connection to die eventually?

Yes, that’s what I’m seeing. After running for a few weeks, Nebula on the rsyslog host still seems able to connect out, but no hosts can connect in to the rsyslog server.

If that’s the case, it sounds like the scenario you’re describing now differs from the original issue: a failure to handshake versus a successful handshake with a connection that dies later.

It seems to be the second case: a successful handshake with a connection that dies later. I think the confusion comes from the fact that I’ve seen both cases happen. Let’s wait for logs, and then we can break this down further. I’ll write up a long, detailed post once I have more info for you.

Thanks for your patience!

One more thing: some hosts on Network B could communicate with 172.0.2.116 (my rsyslog server) while others (172.0.3.109, in this case) could not. The hosts that could not were also unreachable from 172.0.2.116. Restarting Nebula on 172.0.2.116 and 172.0.3.109 restored communication for a few minutes.

So host 172.0.2.116 (rsyslog) on Network A was able to communicate with some Network B hosts, but not with 172.0.3.109 (unknown name), until Nebula was restarted on each of these hosts, at which point communication was re-established?

Nebula has 10 slots each for IPv4 and IPv6 addresses for a given host. A known failure scenario occurs when a host has more than 10 IP addresses, most of which are not routable (e.g. a bunch of Docker networks running on the host), and reports them all to the Lighthouse. Nodes that then query the Lighthouse receive no routable IP addresses for that host. You can use local_allow_list to restrict which addresses/adapters are advertised (see the sketch below). The handshake logs should show that all of the udpAddrs are non-routable.
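
For example, here is a minimal local_allow_list sketch that stops Nebula from advertising Docker bridge addresses. It assumes the unwanted interfaces match the name pattern docker.* (interface names in this map are matched as regular expressions); adjust the pattern for your environment:

lighthouse:
  local_allow_list:
    # Don't advertise addresses held by Docker bridge interfaces.
    interfaces:
      'docker.*': false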

Restarting the affected hosts can temporarily solve the problem because they will re-send their IP addresses to the Lighthouse, possibly in a different order.

AFAICT, that issue would not be resolved or affected by having some hosts report to an extra Lighthouse.

This is all speculation without logs. Waiting with bated breath. 😃