lighthouse: Beacon Node: Unable to recover from network fragmentation

Description

The setup is a custom beacon-chain testnet with nodes in two physically distinct network locations [A, B]. Each node has both an ENR supplied via a boot_enr.yaml in the testnet directory and (because peering via ENR alone was not reliable) a multi-address command-line flag.

The ENR file looks like this: gist/d6eea3ea3356e41bde81864143284ce9#file-4-boot_enr-yaml
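
For reference, the boot_enr.yaml in the testnet directory is simply a YAML list of ENR strings, one per boot node. The entries below are shortened placeholders, not the actual contents of the gist:

- "enr:-..."   # boot node A (placeholder)
- "enr:-..."   # boot node B (placeholder)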

The multi-addresses look like this:

--libp2p-addresses /ip4/51.158.190.99/tcp/9000,/ip4/87.180.203.227/tcp/9000

Version

~/.opt/lighthouse master*
❯ git log -n1
commit f6a6de2c5d9e6fe83b6ded24bad93615f98e2163 (HEAD -> master, origin/master, origin/HEAD)
Author: Sacha Saint-Leger <sacha.saint-leger@mail.mcgill.ca>
Date:   Mon Mar 23 09:21:53 2020 +0100

    Become a Validator guides: update (#928)

~/.opt/lighthouse master*
❯ rustc --version
rustc 1.42.0 (b8cedc004 2020-03-09)

Present Behaviour

If the network between A and B becomes fragmented, the nodes do not attempt to reconnect to each other, even though both nodes know about each other from both the ENR and the multi-address flag.

Furthermore, if both A and B run validators, a chain split occurs with two different head slots.

It’s possible to reconnect the nodes by restarting the beacon nodes manually; however, the chains of A and B are then unable to reorganize to the best head, and the peers ban each other.

Mar 23 10:53:27.534 ERRO Disconnecting and banning peer          timeout: 30s, peer_id: PeerId("16Uiu2HAkxE6kBjfoGtSfhSJAE8oib6h3gM972pAj9brmthTHuuP2"), service: network
Mar 23 10:53:27.612 ERRO Disconnecting and banning peer          timeout: 30s, peer_id: PeerId("16Uiu2HAmPz5xFwZfY4CYN6fxf3Yz6LQaDfzCUrA5qwCoTHoCSiNR"), service: network
Mar 23 10:53:38.000 WARN Low peer count                          peer_count: 1, service: slot_notifier

In this case, deleting either A’s or B’s beacon chain and temporarily stopping the validators is the only way to recover from this issue.

Expected Behaviour

The beacon chain node should aggressively try to maintain connections. It should keep trying to resolve ENR and multi-addresses even after a disconnect.

I don’t know how it’s designed, but I imagine:

  1. having higher timeouts to prevent disconnects during short network fragmentation
  2. having repeated connection attempts, even if previous attempts failed, during longer or more severe network fragmentation (a rough sketch of this follows below)
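
As a rough illustration of point 2, a retry loop with capped exponential backoff could look something like the sketch below. This is not Lighthouse’s actual networking code; dial_with_backoff and the dial closure are hypothetical names used only to show the idea:

use std::thread;
use std::time::Duration;

// Hypothetical sketch: keep redialling a known peer address with capped
// exponential backoff instead of giving up after the first failed attempt.
fn dial_with_backoff(addr: &str, mut dial: impl FnMut(&str) -> Result<(), String>) {
    let mut backoff = Duration::from_secs(1);
    loop {
        match dial(addr) {
            // Connected; a real node would re-enter this loop on disconnect.
            Ok(()) => return,
            Err(e) => {
                eprintln!("dial {} failed: {}; retrying in {:?}", addr, e, backoff);
                thread::sleep(backoff);
                // Cap the backoff so even long partitions keep being retried.
                backoff = (backoff * 2).min(Duration::from_secs(60));
            }
        }
    }
}

fn main() {
    // Illustrative only: the "dial" fails twice, then succeeds.
    let mut attempts = 0;
    dial_with_backoff("/ip4/51.158.190.99/tcp/9000", |_| {
        attempts += 1;
        if attempts < 3 { Err("unreachable".to_string()) } else { Ok(()) }
    });
    println!("connected after {} attempts", attempts);
}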

I don’t understand consensus well enough to suggest how to handle the reorganization, but a more stable network would certainly help here.

Steps to resolve

Manual restart including reset of the beacon chain data directory.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 31 (31 by maintainers)

Most upvoted comments

I can no longer observe this. Networking seems to have improved in that regard. Happy to close this.

Ok. Thanks for investigating this. My testnet is now in a state where I can no longer recover it. So full success 🎉 😄

I will create a new testnet soon with better distribution of the validators but will keep the bootnodes running for a couple more days. Let me know if there’s anything else I can do to help.

I’m not aware of a different genesis root. I’ll upload the config to Github so we are all on the same genesis: https://github.com/goerli/schlesi

Sorry, this was a typo; I should have said “genesis delay”, not “genesis root”. Since I didn’t originally set the genesis delay to 160000, my genesis state had a different genesis time.

Edit: having second thoughts about the genesis delay. How does this even work? If each node picks a random time between MIN_GENESIS_DELAY and MIN_GENESIS_DELAY * 2, how can you be certain about the correct genesis event?

Instead of selecting randomly, we do this:

genesis_time = eth1_timestamp - eth1_timestamp % MIN_GENESIS_DELAY + 2 * MIN_GENESIS_DELAY

So, if MIN_GENESIS_DELAY is 24hrs, we wait until the next midnight (UTC) that is at least 24hrs away. So an eth1 timestamp at midday Tuesday ends up with an eth2 genesis on Wednesday/Thursday midnight.
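
To make the rounding concrete, here is a worked example with illustrative numbers, assuming MIN_GENESIS_DELAY = 86400 seconds (24 hours) and an eth1 timestamp at midday Tuesday, 2020-03-24 12:00:00 UTC:

eth1_timestamp = 1585051200                        (Tue 2020-03-24 12:00:00 UTC)
genesis_time   = 1585051200 - (1585051200 % 86400) + 2 * 86400
               = 1585008000 + 172800
               = 1585180800                        (Thu 2020-03-26 00:00:00 UTC)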

I added a Teku node, which is able to keep the network connected between the local and remote hosts even after the network fragmentation, like this:

+--------------------------+          +----------------+                     
| A1 |   Lighthouse        |          |      /-  B1    | Lighthouse          
|  |  \  (2 peers)         |          |    /-     |    | (2 peers)           
|  |  |                    |          |  /-       |    |                     
|  |   \                   |          |/-         |    |                     
|  /    \                  |         /-           /    |                     
| |      \                 |       /- |          |     |                     
| |      |                 |     /-   |          |     |                     
| |       \                |   /-     |          |     |                     
| |   Lighthouse           | /-       |      --- B2    | Lighthouse          
| A2  (2 peers)            /-         |  ---/          | (2 peers)           
|  --       \            /-|         ---/              |                     
|    \-      \         /-  |     ---/ |                |                     
|      \-     \      /-    | ---/     |                |                     
|        \-   |    /-    ---/         |                |                     
|          \-  \ --  ---/  |          |                |                     
|            \- C1--/      |          |                |                     
|                          |          |                |                     
|             Teku         |          |                |                     
|            (4 peers)     |          |                |                     
|                          |          |                |                     
|                          |          |                |                     
|  localhost               |          |  remotehost    |                     
+--------------------------+          +----------------+                     


Thanks for taking the time to elaborate. I used --discovery-address and extracted the correct ENR for the boot node records. Magically, the fork on node B reorganized now. I consider this resolved.

Thanks.