lighthouse: Beacon Node: Unable to recover from network fragmentation
Description
Given is a custom beacon-chain testnet with nodes in physically distinct network locations [A, B]. The nodes know about each other both through an ENR file supplied in the testnet directory as boot_enr.yaml and (because discovery through ENR alone did not establish connectivity reliably) through a multi-address command-line flag.
The ENR file looks like this: gist/d6eea3ea3356e41bde81864143284ce9#file-4-boot_enr-yaml
The multi-addresses look like this:
--libp2p-addresses /ip4/51.158.190.99/tcp/9000,/ip4/87.180.203.227/tcp/9000
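For context, the boot_enr.yaml in the testnet directory is just a YAML list of ENR strings; the entries below are placeholders, the real records are in the gist linked above:

```yaml
# Placeholders only; the actual records are in the linked gist.
- "enr:-<node A record>"
- "enr:-<node B record>"
```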
Version
~/.opt/lighthouse master*
❯ git log -n1
commit f6a6de2c5d9e6fe83b6ded24bad93615f98e2163 (HEAD -> master, origin/master, origin/HEAD)
Author: Sacha Saint-Leger <sacha.saint-leger@mail.mcgill.ca>
Date: Mon Mar 23 09:21:53 2020 +0100
Become a Validator guides: update (#928)
~/.opt/lighthouse master*
❯ rustc --version
rustc 1.42.0 (b8cedc004 2020-03-09)
Present Behaviour
When the network between A and B fragments, the nodes do not attempt to reconnect to each other, even though both nodes know about each other from both the ENR file and the multi-address flag.
Furthermore, if both A and B run validators, a chain split occurs with two different head slots.
It’s possible to reconnect the nodes by manually restarting the beacon nodes; however, the chains of A and B are then unable to reorganize to the best head, and the peers ban each other:
Mar 23 10:53:27.534 ERRO Disconnecting and banning peer timeout: 30s, peer_id: PeerId("16Uiu2HAkxE6kBjfoGtSfhSJAE8oib6h3gM972pAj9brmthTHuuP2"), service: network
Mar 23 10:53:27.612 ERRO Disconnecting and banning peer timeout: 30s, peer_id: PeerId("16Uiu2HAmPz5xFwZfY4CYN6fxf3Yz6LQaDfzCUrA5qwCoTHoCSiNR"), service: network
Mar 23 10:53:38.000 WARN Low peer count peer_count: 1, service: slot_notifier
In this case, deleting either A’s or B’s beacon chain database and temporarily stopping the validators is the only way to recover from this issue.
Expected Behaviour
The beacon chain node should aggressively try to maintain connections. It should keep trying to resolve ENR and multi-addresses even after a disconnect.
I don’t know how it’s designed, but I imagine:
- having higher timeouts to prevent disconnects in case of short network fragmentation
- having repeated connection attempts, even if previous attempts failed, in case of longer or more severe network fragmentation (a rough sketch follows below)
I don’t understand consensus well enough to suggest how to handle the reorganization, but a more stable network would certainly help here.
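To make the second point concrete, here is a minimal, self-contained sketch of the retry loop I have in mind. It is not Lighthouse’s actual dialing code; try_dial is just a placeholder for a real libp2p dial attempt, and the backoff values are illustrative:

```rust
use std::cmp::min;
use std::thread::sleep;
use std::time::Duration;

// Placeholder for a real libp2p dial attempt; always fails in this sketch.
fn try_dial(addr: &str) -> bool {
    println!("dialing {} ...", addr);
    false
}

// Keep redialing a known peer with capped exponential backoff instead of
// giving up after the first failure. A real node would retry bootnodes and
// statically configured peers indefinitely; max_attempts only keeps this
// demo finite.
fn keep_dialing(addr: &str, max_attempts: u32) {
    let mut backoff = Duration::from_secs(5);
    let cap = Duration::from_secs(300);
    for attempt in 1..=max_attempts {
        if try_dial(addr) {
            println!("connected to {}", addr);
            return;
        }
        println!("attempt {} failed, retrying in {:?}", attempt, backoff);
        sleep(backoff);
        backoff = min(backoff * 2, cap);
    }
    println!("gave up on {} after {} attempts", addr, max_attempts);
}

fn main() {
    keep_dialing("/ip4/51.158.190.99/tcp/9000", 5);
}
```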
Steps to resolve
Manual restart including reset of the beacon chain data directory.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 31 (31 by maintainers)
I can no longer observe this. Networking seems to have improved in that regard. Happy to close this.
Ok. Thanks for investigating this. My testnet is now in a state where I can no longer recover it. So full success 🎉 😄
I will create a new testnet soon with better distribution of the validators but will keep the bootnodes running for a couple more days. Let me know if there’s anything else I can do to help.
Sorry, this was a typo, I should have said “genesis delay” not “genesis root”. Since I didn’t originally set the genesis delay to 160000, my genesis state had a different genesis time.
Instead of selecting randomly, we do this: if MIN_GENESIS_DELAY is 24hrs, we wait until the next midnight (UTC) that is at least 24hrs away. So an eth1 timestamp at midday Tuesday ends up with an eth2 genesis on Wednesday/Thursday midnight.
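Roughly, the rule described above can be sketched like this (my own illustration, not the actual Lighthouse implementation; MIN_GENESIS_DELAY and the example timestamp are illustrative):

```rust
const SECONDS_PER_DAY: u64 = 86_400;
// Illustrative value: 24 hours.
const MIN_GENESIS_DELAY: u64 = 86_400;

// Pick the next UTC midnight that is at least MIN_GENESIS_DELAY seconds
// after the eth1 timestamp (Unix timestamps roll over at midnight UTC).
fn genesis_time(eth1_timestamp: u64) -> u64 {
    let earliest = eth1_timestamp + MIN_GENESIS_DELAY;
    earliest + (SECONDS_PER_DAY - earliest % SECONDS_PER_DAY) % SECONDS_PER_DAY
}

fn main() {
    // Midday Tuesday 2020-03-24 12:00 UTC -> genesis at Wednesday/Thursday
    // midnight, 2020-03-26 00:00 UTC.
    let eth1 = 1_585_051_200;
    assert_eq!(genesis_time(eth1), 1_585_180_800);
    println!("genesis at {}", genesis_time(eth1));
}
```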
I added a Teku node which is able to keep the network connected between the local and remote host even after the network fragmentation, like this:
I’m not aware of a different genesis root. I’ll upload the config to Github so we are all on the same genesis: https://github.com/goerli/schlesi
Edit, having second thoughts about the genesis delay. How does this even work? If each node picks a random time between MIN_GENESIS_DELAY and MIN_GENESIS_DELAY * 2, how can you be certain about the correct genesis event?
Thanks for taking the time to elaborate. I used --discovery-address and extracted the correct ENR for the boot node records. Magically, the fork on node B reorganized now. I consider this resolved. Thanks.