diem: [Bug] Need to randomize which VFNs public fullnodes connect to
🐛 Bug
My public fullnode has been running on premainnet since Saturday but has not been able to sync yet because it is unable to reach any of the 3 peers it’s trying to connect to. This set of 3 peers has persisted across restarts and a data wipe.
To reproduce
Try to start up a public fullnode on premainnet
Stack trace/error message
ERROR 2020-08-03 17:12:40 network/src/peer_manager/mod.rs:945 [Public,full_node,704c368e] Error dialing Peer 027bcd02 at /dns4/a618747270806470489784d397d672f5-a4ce02b3daf7fc56.elb.us-west-2.amazonaws.com/tcp/6182/ln-noise-ik/c890408bf3ad1bc13a7dd80d31d0d58c9a7372a32639a5f390e41653ef905b66/ln-handshake/0
ERROR 2020-08-03 17:13:25 network/src/peer_manager/mod.rs:945 [Public,full_node,704c368e] Error dialing Peer 1bb11a02 at /ip4/52.190.44.228/tcp/6182/ln-noise-ik/1e4a0c2439331956b0473fc56d257739ba62977bfdeb908d391b5ac75e706756/ln-handshake/0
ERROR 2020-08-03 17:13:25 network/src/peer_manager/mod.rs:945 [Public,full_node,704c368e] Error dialing Peer 30ca4f96 at /ip4/34.69.223.136/tcp/6182/ln-noise-ik/3b9ba1f7f5b5b1fe0c7c0f3824bf60dc0ab5093835d8a31bc8a62af2fc0a6107/ln-handshake/0
Expected Behavior
The three peers should be chosen at random at service start, and ideally a new set of random peers should be chosen at regular intervals.
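For illustration, the requested behavior could be sketched roughly as below. This is not the connectivity manager's actual code: `XorShift64`, `seed_from_clock`, and `choose_peers` are hypothetical names, and the tiny generator is only there to keep the sketch free of external crates (a real node would use a proper RNG).

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Minimal xorshift64 generator (hypothetical helper, not cryptographically secure).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must never start from an all-zero state.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Index in `0..bound`; has a slight modulo bias, acceptable for a sketch.
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Seed derived from the wall clock so each service start picks differently.
fn seed_from_clock() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos() as u64)
        .unwrap_or(1)
}

/// Pick up to `k` distinct peers, approximately uniformly at random,
/// using a partial Fisher-Yates shuffle over a copy of the candidates.
fn choose_peers(candidates: &[String], k: usize, rng: &mut XorShift64) -> Vec<String> {
    let mut pool: Vec<String> = candidates.to_vec();
    let k = k.min(pool.len());
    for i in 0..k {
        // Swap a random not-yet-chosen element into position i.
        let j = i + rng.below(pool.len() - i);
        pool.swap(i, j);
    }
    pool.truncate(k);
    pool
}
```

Calling `choose_peers(&all_vfns, 3, &mut XorShift64::new(seed_from_clock()))` once at startup, and again on a timer, would cover both halves of the expectation, assuming the full list of VFN addresses is available in `all_vfns`.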
System information
- Libra 0.18
About this issue
- State: closed
- Created 4 years ago
- Comments: 31 (27 by maintainers)
This reminds me, we should make a runbook based on the experiences we have here.
Clearly the first step now needs to be: do you have the right genesis? 😃
Yes, it is randomized; I misspoke earlier 😦 https://github.com/libra/libra/blob/master/network/src/connectivity_manager/mod.rs#L385
I think this is caused by a wrong chain_id: https://github.com/libra/partners/pull/271
Public network listeners don’t use the connectivity manager (since they only service inbound connections). Public network clients use the connectivity manager to dial public endpoints.
There should be logs from connectivity_manager of the form
"{} Failed to connect to peer: {} at address: {}; error: {}"
that will tell us what kind of error it is.
EDIT: Depending on the error, we can also inspect one of the listeners’ logs. My suspicion is that it’s a configuration error; probably the full node’s peer id is not derived from the pubkey…
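To triage such lines quickly, something like the following could pull the peer, address, and error out of each one. It is a minimal sketch that assumes the fields are rendered exactly as in the format string quoted above and that the address does not itself contain `; error: `. `ConnectFailure` and `parse_connect_failure` are hypothetical names, not part of the Libra codebase.

```rust
/// Fields extracted from one "Failed to connect to peer" log line.
#[derive(Debug, PartialEq)]
struct ConnectFailure {
    peer: String,
    address: String,
    error: String,
}

/// Returns `None` when the line does not match the assumed format.
fn parse_connect_failure(line: &str) -> Option<ConnectFailure> {
    const PEER: &str = "Failed to connect to peer: ";
    const ADDR: &str = " at address: ";
    const ERR: &str = "; error: ";

    // Skip the leading context (the first `{}`) and the fixed prefix.
    let rest = &line[line.find(PEER)? + PEER.len()..];
    // Split on the fixed separators, in the order they appear in the format.
    let (peer, rest) = rest.split_once(ADDR)?;
    let (address, error) = rest.split_once(ERR)?;

    Some(ConnectFailure {
        peer: peer.trim().to_string(),
        address: address.trim().to_string(),
        error: error.trim().to_string(),
    })
}
```

Grouping the parsed results by `error` should show at a glance whether every dial fails the same way, which would be consistent with a configuration problem rather than unreachable peers.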