weave: weave losing connections to other nodes with error: Multiple connections (Kubernetes CNI)

What you expected to happen?

Inter-node cluster-internal traffic to work

What happened?

At random times, one node's pod network becomes unreachable and can't connect to the other nodes' pod networks. Node-internal traffic still works.

Deleting the pod fixes the issue temporarily

Anything else we need to know?

Bare-metal deployment with 3 nodes (1 master, 2 workers), MetalLB in L2 mode, WEAVE_MTU set to 1500, and NO_MASQ_LOCAL set to 1

Versions:

$ weave version
weave script 2.5.1
weave 2.5.1

$ docker version
 Version:           18.09.2
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        6247962
 Built:             Tue Feb 26 23:52:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

$ uname -a
Linux k8sm1 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Error occurred around 08:36 server time

$ kubectl logs -n kube-system <weave-net-pod> weave
INFO: 2019/03/20 08:34:47.205497 Sending ICMP 3,4 (10.32.0.71 -> 10.40.0.87): PMTU=1438
INFO: 2019/03/20 08:35:31.789047 Sending ICMP 3,4 (10.32.0.69 -> 10.40.0.87): PMTU=1438
INFO: 2019/03/20 08:36:21.533420 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:36:21.533536 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:36:21.533888 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:36:21.534236 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:36:21.534529 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:36:21.534565 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:36:21.534642 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:36:21.534686 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:36:21.534702 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:36:21.534815 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:36:21.535475 ->[192.168.100.72:34749] connection accepted
INFO: 2019/03/20 08:36:21.536439 ->[192.168.100.83:56185] connection accepted
INFO: 2019/03/20 08:36:21.536784 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.536816 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.536911 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.536958 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:36:21.536982 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:36:21.537033 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:36:21.537067 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:36:21.537121 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:36:21.537254 ->[192.168.100.83:56185|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.537418 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:36:21.537428 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:36:21.537466 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:36:21.537573 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:36:21.538144 ->[192.168.100.83:56185|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:36:21.538952 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using sleeve
INFO: 2019/03/20 08:36:21.538997 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection fully established
INFO: 2019/03/20 08:36:21.539047 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:36:21.539820 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using sleeve
INFO: 2019/03/20 08:36:21.539865 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection fully established
INFO: 2019/03/20 08:36:21.539889 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:36:21.540080 sleeve ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: Effective MTU verified at 1438
INFO: 2019/03/20 08:36:21.540633 sleeve ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: Effective MTU verified at 1438
INFO: 2019/03/20 08:37:21.538009 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:37:21.538103 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:37:21.538208 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:37:21.538918 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:37:21.538991 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:37:21.539300 ->[192.168.100.83:58245] connection accepted
INFO: 2019/03/20 08:37:21.539487 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:37:21.539927 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:37:21.540016 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:37:21.540245 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:37:21.540329 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:37:21.540428 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.540520 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:37:21.540564 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.541363 ->[192.168.100.72:34149] connection accepted
INFO: 2019/03/20 08:37:21.541773 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.541884 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:37:21.541919 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:37:21.542005 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:37:21.542359 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:37:21.542428 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.544058 ->[192.168.100.83:60179] connection accepted
INFO: 2019/03/20 08:37:21.544313 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.544401 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:37:21.544435 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.544515 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:37:21.544654 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.544768 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:37:21.544806 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using sleeve
INFO: 2019/03/20 08:37:21.544841 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.544929 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection fully established
INFO: 2019/03/20 08:37:21.544771 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:37:21.545106 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:37:21.545485 sleeve ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: Effective MTU verified at 1438
INFO: 2019/03/20 08:37:21.545499 ->[192.168.100.83:60179|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.545576 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:37:21.545624 ->[192.168.100.83:60179|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:38:21.543908 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:38:21.544002 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:38:21.544094 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:38:21.544839 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:38:21.545145 ->[192.168.100.83:53093] connection accepted
INFO: 2019/03/20 08:38:21.545377 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using sleeve
INFO: 2019/03/20 08:38:21.545653 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:38:21.545717 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:38:21.545796 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:38:21.545840 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:38:21.545859 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:38:21.545971 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.546050 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:38:21.546107 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:38:21.546289 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:38:21.546325 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.546526 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:38:21.546586 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:38:21.546786 ->[192.168.100.72:36693] connection accepted
INFO: 2019/03/20 08:38:21.547484 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.547587 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:38:21.547669 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:38:21.547876 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.547987 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:38:21.548020 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:39:21.546646 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:39:21.546717 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:39:21.546826 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:39:21.547406 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:39:21.548006 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:39:21.548072 ->[192.168.100.83:40009] connection accepted
INFO: 2019/03/20 08:39:21.548155 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:39:21.548281 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:39:21.548336 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:39:21.548371 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:39:21.548525 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:39:21.549229 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.549295 ->[192.168.100.72:45581] connection accepted
INFO: 2019/03/20 08:39:21.549327 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:39:21.549346 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.549437 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:39:21.549462 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:39:21.549797 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:39:21.549858 ->[192.168.100.72:45581|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.549914 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:39:21.549925 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: write tcp4 192.168.100.79:6783->192.168.100.83:40009: write: connection reset by peer
INFO: 2019/03/20 08:39:21.549980 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.550006 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:39:21.550154 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:39:21.550257 ->[192.168.100.72:45581|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:39:21.550368 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:39:21.552197 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using sleeve
INFO: 2019/03/20 08:39:21.552260 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection fully established
INFO: 2019/03/20 08:39:21.552399 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:39:21.552982 sleeve ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: Effective MTU verified at 1438
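
The log above repeats the same shutdown reasons roughly once a minute, which is easier to see as a tally per peer. The following is a minimal sketch, not part of Weave: the regular expression is inferred only from the log lines in this report, and the names `summarize_shutdowns` and `classify` are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the log lines in this report, for example:
# "INFO: <date> <time> ->[<addr>|<mac>(<peer>)]: connection shutting down due to error: <reason>"
SHUTDOWN_RE = re.compile(
    r"^\w+: \S+ \S+ "
    r"->\[(?P<addr>[^|\]]+)\|(?P<mac>[0-9a-f:]+)\((?P<peer>[^)]+)\)\]: "
    r"connection shutting down due to error: (?P<reason>.*)$"
)


def classify(reason):
    """Map a raw error message to one of the categories seen in this report."""
    if reason.startswith("Multiple connections"):
        return "multiple connections"
    if reason.startswith("no working forwarders"):
        return "no working forwarders"
    if "connection reset by peer" in reason:
        return "connection reset by peer"
    return "other"


def summarize_shutdowns(lines):
    """Count connection shutdowns per (peer name, reason category)."""
    counts = Counter()
    for line in lines:
        match = SHUTDOWN_RE.match(line.strip())
        if match:
            counts[(match.group("peer"), classify(match.group("reason")))] += 1
    return counts
```

Assuming the output of the kubectl logs command has been saved to a file, the function can be called with the open file object, since it accepts any iterable of lines.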

weave status in working state:

# weave status

        Version: 2.5.1 (up to date; next check at 2019/03/20 14:13:47)

        Service: router
       Protocol: weave 1..2
           Name: f2:e1:3e:28:f1:80(k8sw1)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 3
    Connections: 3 (2 established, 1 failed)
          Peers: 3 (with 6 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12

# weave status peers
ae:3f:91:18:2e:cb(k8sm1)
   -> 192.168.100.83:6783   9e:34:c5:bd:b8:9e(k8sw2)              established
   <- 192.168.100.79:50229  f2:e1:3e:28:f1:80(k8sw1)              established
9e:34:c5:bd:b8:9e(k8sw2)
   <- 192.168.100.72:38745  ae:3f:91:18:2e:cb(k8sm1)              established
   <- 192.168.100.79:39057  f2:e1:3e:28:f1:80(k8sw1)              established
f2:e1:3e:28:f1:80(k8sw1)
   -> 192.168.100.72:6783   ae:3f:91:18:2e:cb(k8sm1)              established
   -> 192.168.100.83:6783   9e:34:c5:bd:b8:9e(k8sw2)              established

# weave status connections
-> 192.168.100.72:6783   established sleeve ae:3f:91:18:2e:cb(k8sm1) mtu=1438
-> 192.168.100.83:6783   established sleeve 9e:34:c5:bd:b8:9e(k8sw2) mtu=1438
-> 192.168.100.79:6783   failed      cannot connect to ourself, retry: never
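
Both established connections in the output above use sleeve rather than fastdp, matching the "using sleeve" lines in the log. A short script can flag this automatically. This is a sketch only: the column layout is assumed from the output in this report (other Weave versions may print extra fields), and `parse_connections` and `sleeve_fallbacks` are hypothetical names.

```python
def parse_connections(text):
    """Parse `weave status connections` output into a list of dicts.

    Assumes each line starts with a direction arrow, an address and a state,
    as in the output shown in this report.
    """
    connections = []
    for raw in text.splitlines():
        parts = raw.split()
        if len(parts) < 3 or parts[0] not in ("->", "<-"):
            continue  # skip blank lines and anything that is not a connection row
        direction, address, state = parts[:3]
        rest = parts[3:]
        # The overlay name, when present, appears as its own token.
        overlay = next((tok for tok in rest if tok in ("sleeve", "fastdp")), None)
        connections.append({
            "direction": direction,
            "address": address,
            "state": state,
            "overlay": overlay,
            "info": " ".join(rest),
        })
    return connections


def sleeve_fallbacks(connections):
    """Return addresses of established connections that are using sleeve."""
    return [
        conn["address"]
        for conn in connections
        if conn["state"] == "established" and conn["overlay"] == "sleeve"
    ]
```

Running `sleeve_fallbacks(parse_connections(output))` on the text above would list both peer addresses; an empty list would mean no established connection is on sleeve.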

I will try to capture the weave outputs during the failure state. At the time I didn't have the weave script installed, hadn't looked up the weave troubleshooting steps, and had to get the issue fixed ASAP.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 21 (6 by maintainers)

Most upvoted comments

What OS are you guys using?

We’ve encountered tons of "connection reset by peer" issues, and I’ve finally traced them back to networkd usage on CoreOS (2079.5.1 as of this writing, though we have been running various versions since ~600). We are on K8s 1.11.10 and Weave 2.5.2 at this time.

See #3: https://www.weave.works/blog/running-a-weave-network-on-coreos/

Since I’m using Kops, I added a drop-in to install this file:

## /etc/systemd/network/10-weave.network
[Match]
Name=weave datapath vxlan-6784 dummy0

[Network]
Description=Network interfaces managed by weave

[Link]
Unmanaged=true

Seems to have helped.
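
To confirm on a node that the file actually contains these settings, something like the following could be used. It is only a sketch: it checks the file text, not whether networkd has applied it, the interface list is copied from the drop-in above, and `check_dropin` is a hypothetical helper name.

```python
import configparser

# Interface names taken from the drop-in shown above.
EXPECTED_INTERFACES = {"weave", "datapath", "vxlan-6784", "dummy0"}


def check_dropin(text):
    """Return a list of problems found in the .network file text (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key names case-sensitive, as in the file
    parser.read_string(text)

    problems = []
    names = set(parser.get("Match", "Name", fallback="").split())
    missing = EXPECTED_INTERFACES - names
    if missing:
        problems.append("missing interfaces: " + ", ".join(sorted(missing)))

    unmanaged = parser.get("Link", "Unmanaged", fallback="").strip().lower()
    if unmanaged not in ("1", "yes", "true", "on"):
        problems.append("[Link] Unmanaged is not enabled")
    return problems
```

Passing the contents of /etc/systemd/network/10-weave.network to `check_dropin` should return an empty list when the file matches the example.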

[edit] Made the example drop-in more explicit

I think we’re hitting the same issue. Symptoms look identical.

We have nodes with these log chunks repeating over and over:

On weave-net-nq27f (3/3 Running, 2 restarts, age 7d, IP 10.186.0.157, node kubernetes-kubernetes-cr2-2-1547767704):

INFO: 2019/03/26 10:43:19.850007 ->[10.186.0.187:32815] connection accepted
INFO: 2019/03/26 10:43:19.851079 ->[10.186.0.187:32815|f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704)]: connection ready; using protocol version 2
INFO: 2019/03/26 10:43:19.851282 overlay_switch ->[f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704)] using fastdp
INFO: 2019/03/26 10:43:19.851336 ->[10.186.0.187:32815|f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704)]: connection shutting down due to error: Multiple connections to f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704) added to 6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)

On weave-net-hh9rj (3/3 Running, 27 restarts, age 7d, IP 10.186.0.187, node kubernetes-kubernetes-cr1-13-1547767704):

INFO: 2019/03/26 10:43:19.848875 ->[10.186.0.157:6783] attempting connection
INFO: 2019/03/26 10:43:19.850807 ->[10.186.0.157:6783|6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)]: connection ready; using protocol version 2
INFO: 2019/03/26 10:43:19.850973 overlay_switch ->[6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)] using fastdp
INFO: 2019/03/26 10:43:19.851002 ->[10.186.0.157:6783|6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)]: connection added
INFO: 2019/03/26 10:43:19.853848 Setting up IPsec between f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704) and 6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)
INFO: 2019/03/26 10:43:19.854323 ipsec: InitSALocal: 10.186.0.157 -> 10.186.0.187 :6784 0x73b96332
ERRO: 2019/03/26 10:43:19.876556 fastdp ->[10.186.0.157:6784|6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)]: ipsec init SA local failed: send InitSARemote: write tcp4 10.186.0.187:32815->10.186.0.157:6783: write: broken pipe
INFO: 2019/03/26 10:43:19.876658 ->[10.186.0.157:6783|6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)]: connection shutting down due to error: write tcp4 10.186.0.187:32815->10.186.0.157:6783: write: connection reset by peer
INFO: 2019/03/26 10:43:19.876719 ->[10.186.0.157:6783|6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)]: connection deleted
INFO: 2019/03/26 10:43:19.876735 Destroying IPsec between f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704) and 6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)
INFO: 2019/03/26 10:43:19.876802 ipsec: destroy: in 10.186.0.157 -> 10.186.0.187 0x73b96332
INFO: 2019/03/26 10:43:19.877741 overlay_switch ->[6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)] fastdp send InitSARemote: write tcp4 10.186.0.187:32815->10.186.0.157:6783: write: broken pipe
INFO: 2019/03/26 10:43:19.877776 overlay_switch ->[6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)] using sleeve