weave: Lost K8S autoscaled nodes contacted indefinitely
What you expected to happen?
When a Kubernetes node is removed from our autoscaled cluster, Weave continues trying to contact it until an operator manually runs rmpeer and forget for the node from the command line.
What happened?
weave-net-mlqj4 weave INFO: 2018/05/14 14:30:16.028571 ->[10.0.3.96:6783] attempting connection
weave-net-p5mjh weave INFO: 2018/05/14 14:30:17.357611 Discovered remote MAC 5a:b9:a1:0c:57:a6 at 1e:bc:cb:84:0d:2a(ip-10-0-3-37.us-east-2.compute.internal)
weave-net-mlqj4 weave INFO: 2018/05/14 14:30:19.091097 ->[10.0.3.96:6783] error during connection attempt: dial tcp4 :0->10.0.3.96:6783: connect: no route to host
weave-net-2bgzv weave INFO: 2018/05/14 14:30:25.250138 ->[10.0.3.93:6783] attempting connection
… about once a minute
How to reproduce it?
Remove a Kubernetes node (terminate the instance)
Anything else we need to know?
Kubernetes 1.9.4, kubeadm cluster on AWS.
Versions:
$ weave version
weave 2.3.0
$ docker version
Client:
Version: 17.05.0-ce
API version: 1.29
Go version: go1.7.5
Git commit: 89658be
Built: Thu May 4 22:10:54 2017
OS/Arch: linux/amd64
$ uname -a
Linux ip-10-0-3-160.us-east-2.compute.internal 4.8.0-59-generic #64-Ubuntu SMP Thu Jun 29 19:38:34 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5", GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean", BuildDate:"2018-03-19T15:59:24Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.4", GitCommit:"bee2d1505c4fe820744d26d41ecd3fdd4a3d6546", GitTreeState:"clean", BuildDate:"2018-03-12T16:21:35Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Logs:
Grepped out a relevant host:
INFO: 2018/05/11 00:40:34.163026 Launch detected - using supplied peer list: [10.0.1.109 10.0.1.128 10.0.1.200 10.0.1.41 10.0.1.43 10.0.1.64 10.0.3.114 10.0.3.37 10.0.3.64 10.0.3.73 10.0.3.93]
INFO: 2018/05/11 00:40:34.175099 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:40:34.176893 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2018/05/11 00:40:34.177060 overlay_switch ->[5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)] using fastdp
INFO: 2018/05/11 00:40:34.179623 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection added (new peer)
INFO: 2018/05/11 00:40:34.200111 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection fully established
INFO: 2018/05/11 00:40:34.206113 sleeve ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: Effective MTU verified at 8939
INFO: 2018/05/11 00:40:35.102470 [kube-peers] Added myself to peer list &{[{e6:4b:b7:4c:df:d0 ip-10-0-1-128.us-east-2.compute.internal} {c2:18:9c:74:e7:f2 ip-10-0-3-64.us-east-2.compute.internal} {aa:84:b9:7a:13:5a ip-10-0-3-114.us-east-2.compute.internal} {86:41:ea:22:67:e5 ip-10-0-3-73.us-east-2.compute.internal} {e6:d6:b7:a2:bf:26 ip-10-0-1-41.us-east-2.compute.internal} {86:f7:66:2b:87:19 ip-10-0-1-200.us-east-2.compute.internal} {7a:e5:64:52:cd:43 ip-10-0-3-93.us-east-2.compute.internal} {1e:bc:cb:84:0d:2a ip-10-0-3-37.us-east-2.compute.internal} {22:02:44:cb:1a:e5 ip-10-0-1-109.us-east-2.compute.internal} {5e:31:34:2d:2e:e9 ip-10-0-1-64.us-east-2.compute.internal} {8a:50:e4:5b:fc:b3 ip-10-0-1-43.us-east-2.compute.internal}]}
INFO: 2018/05/11 00:40:35.968265 Discovered remote MAC 5e:31:34:2d:2e:e9 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:40:36.070057 Discovered remote MAC e2:05:43:3e:09:b9 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:40:36.070427 Discovered remote MAC b2:1b:27:4c:a4:52 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:40:36.215259 Discovered remote MAC 3a:3c:9f:e7:7d:08 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:42:00.633047 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection shutting down due to error: read tcp4 10.0.1.43:50037->10.0.1.64:6783: read: connection reset by peer
INFO: 2018/05/11 00:42:00.633138 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection deleted
INFO: 2018/05/11 00:42:00.640213 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:00.640585 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:00.646541 Removed unreachable peer 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:42:03.457897 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:03.458262 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:05.602050 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:05.602441 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:11.717182 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:11.717543 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:21.712218 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:40.031406 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:42:53.518432 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:56.575443 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:43:12.400009 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:43:15.455412 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:43:41.529976 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:43:44.607391 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:44:32.622034 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:44:35.679413 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:45:27.154253 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:45:30.207380 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:46:46.325098 ->[10.0.1.64:6783] attempting connection
… and so on for days.
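To make the retry pattern in the log above easier to see, here is a minimal sketch that extracts the "attempting connection" timestamps for one peer address and prints the gap between consecutive attempts. The function name `retry_gaps` and the `SAMPLE` excerpt are illustrative only and are not part of Weave; the sample lines are copied from the log above, and the regular expression assumes the log format shown there.

```python
import re
from datetime import datetime

# Matches router log lines such as:
#   INFO: 2018/05/11 00:42:00.640213 ->[10.0.1.64:6783] attempting connection
# Assumption: the timestamp and address format is exactly as in the log above.
ATTEMPT_RE = re.compile(
    r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"->\[([\d.]+:\d+)\] attempting connection"
)


def retry_gaps(log_text, peer):
    """Return the seconds between consecutive connection attempts to `peer`."""
    times = [
        datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S.%f")
        for stamp, addr in ATTEMPT_RE.findall(log_text)
        if addr == peer
    ]
    # Pair each attempt with the next one and measure the difference.
    return [round((b - a).total_seconds(), 1) for a, b in zip(times, times[1:])]


# Excerpt copied from the log above (attempts plus the errors in between).
SAMPLE = """\
INFO: 2018/05/11 00:42:00.640213 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:00.640585 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:03.457897 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:03.458262 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:05.602050 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:05.602441 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:11.717182 ->[10.0.1.64:6783] attempting connection
"""

if __name__ == "__main__":
    print(retry_gaps(SAMPLE, "10.0.1.64:6783"))
```

Run against the full log rather than this excerpt, the same function should show the interval between attempts growing while the attempts themselves never stop, which is the behaviour this issue is about.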
See also issue #2797
About this issue
- State: closed
- Created 6 years ago
- Comments: 32 (11 by maintainers)
Hi @m0rganic, this was fixed recently, as you can see in another issue I had open that I think is related to cleaning up these old nodes: https://github.com/weaveworks/weave/issues/3427#issuecomment-435193838