weave: Node loses pod connectivity after OOM
Occasionally, after the OOM killer is triggered, the node loses its pod connectivity once it is back up again (i.e. in the Ready state).
Deleting the weave pod (which is then recreated) makes the issue go away.
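For reference, restarting weave on the affected node looks roughly like the sketch below. This is an assumption based on the default weave-kube DaemonSet (namespace `kube-system`, label `name=weave`); the node name is hypothetical.

```shell
# Hypothetical affected node; replace with the real node name.
NODE="c4-b2"
# Find the weave pod scheduled on that node (assumes the default
# weave-kube DaemonSet labels in kube-system).
POD=$(kubectl -n kube-system get pods -l name=weave \
  --field-selector spec.nodeName="$NODE" -o name)
# Deleting it causes the DaemonSet controller to recreate it,
# which restores pod connectivity on the node.
kubectl -n kube-system delete "$POD"
```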
What you expected to happen?
I expected the node to eventually recover from the OOM, and/or to report its state as NotReady if it hasn't.
What happened?
The node reports its network state as ready, but pod IPs cannot be reached from that node or from pods running on it.
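A minimal reachability check from the affected node, assuming a hypothetical pod IP from the weave range shown below (10.32.0.0/12):

```shell
# Hypothetical pod IP in the weave range; substitute a real pod's IP
# (e.g. from `kubectl get pods -o wide`).
POD_IP="10.32.0.5"
# While the issue is present this times out; after the weave pod is
# recreated it succeeds again.
ping -c 3 -W 2 "$POD_IP"
```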
How to reproduce it?
This is not fully reproducible, but almost all occurrences have followed some random pod triggering the OOM killer. We've successfully quarantined the bug on a node, and can examine it if further information is needed.
Versions:
$ weave version
weave 2.5.2
$ docker version
...
Server: Docker Engine - Community
Engine:
Version: 18.09.6
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: 481bc77
Built: Sat May 4 01:59:36 2019
OS/Arch: linux/amd64
Experimental: false
$ uname -a
Linux c4-b2 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
...
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Logs:
Lots of occurrences of the following lines:
...connection shutting down due to error: read tcp4...
...connection deleted...
...connection shutting down due to error during handshake: failed to receive remote protocol header...
but these only appear during the OOM; after that the logs return to normal (e.g. Discovered remote MAC).
Network:
$ ip route
...
10.32.0.0/12 dev weave proto kernel scope link src 10.33.128.0
$ ip -4 -o addr
...
6: weave inet 10.33.128.0/12 brd 10.47.255.255 scope global weave\ valid_lft forever preferred_lft forever
$ /home/weave/weave --local status # inside the weave container
Version: 2.5.2 (up to date; next check at 2019/09/02 21:13:01)
Service: router
Protocol: weave 1..2
Name: 66:cf:c7:9d:f2:00(c4-b2)
Encryption: disabled
PeerDiscovery: enabled
Targets: 16
Connections: 16 (15 established, 1 failed)
Peers: 16 (with 240 established connections)
TrustedSubnets: none
Service: ipam
Status: ready
Range: 10.32.0.0/12
DefaultSubnet: 10.32.0.0/12
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 19 (8 by maintainers)
The issue has occurred again, and I was able to quarantine it on a single node, so it should be possible to inspect it. An interesting aspect of the issue is that the OOM has NOT killed weave itself: it killed another container (in this case Prometheus), yet the node still lost its connectivity.

We're experiencing the same thing; this should not be closed.