weave: Node loses pod connectivity after OOM

Occasionally, after the OOM killer has been triggered and the node is back up (i.e. in the Ready state), the node loses its pod connectivity. Deleting the weave pod (which causes it to be recreated) makes the issue go away.
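
For reference, a rough sketch of that workaround (the label and namespace assume a standard weave-net DaemonSet install; c4-b2 stands in for the affected node):

$ kubectl -n kube-system delete \
    $(kubectl -n kube-system get pods -l name=weave-net \
        --field-selector spec.nodeName=c4-b2 -o name)
# the DaemonSet controller recreates the pod, restoring connectivity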

What you expected to happen?

I expected the node to eventually recover from the OOM, or to report its state as NotReady if it hadn’t.

What happened?

The node reports its network state as ready, but pod IPs cannot be reached from that node or from the pods running on it.
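
As an illustration of the symptom (10.33.128.5 is a hypothetical pod IP, not taken from the report):

$ kubectl get node c4-b2                    # still reports Ready
$ kubectl get pods --all-namespaces -o wide \
    --field-selector spec.nodeName=c4-b2    # pick any pod IP on the node
$ ping -c 3 10.33.128.5                     # times out from the affected node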

How to reproduce it?

This is not fully reproducible, but almost all occurrences have followed some random pod triggering the OOM killer. We’ve successfully quarantined the bug on a node and can examine it if further information is needed.
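
If it helps with a reproduction attempt, the host-level OOM killer can be provoked deliberately. A sketch, assuming the public polinux/stress image and a node with less RAM than requested (no memory limit is set, so the kernel OOM killer rather than a cgroup limit steps in):

$ kubectl run oom-test --restart=Never --image=polinux/stress \
    --command -- stress --vm 1 --vm-bytes 64g --vm-hang 0
# --vm-bytes should exceed the node's physical memory;
# --vm-hang 0 holds the allocation instead of freeing it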

Versions:

$ weave version
weave 2.5.2
$ docker version
...
Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May  4 01:59:36 2019
  OS/Arch:          linux/amd64
  Experimental:     false
$ uname -a
Linux c4-b2 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
...
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Lots of occurrences of the following lines:

...connection shutting down due to error: read tcp4...
...connection deleted...
...connection shutting down due to error during handshake: failed to receive remote protocol header...

These lines only appear during the OOM; after that the log goes back to normal entries (e.g. Discovered remote MAC).
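
For completeness, the lines above come from the weave container’s log; a sketch of how to pull them (label and container name assume the standard weave-net DaemonSet):

$ kubectl -n kube-system logs -c weave \
    $(kubectl -n kube-system get pods -l name=weave-net \
        --field-selector spec.nodeName=c4-b2 -o name) | grep -i connection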

Network:

$ ip route
...
10.32.0.0/12 dev weave  proto kernel  scope link  src 10.33.128.0
$ ip -4 -o addr
...
6: weave    inet 10.33.128.0/12 brd 10.47.255.255 scope global weave\       valid_lft forever preferred_lft forever
$ /home/weave/weave --local status # inside the weave container
        Version: 2.5.2 (up to date; next check at 2019/09/02 21:13:01)

        Service: router
       Protocol: weave 1..2
           Name: 66:cf:c7:9d:f2:00(c4-b2)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 16
    Connections: 16 (15 established, 1 failed)
          Peers: 16 (with 240 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12
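
The one failed connection can be inspected further with weave’s own status subcommand (run inside the weave container, as above):

$ /home/weave/weave --local status connections    # per-connection state for each peer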

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 19 (8 by maintainers)

Most upvoted comments

The issue has occurred again, and I was able to quarantine it on a single node, so it should be possible to inspect it. An interesting aspect is that the OOM did NOT kill weave itself; it killed another container (in this case Prometheus), yet the node still lost its connectivity.
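
For anyone wanting to do the same, a minimal sketch of quarantining a node for inspection with stock kubectl (c4-b2 stands in for the affected node):

$ kubectl cordon c4-b2     # marks the node unschedulable; existing pods keep running
$ kubectl get node c4-b2   # STATUS shows Ready,SchedulingDisabled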

We’re experiencing the same thing; this should not be closed.