weave: Intermittent packet loss between pods

After asking for advice on Slack, @brb suggested that I log this as an issue.

We’re seeing a small amount of intermittent packet loss in our production environment. To help troubleshoot, @backwardspy wrote a small Rust application which pretends to speak HTTP:

use std::io::Write;
use std::net::{TcpListener, TcpStream};

// Canned HTTP response; Content-Length matches the six-byte body "echo" (quotes included).
const DATA: &[u8] = b"HTTP/1.1 200 OK\r\n\
Content-Type: text/plain\r\n\
Content-Length: 6\r\n\
\r\n\
\"echo\"";

fn handle_client(mut stream: TcpStream) {
    // write_all sends the whole response rather than risking a partial write
    stream.write_all(DATA).unwrap();
}

fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").unwrap();

    // accept connections and process them serially
    for stream in listener.incoming() {
        handle_client(stream.unwrap());
    }
}

Running the above locally, a Python client can make approximately 800 connections per second, so the server is reasonably performant. Once we deployed it into our Kubernetes cluster, I launched an Ubuntu pod and ran curl in a loop: while true; do curl my360-jeff:8080; done. After approximately 350 connections we start to see intermittent occurrences of curl: (56) Recv failure: Connection reset by peer.
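For reference, below is a minimal Rust sketch of the kind of client loop used to hammer the service. This is an assumption about how such a client might look, not the original Python client; the service name my360-jeff:8080 is taken from the curl loop above and the request count is a placeholder.

use std::io::{Read, Write};
use std::net::TcpStream;

// Hypothetical load-test loop against the test service deployed above.
fn main() {
    let target = "my360-jeff:8080";
    let (mut ok, mut failed) = (0u64, 0u64);

    for i in 0..1_000u32 {
        match try_request(target) {
            Ok(_) => ok += 1,
            Err(e) => {
                failed += 1;
                eprintln!("request {} failed: {}", i, e);
            }
        }
    }

    println!("ok: {}, failed: {}", ok, failed);
}

// Open a fresh TCP connection, send a bare GET, and read until the server closes the socket.
// A "connection reset by peer" surfaces here as an Err from connect, write_all, or read_to_end.
fn try_request(target: &str) -> std::io::Result<Vec<u8>> {
    let mut stream = TcpStream::connect(target)?;
    stream.write_all(b"GET / HTTP/1.1\r\nHost: test\r\nConnection: close\r\n\r\n")?;
    let mut response = Vec::new();
    stream.read_to_end(&mut response)?;
    Ok(response)
}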

On Slack I was advised to look at the Weave logs, which contain entries like the following (roughly 100 per hour):

# prod-worker-01
ERRO: 2018/07/24 14:36:20.139952 Captured frame from MAC (2e:2e:c5:80:80:c6) to (12:61:a3:9e:b0:b0) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
ERRO: 2018/07/24 14:36:20.136592 Captured frame from MAC (2e:2e:c5:80:80:c6) to (ae:d5:37:9e:f4:77) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)

# prod-worker-08:
ERRO: 2018/07/24 14:36:38.431249 Captured frame from MAC (2e:2e:c5:80:80:c6) to (12:61:a3:9e:b0:b0) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
ERRO: 2018/07/24 14:36:38.426140 Captured frame from MAC (2e:2e:c5:80:80:c6) to (ae:d5:37:9e:f4:77) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)

# prod-worker-10:
ERRO: 2018/07/24 14:35:48.005625 Captured frame from MAC (2e:2e:c5:80:80:c6) to (12:61:a3:9e:b0:b0) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
ERRO: 2018/07/24 14:35:48.003711 Captured frame from MAC (2e:2e:c5:80:80:c6) to (ae:d5:37:9e:f4:77) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)

Additionally, netstat -i shows some TX-DRP on the vxlan-6784 interface on some nodes (columns: Iface, MTU, Met, RX-OK, RX-ERR, RX-DRP, RX-OVR, TX-OK, TX-ERR, TX-DRP, TX-OVR, Flg):

# prod-worker-01 (1972 TX-DRP)
vxlan-6784 65535 0  48350779        0      0 0      52282585      0   1972      0 BMRU
# prod-worker-02 (29 TX-DRP)
vxlan-6784 65535 0  659979670       0      0 0      774399398     0     29      0 BMRU
# prod-worker-03 (74 TX-DRP)
vxlan-6784 65535 0  3232560976      0      0 0      3122201921    0     74      0 BMRU
# prod-worker-04 (206 TX-DRP)
vxlan-6784 65535 0  1266458622      0      0 0      1270704282    0    206      0 BMRU
# prod-worker-05 (88 TX-DRP)
vxlan-6784 65535 0  316929245       0      0 0      268493199     0     88      0 BMRU
# prod-worker-06 (52 TX-DRP)
vxlan-6784 65535 0  280753898       0      0 0      268094608     0     52      0 BMRU
# prod-worker-07 (30 TX-DRP)
vxlan-6784 65535 0  268853355       0      0 0      99970242      0     30      0 BMRU
# prod-worker-08 (0 TX-DRP)
vxlan-6784 65535 0  76848987        0      0 0      78291696      0      0      0 BMRU
# prod-worker-09 (455 TX-DRP)
vxlan-6784 65535 0  255571354       0      0 0      267805374     0    455      0 BMRU
# prod-worker-10 (8 TX-DRP)
vxlan-6784 65535 0  1057054491      0      0 0      1059490874    0      8      0 BMRU

What you expected to happen?

Connections arrive at the correct pod and complete successfully.

What happened?

curl intermittently fails with curl: (56) Recv failure: Connection reset by peer.

How to reproduce it?

For us, simply sending a lot of traffic over the pod network (e.g. the curl loop above) shows the issue quite clearly.

Anything else we need to know?

  • Microsoft Azure
  • Ubuntu 16.04
  • Kubernetes 1.11.1
  • Kubernetes was configured manually; we can provide systemd unit examples if required.

Versions:

Weave:

cpressland@prod-worker-08 ~ sudo docker exec -it $(sudo docker ps  | grep weave_weave | cut -c 1-12) ./weave --local version
weave 2.3.0

Docker:

cpressland@prod-worker-08 ~ sudo docker version
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 03:35:14 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 03:35:14 2017
 OS/Arch:      linux/amd64
 Experimental: false

uname -a:

cpressland@prod-worker-08 ~ uname -a
Linux prod-worker-08 4.15.0-1014-azure #14~16.04.1-Ubuntu SMP Thu Jun 14 15:42:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

kubectl:

cpressland@prod-controller-01 ~ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Weave Logs: weave-logs-1-hour.xlsx

Nothing interesting or concerning in the kubelet or docker logs.

Network:

ip route (worker-06):

cpressland@prod-worker-06 ~ ip route
default via 10.2.1.1 dev eth0
10.2.1.0/24 dev eth0  proto kernel  scope link  src 10.2.1.9
10.32.0.0/12 dev weave  proto kernel  scope link  src 10.34.0.1
168.63.129.16 via 10.2.1.1 dev eth0
169.254.169.254 via 10.2.1.1 dev eth0
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1 linkdown

ip -4 -o addr

cpressland@prod-worker-06 ~ ip -4 -o addr
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 10.2.1.9/24 brd 10.2.1.255 scope global eth0\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.0.1/16 scope global docker0\       valid_lft forever preferred_lft forever
6: weave    inet 10.34.0.1/12 brd 10.47.255.255 scope global weave\       valid_lft forever preferred_lft forever

sudo iptables-save: https://gist.github.com/cpressland/9745d9e2fd9547c06e84b6ba11aede5e


Most upvoted comments

@Hashfyre Please open a new issue.

The issue reported in this bug has nothing to do with Weave.