weave: Intermittent packet loss between pods

After asking for advice on Slack, @brb suggested that I log this as an issue.

We’re seeing a small amount of intermittent packet loss in our production environment. To help troubleshoot, @backwardspy wrote a small Rust application which pretends to speak HTTP:

use std::io::Write;
use std::net::{TcpListener, TcpStream};

// Canned HTTP response; Content-Length matches the six-byte body "echo" (quotes included).
const DATA: &[u8] = b"HTTP/1.1 200 OK\r\n\
Content-Type: text/plain\r\n\
Content-Length: 6\r\n\
\r\n\
\"echo\"";

fn handle_client(mut stream: TcpStream) {
    // write_all sends the whole response rather than risking a partial write
    stream.write_all(DATA).unwrap();
}

fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").unwrap();

    // accept connections and process them serially
    for stream in listener.incoming() {
        handle_client(stream.unwrap());
    }
}

Running the above locally, a Python client can make approximately 800 connections per second, so the server is reasonably performant. Once we deployed it into our Kubernetes cluster, I launched an Ubuntu pod and ran curl in a loop: while true; do curl my360-jeff:8080; done. After approximately 350 connections we start to see intermittent occurrences of curl: (56) Recv failure: Connection reset by peer.
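For reference, below is a minimal Rust sketch of the kind of client loop used to hammer the service. This is an assumption about how such a client might look, not the original Python client; the service name my360-jeff:8080 is taken from the curl loop above and the request count is a placeholder.

use std::io::{Read, Write};
use std::net::TcpStream;

// Hypothetical load-test loop against the test service deployed above.
fn main() {
    let target = "my360-jeff:8080";
    let (mut ok, mut failed) = (0u64, 0u64);

    for i in 0..1_000u32 {
        match try_request(target) {
            Ok(_) => ok += 1,
            Err(e) => {
                failed += 1;
                eprintln!("request {} failed: {}", i, e);
            }
        }
    }

    println!("ok: {}, failed: {}", ok, failed);
}

// Open a fresh TCP connection, send a bare GET, and read until the server closes the socket.
// A "connection reset by peer" surfaces here as an Err from connect, write_all, or read_to_end.
fn try_request(target: &str) -> std::io::Result<Vec<u8>> {
    let mut stream = TcpStream::connect(target)?;
    stream.write_all(b"GET / HTTP/1.1\r\nHost: test\r\nConnection: close\r\n\r\n")?;
    let mut response = Vec::new();
    stream.read_to_end(&mut response)?;
    Ok(response)
}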

On Slack I was advised to look at the Weave logs, which contain entries like the following (roughly 100 per hour):

# prod-worker-01
ERRO: 2018/07/24 14:36:20.139952 Captured frame from MAC (2e:2e:c5:80:80:c6) to (12:61:a3:9e:b0:b0) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
ERRO: 2018/07/24 14:36:20.136592 Captured frame from MAC (2e:2e:c5:80:80:c6) to (ae:d5:37:9e:f4:77) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)

# prod-worker-08:
ERRO: 2018/07/24 14:36:38.431249 Captured frame from MAC (2e:2e:c5:80:80:c6) to (12:61:a3:9e:b0:b0) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
ERRO: 2018/07/24 14:36:38.426140 Captured frame from MAC (2e:2e:c5:80:80:c6) to (ae:d5:37:9e:f4:77) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)

# prod-worker-10:
ERRO: 2018/07/24 14:35:48.005625 Captured frame from MAC (2e:2e:c5:80:80:c6) to (12:61:a3:9e:b0:b0) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
ERRO: 2018/07/24 14:35:48.003711 Captured frame from MAC (2e:2e:c5:80:80:c6) to (ae:d5:37:9e:f4:77) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)

Additionally, netstat -i shows some TX-DRP on the vxlan-6784 interface on some nodes (columns: Iface, MTU, Met, RX-OK, RX-ERR, RX-DRP, RX-OVR, TX-OK, TX-ERR, TX-DRP, TX-OVR, Flg):

# prod-worker-01 (1972 TX-DRP)
vxlan-6784 65535 0  48350779        0      0 0      52282585      0   1972      0 BMRU
# prod-worker-02 (29 TX-DRP)
vxlan-6784 65535 0  659979670       0      0 0      774399398     0     29      0 BMRU
# prod-worker-03 (74 TX-DRP)
vxlan-6784 65535 0  3232560976      0      0 0      3122201921    0     74      0 BMRU
# prod-worker-04 (206 TX-DRP)
vxlan-6784 65535 0  1266458622      0      0 0      1270704282    0    206      0 BMRU
# prod-worker-05 (88 TX-DRP)
vxlan-6784 65535 0  316929245       0      0 0      268493199     0     88      0 BMRU
# prod-worker-06 (52 TX-DRP)
vxlan-6784 65535 0  280753898       0      0 0      268094608     0     52      0 BMRU
# prod-worker-07 (30 TX-DRP)
vxlan-6784 65535 0  268853355       0      0 0      99970242      0     30      0 BMRU
# prod-worker-08 (0 TX-DRP)
vxlan-6784 65535 0  76848987        0      0 0      78291696      0      0      0 BMRU
# prod-worker-09 (455 TX-DRP)
vxlan-6784 65535 0  255571354       0      0 0      267805374     0    455      0 BMRU
# prod-worker-10 (8 TX-DRP)
vxlan-6784 65535 0  1057054491      0      0 0      1059490874    0      8      0 BMRU

What you expected to happen?

Connections arrive at the correct pod and complete successfully.

What happened?

curl intermittently fails with curl: (56) Recv failure: Connection reset by peer.

How to reproduce it?

For us, simply sending a lot of traffic over the pod network (e.g. the curl loop above) shows the issue quite clearly.

Anything else we need to know?

  • Microsoft Azure
  • Ubuntu 16.04
  • Kubernetes 1.11.1
  • Kubernetes was configured manually; we can provide systemd unit examples if required.

Versions:

Weave:

cpressland@prod-worker-08 ~ sudo docker exec -it $(sudo docker ps  | grep weave_weave | cut -c 1-12) ./weave --local version
weave 2.3.0

Docker:

cpressland@prod-worker-08 ~ sudo docker version
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 03:35:14 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 03:35:14 2017
 OS/Arch:      linux/amd64
 Experimental: false

uname -a:

cpressland@prod-worker-08 ~ uname -a
Linux prod-worker-08 4.15.0-1014-azure #14~16.04.1-Ubuntu SMP Thu Jun 14 15:42:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

kubectl:

cpressland@prod-controller-01 ~ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Weave Logs: weave-logs-1-hour.xlsx

Nothing interesting or concerning in the kubelet or docker logs.

Network:

ip route (worker-06):

cpressland@prod-worker-06 ~ ip route
default via 10.2.1.1 dev eth0
10.2.1.0/24 dev eth0  proto kernel  scope link  src 10.2.1.9
10.32.0.0/12 dev weave  proto kernel  scope link  src 10.34.0.1
168.63.129.16 via 10.2.1.1 dev eth0
169.254.169.254 via 10.2.1.1 dev eth0
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1 linkdown

ip -4 -o addr

cpressland@prod-worker-06 ~ ip -4 -o addr
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 10.2.1.9/24 brd 10.2.1.255 scope global eth0\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.0.1/16 scope global docker0\       valid_lft forever preferred_lft forever
6: weave    inet 10.34.0.1/12 brd 10.47.255.255 scope global weave\       valid_lft forever preferred_lft forever

sudo iptables-save: https://gist.github.com/cpressland/9745d9e2fd9547c06e84b6ba11aede5e


Most upvoted comments

@Hashfyre Please open a new issue.

The issue reported in this bug has nothing to do with Weave.