weave: Intermittent packet loss between pods
After I sought advice on Slack, @brb advised me to log this as an issue.
We’re seeing a small amount of intermittent packet loss in our production environment. @backwardspy wrote a small Rust application that pretends to speak HTTP in order to help troubleshoot the issue:
use std::io::Write;
use std::net::{TcpListener, TcpStream};

// Canned HTTP response: CRLF line endings, with a blank line separating the headers from the 6-byte body.
const DATA: &[u8] = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 6\r\n\r\n\"echo\"";

fn handle_client(mut stream: TcpStream) {
    // write_all retries short writes, so the whole response is sent before the stream is dropped
    stream.write_all(DATA).unwrap();
}

fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").unwrap();
    // accept connections and process them serially
    for stream in listener.incoming() {
        handle_client(stream.unwrap());
    }
}
Running the above locally allows us to make approximately 800 connections per second with a Python client, so it’s reasonably performant. Once we deployed it into our Kubernetes cluster, I launched an Ubuntu pod and put curl in a loop:
while true; do curl my360-jeff:8080; done
After approximately 350 connections we see instances of curl: (56) Recv failure: Connection reset by peer every now and then.
On Slack I was advised to look at the Weave logs, which contain entries like the following (about 100 per hour):
# prod-worker-01
ERRO: 2018/07/24 14:36:20.139952 Captured frame from MAC (2e:2e:c5:80:80:c6) to (12:61:a3:9e:b0:b0) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
ERRO: 2018/07/24 14:36:20.136592 Captured frame from MAC (2e:2e:c5:80:80:c6) to (ae:d5:37:9e:f4:77) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
# prod-worker-08:
ERRO: 2018/07/24 14:36:38.431249 Captured frame from MAC (2e:2e:c5:80:80:c6) to (12:61:a3:9e:b0:b0) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
ERRO: 2018/07/24 14:36:38.426140 Captured frame from MAC (2e:2e:c5:80:80:c6) to (ae:d5:37:9e:f4:77) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
# prod-worker-10:
ERRO: 2018/07/24 14:35:48.005625 Captured frame from MAC (2e:2e:c5:80:80:c6) to (12:61:a3:9e:b0:b0) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
ERRO: 2018/07/24 14:35:48.003711 Captured frame from MAC (2e:2e:c5:80:80:c6) to (ae:d5:37:9e:f4:77) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
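The “about 100 per hour” figure comes from counting matching lines in a saved copy of the logs; a rough sketch of that tally in Rust follows (the weave.log filename is only an assumption, any dump of the Weave container logs would work):

// Sketch for tallying "Captured frame ... associated with another peer" errors per peer.
// The weave.log path is hypothetical; point it at any saved copy of the Weave logs.
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let reader = BufReader::new(File::open("weave.log")?);
    let marker = "associated with another peer ";
    let mut per_peer: HashMap<String, u64> = HashMap::new();
    for line in reader.lines() {
        let line = line?;
        if let Some(idx) = line.find(marker) {
            // Everything after the marker is the peer, e.g. "66:d9:25:58:38:7c(prod-worker-09)"
            let peer = line[idx + marker.len()..].trim().to_string();
            *per_peer.entry(peer).or_insert(0) += 1;
        }
    }
    for (peer, count) in &per_peer {
        println!("{}: {}", peer, count);
    }
    Ok(())
}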
Additionally, netstat -i shows some TX-DRP on the vxlan-6784 interface on some nodes (columns: Iface, MTU, Met, RX-OK, RX-ERR, RX-DRP, RX-OVR, TX-OK, TX-ERR, TX-DRP, TX-OVR, Flg):
# prod-worker-01 (1972 TX-DRP)
vxlan-6784 65535 0 48350779 0 0 0 52282585 0 1972 0 BMRU
# prod-worker-02 (29 TX-DRP)
vxlan-6784 65535 0 659979670 0 0 0 774399398 0 29 0 BMRU
# prod-worker-03 (74 TX-DRP)
vxlan-6784 65535 0 3232560976 0 0 0 3122201921 0 74 0 BMRU
# prod-worker-04 (206 TX-DRP)
vxlan-6784 65535 0 1266458622 0 0 0 1270704282 0 206 0 BMRU
# prod-worker-05 (88 TX-DRP)
vxlan-6784 65535 0 316929245 0 0 0 268493199 0 88 0 BMRU
# prod-worker-06 (52 TX-DRP)
vxlan-6784 65535 0 280753898 0 0 0 268094608 0 52 0 BMRU
# prod-worker-07 (30 TX-DRP)
vxlan-6784 65535 0 268853355 0 0 0 99970242 0 30 0 BMRU
# prod-worker-08 (0 TX-DRP)
vxlan-6784 65535 0 76848987 0 0 0 78291696 0 0 0 BMRU
# prod-worker-09 (455 TX-DRP)
vxlan-6784 65535 0 255571354 0 0 0 267805374 0 455 0 BMRU
# prod-worker-10 (8 TX-DRP)
vxlan-6784 65535 0 1057054491 0 0 0 1059490874 0 8 0 BMRU
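To watch those drops while the curl loop is running, the same counter can be polled from sysfs instead of re-running netstat. A minimal sketch, assuming the standard /sys/class/net/<iface>/statistics/tx_dropped counter exposed by the kernel:

// Sketch (not part of the original report): watch the vxlan-6784 TX drop counter grow.
use std::fs;
use std::thread::sleep;
use std::time::Duration;

fn read_tx_dropped() -> u64 {
    fs::read_to_string("/sys/class/net/vxlan-6784/statistics/tx_dropped")
        .expect("interface statistics not readable")
        .trim()
        .parse()
        .expect("counter should be an integer")
}

fn main() {
    let mut prev = read_tx_dropped();
    loop {
        sleep(Duration::from_secs(10));
        let now = read_tx_dropped();
        if now != prev {
            println!("tx_dropped grew by {} (total {})", now - prev, now);
        }
        prev = now;
    }
}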
What you expected to happen?
Connections arrive at the correct pod
What happened?
curl produces curl: (56) Recv failure: Connection reset by peer
How to reproduce it?
For us, just sending lots of traffic over the network shows the issue quite well.
Anything else we need to know?
- Microsoft Azure
- Ubuntu 16.04
- Kubernetes 1.11.1
- Kubernetes configured manually; systemd unit examples available on request.
Versions:
Weave:
cpressland@prod-worker-08 ~ sudo docker exec -it $(sudo docker ps | grep weave_weave | cut -c 1-12) ./weave --local version
weave 2.3.0
Docker:
cpressland@prod-worker-08 ~ sudo docker version
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 03:35:14 2017
OS/Arch: linux/amd64
Server:
Version: 17.03.2-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 03:35:14 2017
OS/Arch: linux/amd64
Experimental: false
uname -a:
cpressland@prod-worker-08 ~ uname -a
Linux prod-worker-08 4.15.0-1014-azure #14~16.04.1-Ubuntu SMP Thu Jun 14 15:42:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
kubectl:
cpressland@prod-controller-01 ~ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Logs:
Weave Logs: weave-logs-1-hour.xlsx
Nothing interesting or concerning in the kubelet or docker logs.
Network:
ip route (worker-06):
cpressland@prod-worker-06 ~ ip route
default via 10.2.1.1 dev eth0
10.2.1.0/24 dev eth0 proto kernel scope link src 10.2.1.9
10.32.0.0/12 dev weave proto kernel scope link src 10.34.0.1
168.63.129.16 via 10.2.1.1 dev eth0
169.254.169.254 via 10.2.1.1 dev eth0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
ip -4 -o addr
cpressland@prod-worker-06 ~ ip -4 -o addr
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: eth0 inet 10.2.1.9/24 brd 10.2.1.255 scope global eth0\ valid_lft forever preferred_lft forever
3: docker0 inet 172.17.0.1/16 scope global docker0\ valid_lft forever preferred_lft forever
6: weave inet 10.34.0.1/12 brd 10.47.255.255 scope global weave\ valid_lft forever preferred_lft forever
sudo iptables-save: https://gist.github.com/cpressland/9745d9e2fd9547c06e84b6ba11aede5e
@Hashfyre Please open a new issue. The issue reported in this bug has nothing to do with Weave.