calico: Connection errors when running Calico in eBPF mode with more than one backend pod

Following the discussion on Slack about running Calico on eBPF: https://calicousers.slack.com/archives/CUKP5S64R/p1601550116016000

When running Calico in eBPF mode with a Kubernetes Service exposed on a NodePort and two backend pods, I experienced 43 connection failures out of around 250 thousand requests.

Running the same scenario with a single backend pod resulted in no connection errors at all.

The same setup with Calico in iptables mode also produced no connection errors, regardless of the number of backend pods.

Setup

Kubernetes v1.19.2 (self-managed, the hard way) running on AWS with one master node and two worker nodes. The nodes run Ubuntu 20.04.1 LTS with kernel 5.4.0-1029-aws and Docker runtime 19.3.8.

Calico v3.17.1 in eBPF mode with the flags FELIX_BPFENABLED=true, CALICO_IPV4POOL_IPIP=Never, CALICO_IPV4POOL_VXLAN=Never, and CALICO_IPV4POOL_NAT_OUTGOING=true.
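A minimal sketch of how these flags could be applied to the calico-node DaemonSet, assuming the standard manifest install in kube-system (not necessarily how it was applied here):

# kubectl --context instapro.calico --namespace kube-system set env daemonset/calico-node \
    FELIX_BPFENABLED=true \
    CALICO_IPV4POOL_IPIP=Never \
    CALICO_IPV4POOL_VXLAN=Never \
    CALICO_IPV4POOL_NAT_OUTGOING=true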

Taurus was configured to run 20 concurrent requests for 10 minutes against both nodes on the exposed node port, so each node received around 125 thousand requests.
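The exact Taurus configuration is not included in the issue; a minimal sketch of an equivalent setup (the scenario name is illustrative) would be:

execution:
- concurrency: 20
  hold-for: 10m
  scenario: echoserver

scenarios:
  echoserver:
    requests:
    - http://10.209.0.203:30080
    - http://10.209.2.205:30080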

There was no conversion from eBPF to iptables or vice versa; the cluster was created from scratch with this exact setup.

  • Client IP = 10.199.1.4
  • Node one
    • Node = 10.209.0.203:30080
    • Pod = 10.210.76.132:8080
  • Node two
    • Node = 10.209.2.205:30080
    • Pod = 10.210.79.5:8080

Only the bare-minimum pods were running.

# kubectl --context instapro.calico get pods --all-namespaces
NAMESPACE     NAME                                                                 READY   STATUS    RESTARTS   AGE
echoserver    echoserver-588cbfb4d6-4zqdq                                          1/1     Running   0          39m <- Scaled up and down during the runs
kube-system   calico-kube-controllers-85764cbd48-jpzhj                             1/1     Running   0          3h8m
kube-system   calico-node-qb6rn                                                    1/1     Running   0          104m
kube-system   calico-node-w2p2h                                                    1/1     Running   0          104m
kube-system   calico-node-zs7x4                                                    1/1     Running   0          104m
kube-system   coredns-65f6755d5c-f4vmf                                             1/1     Running   0          120m
kube-system   coredns-65f6755d5c-jgst9                                             1/1     Running   0          3h6m
kube-system   coredns-65f6755d5c-sh68p                                             1/1     Running   0          120m

Service with NodePort

# kubectl --context instapro.calico --namespace echoserver get service echoserver --output yaml
apiVersion: v1
kind: Service
metadata:
  name: echoserver
  namespace: echoserver
spec:
  clusterIP: 10.211.138.184
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 30080
    port: 80
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: echoserver
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}

Deployment

# kubectl --context instapro.calico --namespace echoserver get deployment echoserver --output yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
  namespace: echoserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: echoserver
  template:
    metadata:
      labels:
        app.kubernetes.io/name: echoserver
    spec:
      containers:
        - name: echoserver
          image: gcr.io/google_containers/echoserver:1.10
          ports:
            - name: http
              containerPort: 8080

One pod, no failures

I have run it multiple times. It hasn’t failed so far.

18:43:29 INFO: Test duration: 0:10:13
18:43:29 INFO: Samples count: 309985, 0.00% failures
18:43:29 INFO: Average times: total 0.038, latency 0.038, connect 0.001
18:43:29 INFO: Percentiles:
┌───────────────┬───────────────┐
│ Percentile, % │ Resp. Time, s │
├───────────────┼───────────────┤
│           0.0 │         0.025 │
│          50.0 │         0.034 │
│          90.0 │         0.045 │
│          95.0 │         0.062 │
│          99.0 │         0.089 │
│          99.9 │         0.269 │
│         100.0 │          1.95 │
└───────────────┴───────────────┘
18:43:29 INFO: Request label stats:
┌───────────────────────────┬────────┬─────────┬────────┬───────┐
│ label                     │ status │    succ │ avg_rt │ error │
├───────────────────────────┼────────┼─────────┼────────┼───────┤
│ http://10.209.0.203:30080 │   OK   │ 100.00% │  0.038 │       │
│ http://10.209.2.205:30080 │   OK   │ 100.00% │  0.038 │       │
└───────────────────────────┴────────┴─────────┴────────┴───────┘

One pod on each node, 43 failures

tcpdump captures are available here: https://www.dropbox.com/sh/hydbxyrlo9qdvwa/AACMxFb9YlbwSo1NzZLNGdyHa

I have run it many times. The number of failures varies from run to run, but it always fails.

18:59:49 INFO: Test duration: 0:13:04
18:59:49 INFO: Samples count: 258401, 0.02% failures
18:59:49 INFO: Average times: total 0.050, latency 0.036, connect 0.001
18:59:49 INFO: Percentiles:
┌───────────────┬───────────────┐
│ Percentile, % │ Resp. Time, s │
├───────────────┼───────────────┤
│           0.0 │         0.024 │
│          50.0 │         0.033 │
│          90.0 │         0.041 │
│          95.0 │         0.056 │
│          99.0 │         0.076 │
│          99.9 │         0.233 │
│         100.0 │        256.64 │
└───────────────┴───────────────┘
18:59:49 INFO: Request label stats:
┌───────────────────────────┬────────┬────────┬────────┬────────────────────────────────────────────────┐
│ label                     │ status │   succ │ avg_rt │ error                                          │
├───────────────────────────┼────────┼────────┼────────┼────────────────────────────────────────────────┤
│ http://10.209.0.203:30080 │  FAIL  │ 99.98% │  0.048 │ Non HTTP response message: Connection reset    │
│                           │        │        │        │ Non HTTP response message: Operation timed out │
│ http://10.209.2.205:30080 │  FAIL  │ 99.98% │  0.051 │ Non HTTP response message: Connection reset    │
│                           │        │        │        │ Non HTTP response message: Operation timed out │
└───────────────────────────┴────────┴────────┴────────┴────────────────────────────────────────────────┘

tcpdump commands

Node one
# tcpdump -i any port 30080 and src host 10.199.1.4 -nlvv -w node-10.209.0.203-30080.tcpdump
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
277387 packets captured
277387 packets received by filter
0 packets dropped by kernel

# tcpdump -i calibc48453fc8e src host 10.199.1.4 -nlvv -w node-10.209.0.203-pod.tcpdump
tcpdump: listening on calibc48453fc8e, link-type EN10MB (Ethernet), capture size 262144 bytes
278440 packets captured
278440 packets received by filter
0 packets dropped by kernel

Node two
# tcpdump -i any port 30080 and src host 10.199.1.4 -nlvv -w node-10.209.2.205-30080.tcpdump
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
277433 packets captured
277433 packets received by filter
0 packets dropped by kernel

# tcpdump -i cali8d5c3169ef1 src host 10.199.1.4 -nlvv -w node-10.209.2.205-pod.tcpdump
tcpdump: listening on cali8d5c3169ef1, link-type EN10MB (Ethernet), capture size 262144 bytes
276258 packets captured
276258 packets received by filter
0 packets dropped by kernel
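To locate the failed connections in these captures, one option (not part of the original capture commands) is to read the files back with a filter for TCP resets, e.g.:

# tcpdump -r node-10.209.0.203-30080.tcpdump -nn 'tcp[tcpflags] & (tcp-rst) != 0'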

Other runs

  • 2 pods, all on node one = 19 failures
  • 2 pods, all on node two = 22 failures
  • 10 pods, all on node one = 31 failures
  • 10 pods, all on node two = 26 failures
  • 10 pods, 5 on each node = 24 failures
  • 50 pods, 25 on each node = 38 failures

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 19 (8 by maintainers)

Most upvoted comments

Yes, I just got back to that and added a test so I’m just waiting on review now.