calico: Calico in eBPF + VXLAN mode (and WireGuard): services cannot be accessed from outside the cluster
Expected Behavior
Services of type LoadBalancer or NodePort can be accessed normally regardless of where the backend pods run, and the source IP is preserved.
Current Behavior
Services of type LoadBalancer or NodePort cannot be reliably accessed from outside the cluster:
- From inside the cluster, the service works on every node 100% of the time (whether or not the node hosts a backend pod)
- From outside, many connections time out; the service is (sometimes) reachable only when a backend pod is located on the accessed node
Steps to Reproduce (for bugs)
- Install kubernetes
- Install keepalived to move a VIP
- Install calico (see manifest and config below)
- Install metallb (only controller, see config below)
- Install a service of type LoadBalancer with backend pods (in my case nginx ingress controller)
- Run curl http://<public_VIP> multiple times. Many requests fail; the only successes occur when the node holding the VIP has a backend pod.
- Run curl http://<node_IP>:<node_port>; the same behaviour can be seen.
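The repeated curl test above can be scripted so the failure rate is visible at a glance (an operational sketch, not part of the original report; <public_VIP> is a placeholder for the keepalived-managed VIP and must be substituted before running):

```
# Hit the LoadBalancer VIP 20 times and count successes vs. timeouts.
VIP=<public_VIP>            # placeholder: the keepalived-managed VIP
ok=0; fail=0
for i in $(seq 1 20); do
  if curl -s -o /dev/null --max-time 3 "http://${VIP}/"; then
    ok=$((ok + 1))
  else
    fail=$((fail + 1))
  fi
done
echo "ok=${ok} fail=${fail}"
```

With the bug present, the expectation is a high fail count unless the VIP currently sits on a node that hosts a backend pod.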
Context
Each node has two interfaces (not counting the ones created by Calico):
- eth0:
  - public IP from the provider
  - can receive the VIP
- wg1 (set up manually via Ansible, not by Calico):
  - private IP meshing the whole cluster (used as the kubelet node IP)
Felix Config
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  bpfEnabled: true
  bpfExternalServiceMode: Tunnel
  bpfLogLevel: Debug
  ipipEnabled: false
  logSeverityScreen: Info
  reportingInterval: 0s
  vxlanEnabled: true
  vxlanMTU: 1370
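For reference, Felix's eBPF dataplane also supports a DSR external service mode as an alternative to Tunnel. This is a config sketch, not a recommendation from the report: DSR changes how externally-originated traffic is returned (backends reply directly to the client instead of tunnelling back through the ingress node), and it only works if the underlying network permits that direct return path:

```
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  bpfEnabled: true
  # Alternative to "Tunnel"; requires the fabric to allow
  # nodes to send return traffic directly to the client.
  bpfExternalServiceMode: DSR
```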
Default (and single) IPPool:
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  blockSize: 26
  cidr: 10.1.128.0/17
  ipipMode: Never
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Always
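The active pool configuration can be checked after install (assuming calicoctl is installed and configured against this cluster):

```
calicoctl get ippool default-ipv4-ippool -o yaml
```

With nodeSelector all() and vxlanMode Always, every node uses this pool and all inter-node pod traffic is VXLAN-encapsulated.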
Manifest (Diff from calico-vxlan.yaml)
--- arch-cloud/roles/kube-calico/calico-vxlan.yaml 2021-08-02 22:25:06.359634474 +0200
+++ calico-vxlan.yaml 2021-08-06 23:39:51.393404973 +0200
@@ -15,7 +15,7 @@
# Configure the MTU to use for workload interfaces and tunnels.
# By default, MTU is auto-detected, and explicitly setting this field should not be required.
# You can override auto-detection by providing a non-zero value.
- veth_mtu: "0"
+ veth_mtu: "1370"
# The CNI network configuration to install on each node. The special
# values in this config will be automatically populated.
@@ -3879,14 +3879,11 @@
configMapKeyRef:
name: calico-config
key: veth_mtu
- # Disable AWS source-destination check on nodes.
- - name: FELIX_AWSSRCDSTCHECK
- value: Disable
# The default IPv4 pool to create on startup if none exists. Pod IPs will be
# chosen from this range. Changing this value after installation will have
# no effect. This should fall within `--cluster-cidr`.
- # - name: CALICO_IPV4POOL_CIDR
- # value: "192.168.0.0/16"
+ - name: CALICO_IPV4POOL_CIDR
+ value: "10.1.128.0/17"
# Disable file logging so `kubectl logs` works.
- name: CALICO_DISABLE_FILE_LOGGING
value: "true"
@@ -3970,7 +3967,7 @@
# Used to install CNI.
- name: cni-bin-dir
hostPath:
- path: /opt/cni/bin
+ path: /usr/lib/cni/
- name: cni-net-dir
hostPath:
path: /etc/cni/net.d
MetalLB chart values:
configInline:
  address-pools:
  - name: default
    protocol: layer2
    addresses:
    - <Public_VIP>/32
speaker:
  enabled: false
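For completeness, the service under test looks roughly like this. This is a sketch, not the reporter's actual manifest: the name and selector are placeholders, and externalTrafficPolicy: Local is the stock Kubernetes way to preserve client source IP (Calico's eBPF mode aims to preserve it even with Cluster, per the expected behaviour above):

```
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx          # placeholder name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: ingress-nginx         # placeholder selector
  ports:
  - port: 80
    targetPort: 80
```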
Your Environment
- Calico version: 3.19.3 and 3.20.0, at least (both affected)
- Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.21
- Operating System and version: Archlinux (latest)
- Cloud provider: oneprovider.com (Region: Paris, which in this case looks like old online.net/scaleway servers rebranded)
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 18 (7 by maintainers)
What @apsega describes also matches the problem I see perfectly. When I set up the wg1 interface (just set it up, configuring the peers and an IP via systemd-networkd), it stops working 4 or 5 seconds later. I can get the traffic flowing normally again by doing:
Note: I've added a 50-interfaces-exception.network which contains:
I was about to investigate why WireGuard would cause such behaviour, but thanks to @apsega (what a timing!) I assume WireGuard is not at fault here.
Log line seen for the interface: wg1------I: Drop packets with IP options
cc @mazdakn