k3s: the overlay network between nodes seems to be broken in Debian 11 (bullseye)
Environmental Info:
K3s Version:
k3s version v1.21.3+k3s1 (1d1f220f)
go version go1.16.6
Node(s) CPU architecture, OS, and Version:
root@s1:~# uname -a
Linux s1 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64 GNU/Linux
root@s1:~# kubectl get nodes -o wide
NAME   STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION   CONTAINER-RUNTIME
s1     Ready    control-plane,master   17m   v1.21.3+k3s1   10.11.0.101   <none>        Debian GNU/Linux 11 (bullseye)   5.10.0-8-amd64   containerd://1.4.8-k3s1
a1     Ready    <none>                 14m   v1.21.3+k3s1   10.11.0.201   <none>        Debian GNU/Linux 11 (bullseye)   5.10.0-8-amd64   containerd://1.4.8-k3s1
Cluster Configuration:
1 server (without worker role) and 1 agent as configured by my playground at https://github.com/rgl/k3s-vagrant/tree/debian-11-wip.
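For reference, an equivalent minimal two-node setup amounts to roughly the following (a sketch using the standard get.k3s.io install script rather than the playground's exact provisioning scripts; the node-token path is the k3s default):

# on s1 (server, 10.11.0.101)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='v1.21.3+k3s1' sh -s - server --node-ip 10.11.0.101

# on a1 (agent, 10.11.0.201), using the token from /var/lib/rancher/k3s/server/node-token on s1
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='v1.21.3+k3s1' K3S_URL='https://10.11.0.101:6443' K3S_TOKEN='<token>' sh -s - agent --node-ip 10.11.0.201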
Describe the bug:
The traefik dashboard is installed by https://github.com/rgl/k3s-vagrant/blob/debian-11-wip/provision-k3s-server.sh#L62-L167 and is available on all the cluster nodes at:
- http://10.11.0.101:9000/dashboard/ (s1 node)
- http://10.11.0.201:9000/dashboard/ (a1 node)
When running in Debian 10, accessing both addresses works fine (they return the expected web page), but when running in Debian 11, the a1 node address does not work (the request times out).
It does not even work when trying a wget http://10.11.0.201:9000/dashboard/ from the s1 node (it does work from inside the a1 node itself).
While running tcpdump on the s1 and a1 nodes, I can see the SYN packets leave s1 towards a1, but a1 never replies.
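For reference, the captures described above amount to roughly the following (a sketch; it assumes eth1 is the inter-node interface and flannel's default VXLAN UDP port 8472):

# on both s1 and a1: watch the dashboard request itself on the node network
tcpdump -n -i eth1 'tcp port 9000'

# and the VXLAN-encapsulated overlay traffic that should carry it to the traefik pod
tcpdump -n -i eth1 'udp port 8472'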
Do you have any idea why this is happening or what might be blocking this? Or any clue how to make it work?
PS: when launching the server with --flannel-backend 'host-gw' things seem to work, so it seems there's something going on with the vxlan backend.
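For reference, the host-gw workaround above amounts to roughly this (a sketch assuming the standard get.k3s.io install script; --flannel-backend is a regular k3s server flag):

curl -sfL https://get.k3s.io | sh -s - server --flannel-backend host-gw

A workaround that has been reported elsewhere for similar VXLAN symptoms on recent kernels, which I have not verified here, is disabling transmit checksum offload on the flannel VXLAN interface on every node:

ethtool -K flannel.1 tx-checksum-ip-generic off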
Steps To Reproduce:
PS: this is running in KVM VMs (using a virtio NIC) on an Ubuntu 20.04 host. The code is at https://github.com/rgl/k3s-vagrant/tree/wip-vxlan.
Expected behavior:
Expected it to work in Debian 11, like it does in Debian 10.
Actual behavior:
Traffic between nodes in the overlay network does not seem to be working correctly.
Additional context / logs:
- Debian 11 ships by default with cgroup2, but I’ve switched to cgroup1 with https://github.com/rgl/k3s-vagrant/blob/debian-11-wip/provision-cgroup1.sh
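For reference, checking which cgroup hierarchy is active, and one common way of forcing the legacy (v1) hierarchy (a sketch of the general approach, not necessarily exactly what the linked script does):

# prints cgroup2fs on the unified (v2) hierarchy and tmpfs on the legacy (v1) one
stat -fc %T /sys/fs/cgroup/

# force the legacy hierarchy via the kernel command line, then reboot
echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT systemd.unified_cgroup_hierarchy=0"' >/etc/default/grub.d/cgroup1.cfg
update-grub
reboot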
About this issue
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 21 (6 by maintainers)
I’m still seeing this problem on v1.21.4+k3s1 on bullseye, in my case on 2 Hetzner cloud instances. For testing I just increased coredns to 2 replicas to have one on each node, then I tested DNS resolution via gcr.io/kubernetes-e2e-test-images/dnsutils:1.3, querying both the local node endpoint and the remote node endpoint.

kubectl --context test-hcloud-k3s get pods -A -o wide
kubectl --context test-hcloud-k3s describe endpoints kube-dns -n kube-system

Now when I try to query the pod on the master node from the dnsutils container on the master node, everything is working fine:

kubectl --context test-hcloud-k3s exec -it dnsutils -- dig +search kube-dns.kube-system @10.42.0.4
root@test-hc-k3s-master-0:~# tcpdump -i any -nn port 53

Now let’s repeat the same, querying the pod on the agent node from the dnsutils container on the master node:

kubectl --context test-hcloud-k3s exec -it dnsutils -- dig +search kube-dns.kube-system @10.42.1.2
root@test-hc-k3s-master-0:~# tcpdump -i any -nn port 53
root@test-hc-k3s-agent-0:~# tcpdump -i any -nn port 53

It does seem to be an improvement over v1.21.3+k3s1 though, as with v1.21.3+k3s1 I don’t even see a response in tcpdump at all:

kubectl --context test-hcloud-k3s get pods -A -o wide
kubectl --context test-hcloud-k3s describe endpoints kube-dns -n kube-system

local request within the master node:

root@test-hc-k3s-master-0:~# tcpdump -i any -nn port 53

request from the master node to the agent node:

root@test-hc-k3s-master-0:~# tcpdump -i any -nn port 53
root@test-hc-k3s-agent-0:~# tcpdump -i any -nn port 53
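For reference, the test setup above amounts to roughly the following (a sketch; the --context name is specific to my environment and the k8s-app=kube-dns label selector is an assumption about k3s's coredns deployment):

# run coredns on both nodes so there is a local and a remote endpoint
kubectl --context test-hcloud-k3s -n kube-system scale deployment coredns --replicas=2

# start a throwaway pod with dig installed
kubectl --context test-hcloud-k3s run dnsutils --image=gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 --command -- sleep infinity

# find the pod IPs of the two coredns endpoints to query
kubectl --context test-hcloud-k3s -n kube-system get pods -l k8s-app=kube-dns -o wide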