k3s: Pods in different networks are unable to communicate with each other.
This might be related to the discussion on https://github.com/rancher/k3s/pull/881, but with some slight differences.
Version:

```
k3s -v
k3s version v1.18.2+k3s1 (698e444a)
```
K3s arguments:

```
curl -sfL https://get.k3s.io | K3S_TOKEN="${token}" K3S_URL="https://${endpoint}:6443/" sh -s - $(scw-userdata k3s_node_labels) --node-external-ip="$(scw-metadata --cached PUBLIC_IP_ADDRESS)"
```
Describe the bug

I am trying to run a cluster which has some nodes hosted on Scaleway VPS instances and some in another network/provider.
On Scaleway the nodes are assigned private IPs, but they also have a 1:1-mapped routable public IP, and by passing the `--node-external-ip` flag I am able to get proper internal and external IPs assigned to the nodes, i.e.:
```
NAME             STATUS   ROLES    AGE    VERSION        INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                       KERNEL-VERSION   CONTAINER-RUNTIME
k3s-server-001   Ready    <none>   23m    v1.18.2+k3s1   10.64.212.xx    212.47.252.xx   Ubuntu 20.04 LTS               5.4.0-1011-kvm   containerd://1.3.3-k3s2
k3s-server-002   Ready    <none>   39m    v1.18.2+k3s1   10.69.66.xx     51.158.109.xx   Ubuntu 20.04 LTS               5.4.0-1011-kvm   containerd://1.3.3-k3s2
k3s-online-01    Ready    master   4h3m   v1.18.2+k3s1   62.210.202.xx   <none>          Debian GNU/Linux 10 (buster)   4.19.0-9-amd64   containerd://1.3.3-k3s2
```
(Above, the third node is the master and the two Scaleway nodes are workers; this was a test to see whether having the master on a public IP solves the issue or not. The normal setup is the reverse: the two Scaleway nodes are masters and k3s-online-01 is a worker.)
The third node, hosted at a different provider, has only a public IP address.
For test purposes all nodes have been wiped clean; they also had the same OS version (Debian buster), and no firewall is set up.
The issue is that as long as I stay within the Scaleway realm, everything works as expected. When I add an external node from a different network, it shows up as Ready in the cluster, and I am able to deploy pods to it and exec into a shell. However, from those pods I am unable to reach anything on the other network, DNS resolution doesn't work for any address (Kubernetes-internal or the outside world), and I am only able to ping resources on the public internet.
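A typical way to reproduce the symptom from a pod on the external node (the busybox image is illustrative, and this assumes the pod actually lands on that node, e.g. via a nodeSelector):

```
# DNS resolution fails for both cluster-internal and external names...
kubectl run dnstest --rm -it --image=busybox:1.31 --restart=Never -- \
  nslookup kubernetes.default
# ...yet pinging a public IP directly still works:
kubectl run pingtest --rm -it --image=busybox:1.31 --restart=Never -- \
  ping -c 3 1.1.1.1
```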
Is there something I am missing? I have tried flannel with the default, ipsec, and wireguard backends, with no success so far.
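For completeness, a sketch of how those backends would be selected, assuming the standard k3s `--flannel-backend` server flag (my actual invocations also included the env vars shown above):

```
# vxlan is the default; these are the other backends I tried on the server:
curl -sfL https://get.k3s.io | sh -s - server --flannel-backend=ipsec
curl -sfL https://get.k3s.io | sh -s - server --flannel-backend=wireguard
```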
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 22 (7 by maintainers)
For now I've put up a deployable workaround (https://github.com/alekc-go/flannel-fixer) which launches a listener deployment that fixes this annotation on existing nodes and on any new node joining the cluster. I will try to chase it down and debug it on the flannel side of things.
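For the curious, a minimal shell sketch of the idea behind flannel-fixer, assuming flannel honours a `public-ip-overwrite` annotation (the real project is a proper in-cluster controller; the names here are illustrative, not its actual code):

```
# Watch for node events and copy each node's ExternalIP into flannel's
# public-ip-overwrite annotation so cross-provider traffic uses the public IP.
kubectl get nodes -o name --watch | while read -r node; do
  ext_ip=$(kubectl get "$node" \
    -o jsonpath='{.status.addresses[?(@.type=="ExternalIP")].address}')
  [ -n "$ext_ip" ] && kubectl annotate --overwrite "$node" \
    "flannel.alpha.coreos.com/public-ip-overwrite=${ext_ip}"
done
```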
After a bit of investigation I am able to shed more light on this issue.
So, it turns out the problem is indeed in flannel. Even though a node has its private and public IPs correctly reported in `kubectl get nodes -o wide`, in the node annotations flannel reports a `public-ip` which contains the node's private IP.
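The mismatch can be seen by dumping the node's annotations; a quick check, assuming flannel's standard `flannel.alpha.coreos.com/*` annotation keys (node name taken from the table above):

```
kubectl get node k3s-server-001 -o yaml | grep flannel.alpha.coreos.com
# flannel.alpha.coreos.com/public-ip shows 10.64.212.xx (the private IP)
# instead of the node's public 212.47.252.xx
```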
After adding an additional annotation to the node and restarting the k3s agent, I can see in the logs that the new address is picked up, and after that everything begins to work again. I need to do some further testing, but so far the results are encouraging.
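A sketch of that manual fix, assuming the extra annotation is flannel's `public-ip-overwrite` key (the exact annotation and log output were elided above; the IP is from the node table):

```
# Tell flannel to advertise the node's real public IP instead of the private one
kubectl annotate node k3s-server-001 --overwrite \
  flannel.alpha.coreos.com/public-ip-overwrite=212.47.252.xx
# Restart the agent so flannel re-reads the node annotations
systemctl restart k3s-agent
```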