k3s: DNS fails when using more than one node
I’ve been chasing this bug for months: pods can’t talk to CoreDNS on the agent nodes.
This is a fresh 0.8.1 arm64 deployment with 3 nodes (one master, two agents), but the same issue existed with the previous k3s version, on kernel 4.4 or 5.3. The host is Arch, with iptables v1.8.3 (legacy).
Using the default install script:
curl -sfL https://get.k3s.io | K3S_URL=https://rk0:6443 K3S_TOKEN=xxxx sh -
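As a baseline, a quick check on the master that both agents actually joined might look like this:
```sh
# List all nodes with their status and addresses (run on the master).
sudo kubectl get nodes -o wide
```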
Expected: the 10.43.0.10 DNS service (and, I assume, the whole network) should be correctly set up on each node. It’s easy to test, since the host can’t reach the DNS when the problem appears:
dig www.google.com @10.43.0.10
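To run that same test against every host at once, a small loop works too (a sketch; rk1 and rk2 are assumed hostnames for the two agents, only rk0 appears above):
```sh
# Query the 10.43.0.10 cluster DNS service from each host over SSH.
for host in rk0 rk1 rk2; do
  echo "== $host =="
  ssh "$host" dig +short www.google.com @10.43.0.10
done
```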
I found that scaling up the coredns deployment shifts the working node to an agent, leaving the master node unable to reach the DNS:
sudo kubectl scale -n kube-system deployment.v1.apps/coredns --replicas=3
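To see which nodes the replicas actually land on (assuming the stock k8s-app=kube-dns label on the coredns pods):
```sh
# Show the node each coredns replica is scheduled on.
sudo kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
```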
For some reason it sometimes just works from all 3 nodes, but most of the time it doesn’t. I’ve tried starting the agent manually after the boot sequence is complete, with no luck; I’ve compared the iptables output and everything looks fine … I’ve also tried pointing CoreDNS to 8.8.8.8 directly, with no result.
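For anyone who wants to try the same 8.8.8.8 experiment, a sketch of one way to do it (the exact Corefile layout differs between k3s/CoreDNS versions):
```sh
# Edit the CoreDNS Corefile and change the upstream, e.g. replace
#   forward . /etc/resolv.conf
# with
#   forward . 8.8.8.8
sudo kubectl -n kube-system edit configmap coredns
# Delete the coredns pods so they restart with the new Corefile.
sudo kubectl -n kube-system delete pod -l k8s-app=kube-dns
```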
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 15 (1 by maintainers)
I think it would be okay to have an option for host-gw, not sure how @ibuildthecloud feels about it. It would be good to get to the bottom of the issue, though.
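For what it’s worth, later k3s releases added a --flannel-backend flag on the server, so a host-gw setup can be sketched like this:
```sh
# Install/run the k3s server with the host-gw flannel backend instead of
# the default vxlan, via the install script's INSTALL_K3S_EXEC passthrough.
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--flannel-backend=host-gw" sh -
```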
I have a similar issue, not sure if it’s the same. I noticed this problem when deploying k3s to more than one node. In my case it seems the master node cannot resolve DNS while the other nodes can, so any workload ending up on the master fails to connect to things. For example, when deploying external-dns or cert-manager, if they end up on the master they fail.
I have a 1.0 k3s cluster with 3 masters and 2 agents, same problem here. To add some details:
- Only one node can resolve via 10.43.0.10; the others (both masters and agents) can’t.
- Pods with hostNetwork: true can’t resolve via it, while pods on the normal in-cluster network are okay.
- dig SERVICE @10.43.0.10 +tcp is fine on any node.
And I’ve resolved my issue.
Check your firewalls – make sure that your nodes can communicate with each other on UDP port 8472 (assuming you’re using the default VXLAN backend for Flannel).
@akenakh this could explain why the host-gw backend was working and VXLAN was not in your case.
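A sketch of opening that port, assuming plain iptables on the hosts (adapt for firewalld/ufw/nftables as needed):
```sh
# Allow flannel VXLAN traffic between nodes (the default backend uses UDP 8472).
sudo iptables -I INPUT -p udp --dport 8472 -j ACCEPT
# Confirm the rule is present.
sudo iptables -L INPUT -n | grep 8472
```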
Another workaround/fix is to use NodeLocal DNSCache https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/.
Just replace __PILLAR__DNS__SERVER__, __PILLAR__LOCAL__DNS__ and __PILLAR__DNS__DOMAIN__ in the manifest with the desired values. Hope it helps.
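A sketch of that substitution, assuming k3s defaults (cluster DNS 10.43.0.10, cluster domain cluster.local) and the commonly used link-local address 169.254.20.10 for the node-local cache; the manifest path is the one referenced from the linked docs:
```sh
kubedns=10.43.0.10       # existing cluster DNS service IP (k3s default)
domain=cluster.local     # cluster domain
localdns=169.254.20.10   # link-local IP for the node-local cache

curl -LO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml
```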