k3s: DNS fails when using more than one node
I’ve been chasing this bug for months: pods can’t talk to CoreDNS on the agent nodes.
This is a fresh 0.8.1 arm64 deployment with 3 nodes (one master, two agents), but the same issue existed with the previous k3s version, on kernel 4.4 or 5.3. The host is Arch, with iptables v1.8.3 (legacy).
Using the default install script:
curl -sfL https://get.k3s.io | K3S_URL=https://rk0:6443 K3S_TOKEN=xxxx sh -
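As a baseline, a quick check on the master that both agents actually joined might look like this:
```sh
# List all nodes with their status and addresses (run on the master).
sudo kubectl get nodes -o wide
```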
Expected: the 10.43.0.10 DNS service (and, I assume, the whole network) should be correctly set up on each node. It’s easy to test, since the host can’t reach the DNS when the problem appears:
dig www.google.com @10.43.0.10
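To run that same test against every host at once, a small loop works too (a sketch; rk1 and rk2 are assumed hostnames for the two agents, only rk0 appears above):
```sh
# Query the 10.43.0.10 cluster DNS service from each host over SSH.
for host in rk0 rk1 rk2; do
  echo "== $host =="
  ssh "$host" dig +short www.google.com @10.43.0.10
done
```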
I found that scaling up the coredns deployment shifts the working node to an agent, leaving the master node unable to reach the DNS:
sudo kubectl scale -n kube-system deployment.v1.apps/coredns --replicas=3
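To see which nodes the replicas actually land on (assuming the stock k8s-app=kube-dns label on the coredns pods):
```sh
# Show the node each coredns replica is scheduled on.
sudo kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
```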
For some reason it sometimes just works from all 3 nodes, but most of the time it doesn’t. I’ve tried starting the agent manually after the boot sequence is complete, with no luck; I’ve compared the iptables output and everything looks fine … I’ve also tried pointing CoreDNS to 8.8.8.8 directly, with no result.
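For anyone who wants to try the same 8.8.8.8 experiment, a sketch of one way to do it (the exact Corefile layout differs between k3s/CoreDNS versions):
```sh
# Edit the CoreDNS Corefile and change the upstream, e.g. replace
#   forward . /etc/resolv.conf
# with
#   forward . 8.8.8.8
sudo kubectl -n kube-system edit configmap coredns
# Delete the coredns pods so they restart with the new Corefile.
sudo kubectl -n kube-system delete pod -l k8s-app=kube-dns
```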
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 15 (1 by maintainers)
I think it would be okay to have an option for host-gw, not sure how @ibuildthecloud feels about it. It would be good to get to the bottom of the issue, though.
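For what it’s worth, later k3s releases added a --flannel-backend flag on the server, so a host-gw setup can be sketched like this:
```sh
# Install/run the k3s server with the host-gw flannel backend instead of
# the default vxlan, via the install script's INSTALL_K3S_EXEC passthrough.
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--flannel-backend=host-gw" sh -
```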
I have a similar issue, not sure if it’s the same. I noticed this problem when deploying k3s to more than one node. In my case it seems the master node cannot resolve DNS while the other nodes can, so any workload ending up on the master fails to connect to things. For example, when deploying external-dns or cert-manager, if they end up on the master they fail.
I have a 1.0 k3s cluster with 3 masters and 2 agents, same problem here. To add some details:
- Only one node can resolve via 10.43.0.10; the others (both masters and agents) can’t.
- Pods with hostNetwork: true can’t resolve via it, while pods on the normal in-cluster network are okay.
- dig SERVICE @10.43.0.10 +tcp is fine on any node.
And I’ve resolved my issue.
Check your firewalls – make sure that your nodes can communicate with each other on UDP port 8472 (assuming you’re using the default VXLAN backend for Flannel).
@akenakh this could explain why the host-gw backend was working and VXLAN was not in your case.
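A sketch of opening that port, assuming plain iptables on the hosts (adapt for firewalld/ufw/nftables as needed):
```sh
# Allow flannel VXLAN traffic between nodes (the default backend uses UDP 8472).
sudo iptables -I INPUT -p udp --dport 8472 -j ACCEPT
# Confirm the rule is present.
sudo iptables -L INPUT -n | grep 8472
```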
Another workaround/fix is to use NodeLocal DNSCache https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/.
Just replace __PILLAR__DNS__SERVER__, __PILLAR__LOCAL__DNS__ and __PILLAR__DNS__DOMAIN__ in the manifest with the desired values. Hope it helps.
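A sketch of that substitution, assuming k3s defaults (cluster DNS 10.43.0.10, cluster domain cluster.local) and the commonly used link-local address 169.254.20.10 for the node-local cache; the manifest path is the one referenced from the linked docs:
```sh
kubedns=10.43.0.10       # existing cluster DNS service IP (k3s default)
domain=cluster.local     # cluster domain
localdns=169.254.20.10   # link-local IP for the node-local cache

curl -LO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml
```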