k3s: Network or DNS problem for some pods

I have a k3s cluster that has been running fine for some time but suddenly started having problems with DNS and/or networking. Unfortunately I haven’t been able to determine what caused it or even what exactly the problem is.

This issue seems related, but according to it, changing the coredns ConfigMap should be enough, and the underlying problem should already be fixed in this release of k3s.

The first sign of trouble was that metrics-server didn’t report metrics for the nodes. I found out that this was because it couldn’t fully scrape metrics and timed out. Further investigation led me to believe that it wasn’t able to resolve the nodes’ hostnames.

To work around the first problem, I added the flags --kubelet-insecure-tls and --kubelet-preferred-address-types=InternalIP to metrics-server. It works, but I don’t like it; everything was working fine before without these flags.
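
For reference, this is roughly how I added them; a sketch that assumes the metrics-server Deployment lives in kube-system and its first container already has an args list:

$ kubectl -n kube-system patch deployment metrics-server --type=json -p '[
    {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"},
    {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-preferred-address-types=InternalIP"}
  ]'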

After this, I realized that the problem was not isolated to metrics-server. Other pods in the cluster are also unable to resolve any hostnames (cluster services or public), and I haven’t been able to find a pattern to it. The cert-manager pod can resolve everything correctly, but my test pods cannot resolve anything no matter which node they run on, same as metrics-server. It is probably also relevant to note that I can reach the internet just fine and look up any public domain names on the nodes directly. I have also tried changing the coredns ConfigMap to use 8.8.8.8 instead of /etc/resolv.conf.
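
For clarity, that ConfigMap change just swaps the upstream in the Corefile’s proxy line (the full ConfigMap is further down), i.e. replacing

    proxy . /etc/resolv.conf

with

    proxy . 8.8.8.8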

System description

The cluster consists of 3 Raspberry Pis running Fedora IoT.

$ kubectl get nodes -o wide
NAME     STATUS   ROLES    AGE    VERSION         INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                             KERNEL-VERSION           CONTAINER-RUNTIME
fili     Ready    master   104d   v1.14.1-k3s.4   10.0.0.13     <none>        Fedora 29.20190606.0 (IoT Edition)   5.1.6-200.fc29.aarch64   containerd://1.2.5+unknown
kili     Ready    <none>   97d    v1.14.1-k3s.4   10.0.0.15     <none>        Fedora 29.20190606.0 (IoT Edition)   5.1.6-200.fc29.aarch64   containerd://1.2.5+unknown
pippin   Ready    <none>   41d    v1.14.1-k3s.4   10.0.0.2      <none>        Fedora 29.20190606.0 (IoT Edition)   5.1.6-200.fc29.aarch64   containerd://1.2.5+unknown

Relevant logs

CoreDNS logs messages like the following when one of the pods is trying to reach a service in another namespace (gitea):

2019-06-14T16:36:16.234Z [ERROR] plugin/errors: 2 gitea.gitea. AAAA: unreachable backend: read udp 10.42.4.93:49037->10.0.0.1:53: i/o timeout
2019-06-14T16:36:16.234Z [ERROR] plugin/errors: 2 gitea.gitea. A: unreachable backend: read udp 10.42.4.93:59310->10.0.0.1:53: i/o timeout
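
These errors are timeouts from the CoreDNS pod (10.42.4.93) to the upstream resolver at 10.0.0.1:53. A quick check of whether the pod network can reach that upstream at all, bypassing CoreDNS (a sketch using the busybox test pod described below):

$ kubectl exec -it busybox -- nslookup www.google.com 10.0.0.1

If this also times out, pod egress to 10.0.0.1:53 is broken, not CoreDNS itself.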

This is from the start of the CoreDNS logs:

$ kubectl -n kube-system logs coredns-695688789-lm947 
.:53
2019-06-12T19:01:15.388Z [INFO] CoreDNS-1.3.0
2019-06-12T19:01:15.389Z [INFO] linux/arm64, go1.11.4, c8f0e94
CoreDNS-1.3.0
linux/arm64, go1.11.4, c8f0e94
2019-06-12T19:01:15.389Z [INFO] plugin/reload: Running configuration MD5 = ef347efee19aa82f09972f89f92da1cf
2019-06-12T19:01:36.395Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:60396->10.0.0.1:53: i/o timeout
2019-06-12T19:01:39.397Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:56286->10.0.0.1:53: i/o timeout
2019-06-12T19:01:42.397Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:38791->10.0.0.1:53: i/o timeout
2019-06-12T19:01:45.399Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:39417->10.0.0.1:53: i/o timeout
2019-06-12T19:01:48.401Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:39276->10.0.0.1:53: i/o timeout
2019-06-12T19:01:51.401Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:36239->10.0.0.1:53: i/o timeout
2019-06-12T19:01:54.403Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:47541->10.0.0.1:53: i/o timeout
2019-06-12T19:01:57.404Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:39486->10.0.0.1:53: i/o timeout
2019-06-12T19:02:00.405Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:53211->10.0.0.1:53: i/o timeout
2019-06-12T19:02:03.405Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:53654->10.0.0.1:53: i/o timeout
2019-06-12T20:03:31.063Z [ERROR] plugin/errors: 2 update.containous.cloud. AAAA: unreachable backend: read udp 10.42.4.93:38504->10.0.0.1:53: i/o timeout
2019-06-12T20:03:36.064Z [ERROR] plugin/errors: 2 update.containous.cloud. AAAA: unreachable backend: read udp 10.42.4.93:38491->10.0.0.1:53: i/o timeout
2019-06-12T20:03:41.570Z [ERROR] plugin/errors: 2 api.github.com. AAAA: unreachable backend: read udp 10.42.4.93:56122->10.0.0.1:53: i/o timeout
2019-06-12T20:03:46.572Z [ERROR] plugin/errors: 2 api.github.com. AAAA: unreachable backend: read udp 10.42.4.93:39048->10.0.0.1:53: i/o timeout
2019-06-13T00:00:50.170Z [ERROR] plugin/errors: 2 stats.drone.ci. AAAA: unreachable backend: read udp 10.42.4.93:38093->10.0.0.1:53: i/o timeout

Cert-manager pod has working DNS:

$ kubectl exec -it -n utils cert-manager-66bc958d96-b6b7k -- nslookup gitea.gitea
nslookup: can't resolve '(null)': Name does not resolve

Name:      gitea.gitea
Address 1: 10.43.111.72 gitea.gitea.svc.cluster.local
[lennart@legolas ~]$ kubectl exec -it -n utils cert-manager-66bc958d96-b6b7k -- nslookup www.google.com
nslookup: can't resolve '(null)': Name does not resolve

Name:      www.google.com
Address 1: 216.58.207.228 arn09s19-in-f4.1e100.net
Address 2: 2a00:1450:400f:80c::2004 arn09s19-in-x04.1e100.net

Debugging DNS with busybox pods:

[lennart@legolas ~]$ kubectl get pods -o wide
NAME           READY   STATUS    RESTARTS   AGE    IP            NODE     NOMINATED NODE   READINESS GATES
busybox        1/1     Running   47         2d     10.42.4.90    pippin   <none>           <none>
busybox-fili   1/1     Running   26         25h    10.42.0.132   fili     <none>           <none>
busybox-kili   1/1     Running   1          116m   10.42.2.167   kili     <none>           <none>
[lennart@legolas ~]$ kubectl exec -it  busybox -- nslookup www.google.com
;; connection timed out; no servers could be reached

command terminated with exit code 1
[lennart@legolas ~]$ kubectl exec -it  busybox -- nslookup gitea.gitea
;; connection timed out; no servers could be reached

command terminated with exit code 1
[lennart@legolas ~]$ kubectl exec -it  busybox-fili -- nslookup www.google.com
Server:		10.43.0.10
Address:	10.43.0.10:53

Non-authoritative answer:
Name:	www.google.com
Address: 2a00:1450:400f:80a::2004

*** Can't find www.google.com: No answer

[lennart@legolas ~]$ kubectl exec -it  busybox-fili -- nslookup gitea.gitea
;; connection timed out; no servers could be reached

command terminated with exit code 1
[lennart@legolas ~]$ kubectl exec -it  busybox-kili -- nslookup www.google.com
Server:		10.43.0.10
Address:	10.43.0.10:53

Non-authoritative answer:
Name:	www.google.com
Address: 2a00:1450:400f:807::2004

*** Can't find www.google.com: No answer

[lennart@legolas ~]$ kubectl exec -it  busybox-kili -- nslookup gitea.gitea
;; connection timed out; no servers could be reached

command terminated with exit code 1
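
As an extra sanity check (along the lines of the Kubernetes DNS debugging guide), the failing pods’ /etc/resolv.conf should point at the cluster DNS service (10.43.0.10):

$ kubectl exec -it busybox -- cat /etc/resolv.conf
$ kubectl exec -it busybox-fili -- cat /etc/resolv.conf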

Description of coredns ConfigMap:

Data
====
Corefile:
----
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods insecure
      upstream
      fallthrough in-addr.arpa ip6.arpa
    }
    hosts /etc/coredns/NodeHosts {
      reload 1s
      fallthrough
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}

NodeHosts:
----
10.0.0.13 fili
10.0.0.2 pippin
10.0.0.15 kili
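
(If you want to compare on your own cluster, the above is the output of describing the ConfigMap, and it can be edited in place to try a different upstream:)

$ kubectl -n kube-system describe configmap coredns
$ kubectl -n kube-system edit configmap coredns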

Some IP-related output: fili-ip-route.txt fili-iptables-save.txt kili-ip-route.txt kili-iptables-save.txt pippin-ip-route.txt pippin-iptables-save.txt
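
(Each file is essentially the output of the corresponding command on that node, e.g.:)

$ ip route > fili-ip-route.txt
$ sudo iptables-save > fili-iptables-save.txt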

If you made it through all that, kudos to you! Sorry for the long description.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 18
  • Comments: 47 (12 by maintainers)

Most upvoted comments

I have encountered this problem many times; it has been bothering me for a long, long time.

It seems to be caused by wrong iptables rules, but I haven’t found the root cause.

The direct symptom is that you cannot access other services via their cluster IP, so none of the pods running on the node can reach the kube-dns service. When I encounter this problem, the following method works:

iptables -F
iptables -X
iptables -F -t nat
iptables -X -t nat

The commands above flush the iptables rules; restarting k3s then recreates them and the problem is resolved. But I don’t know when it will happen again, because after running for hours or days it comes back.
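
(For completeness, the restart step after flushing; this assumes the default k3s systemd unit names, as used in the boot script in the next comment:)

sudo systemctl restart k3s          # on the server node
sudo systemctl restart k3s-agent    # on worker/agent nodes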

The following is a snapshot of the iptables rules (left: when the node is abnormal; right: when it is normal): [image]

@karakanb I’m running a 4-node K3s ARM cluster on Oracle Cloud. On all my nodes I have to run this script on boot (systemd multi-user target); this survives reboots and my WireGuard backplane:

#!/bin/bash
echo "Clearing iptables"
sudo iptables -P INPUT ACCEPT
sudo iptables -P FORWARD ACCEPT
sudo iptables -P OUTPUT ACCEPT
sudo iptables -t nat -F
sudo iptables -t mangle -F
sudo iptables -F
sudo iptables -X

echo "Restarting wireguard interfaces"
INTERFACES=$(sudo ls /etc/wireguard/)
for i in $INTERFACES
do
	i=$(echo $i | sed 's/\.conf//')
	echo "Disabling $i"
	sudo wg-quick down $i
	echo "Enabling $i"
	sudo wg-quick up $i
done

echo "Restarting docker and k3s"
sudo systemctl restart docker
sudo systemctl restart k3s
sudo systemctl restart k3s-agent
sudo chmod 777 /etc/rancher/k3s/k3s.yaml

echo "Done"

The problem is that Oracle Cloud comes with garbage iptables rules that I don’t know how to get rid of, so I nuke the tables and let k3s build them back in.
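
(If you want to replicate this, a minimal sketch of a oneshot unit for that multi-user target; the unit name and script path are placeholders:)

# /etc/systemd/system/reset-k3s-network.service (hypothetical name)
[Unit]
Description=Flush iptables and restart WireGuard/k3s on boot
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
# Point this at wherever you saved the script above
ExecStart=/usr/local/bin/reset-k3s-network.sh

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable reset-k3s-network.service.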

I don’t know if this helps anyone else, but on my Raspberry Pi 4 cluster, DNS inside the cluster stopped working right when I installed the docker.io package. apt-get remove docker.io solved this particular issue for me.

I noticed that the original reporter was also running arm64. I’ve never encountered this problem running on official Raspbian (which I ran for over 6 months). It only started happening when I moved the worker nodes to 64-bit Ubuntu 21.04. I’d guess the relevant difference there is that on Raspbian I used legacy iptables?

I’m having the same issues, but restarting k3s (systemctl restart k3s) on the master and agents fixes it.

Similar to @zackb: after removing docker.io from my master node, rebooting, and killing all pods, everything returned to normal. Looks like it’s related to Docker using the older iptables backend vs. k3s using nftables; mixing both is a recipe for disaster, it seems.
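
(A quick way to check which backend a node is using; iptables 1.8+ prints it in the version string, and on Debian/Ubuntu the alternatives system shows which variant is selected:)

iptables --version                              # prints "(nf_tables)" or "(legacy)" on iptables 1.8+
sudo update-alternatives --display iptables     # Debian/Ubuntu: shows iptables-nft vs iptables-legacy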

I should probably also mention that SELinux is set to permissive and Firewalld is disabled.

I don’t understand why, but if both the master and the worker are on Hetzner it’s OK; if the worker is on Vultr or Oracle Cloud, the DNS error occurs.

K3s (1.21.5+k3s2) (containerd 1.4.11-k3s1) on Ubuntu 20.04.3 LTS (virtual machines).

I tried all the things I could think of and others that were suggested here. Same problem: my pods would have regular issues resolving DNS entries from my BIND9 DNS server on my LAN. I tried the config map, setting it to my local DNS server, and changed resolv.conf on each node to bypass Ubuntu 20.04’s resolvconf setup; nothing worked. I finally got tired of beating my head on my desk and thought I’d try to just set it in my deployment for each pod to use, and, well, this works… what a freaking hack.

For those who need a solid workaround with minimal effort (set DNS policy to None and configure it yourself for each pod): https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
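
(A minimal sketch of what that looks like in a pod template; the nameserver IP is a placeholder for whatever DNS server you want the pod to use:)

spec:
  dnsPolicy: "None"                 # ignore node/cluster DNS defaults entirely
  dnsConfig:
    nameservers:
      - 10.0.0.53                   # placeholder: e.g. your LAN/BIND9 server
    searches:
      - default.svc.cluster.local   # only useful if that server can resolve cluster names
      - svc.cluster.local
      - cluster.local
    options:
      - name: ndots
        value: "5"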

Yeah, that was it. I disabled firewalld, and it’s much better now. I also fully stopped Docker, and I had a leftover microk8s running too, which I fully uninstalled.
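
(For anyone else in the same spot, disabling firewalld persistently is just:)

sudo systemctl disable --now firewalld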

I still have some weirdness in nslookup output and all, but nothing that prevents me from using k3s anymore 👍

Sorry for the noise!

@ericchiang In my situation, no UDP packets are being dropped. Resolving an address by service name is OK; only resolving by pod name fails.

dig @XXX some-service.default.svc.cluster.local → OK
dig @XXX podname-n.some-service.default.svc.cluster.local → FAILED with no address returned