k3s: Network or DNS problem for some pods
I have a k3s cluster that has been running fine for some time but suddenly started having problems with DNS and/or networking. Unfortunately I haven’t been able to determine what caused it or even what exactly the problem is.
This issue seems related, but according to that one it should be enough to change the coredns ConfigMap, and it should already be fixed in this release of k3s.
The first sign of trouble was that metrics-server didn’t report metrics for the nodes. I found out this was because it couldn’t fully scrape the metrics and timed out. Further investigation led me to believe that it wasn’t able to resolve the nodes’ hostnames.
To work around that first problem, I added the flags --kubelet-insecure-tls and --kubelet-preferred-address-types=InternalIP. It works, but I don’t like it; everything was working fine without these flags before.
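For reference, this is roughly what that workaround looks like in the metrics-server Deployment (a sketch; the exact manifest layout in k3s may differ):
$ kubectl -n kube-system edit deployment metrics-server
      containers:
      - name: metrics-server
        args:
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP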
After this, I realized that the problem was not isolated to metrics-server. Other pods in the cluster are also unable to resolve any hostnames (cluster services or public). I haven’t been able to find a pattern to it: the cert-manager pod can resolve everything correctly, but my test pods cannot resolve anything no matter which node they run on, just like metrics-server.
It is probably also relevant to note that, directly on the nodes, I can reach the internet just fine and look up any public domain names.
I have also tried changing the coredns ConfigMap to use 8.8.8.8 instead of /etc/resolv.conf.
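Concretely, that meant editing the Corefile in the ConfigMap and swapping the upstream line, roughly like this (the rest of the Corefile, shown further down, stays the same):
$ kubectl -n kube-system edit configmap coredns
  # in the Corefile, replace
  #     proxy . /etc/resolv.conf
  # with
  #     proxy . 8.8.8.8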
System description
The cluster consists of 3 Raspberry Pis running Fedora IoT.
$ kubectl get nodes -o wide
NAME     STATUS   ROLES    AGE    VERSION         INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                             KERNEL-VERSION           CONTAINER-RUNTIME
fili     Ready    master   104d   v1.14.1-k3s.4   10.0.0.13     <none>        Fedora 29.20190606.0 (IoT Edition)   5.1.6-200.fc29.aarch64   containerd://1.2.5+unknown
kili     Ready    <none>   97d    v1.14.1-k3s.4   10.0.0.15     <none>        Fedora 29.20190606.0 (IoT Edition)   5.1.6-200.fc29.aarch64   containerd://1.2.5+unknown
pippin   Ready    <none>   41d    v1.14.1-k3s.4   10.0.0.2      <none>        Fedora 29.20190606.0 (IoT Edition)   5.1.6-200.fc29.aarch64   containerd://1.2.5+unknown
Relevant logs
CoreDNS logs messages like the following when one of the pods is trying to reach a service in another namespace (gitea):
2019-06-14T16:36:16.234Z [ERROR] plugin/errors: 2 gitea.gitea. AAAA: unreachable backend: read udp 10.42.4.93:49037->10.0.0.1:53: i/o timeout
2019-06-14T16:36:16.234Z [ERROR] plugin/errors: 2 gitea.gitea. A: unreachable backend: read udp 10.42.4.93:59310->10.0.0.1:53: i/o timeout
This is from the start of the CoreDNS logs:
$ kubectl -n kube-system logs coredns-695688789-lm947
.:53
2019-06-12T19:01:15.388Z [INFO] CoreDNS-1.3.0
2019-06-12T19:01:15.389Z [INFO] linux/arm64, go1.11.4, c8f0e94
CoreDNS-1.3.0
linux/arm64, go1.11.4, c8f0e94
2019-06-12T19:01:15.389Z [INFO] plugin/reload: Running configuration MD5 = ef347efee19aa82f09972f89f92da1cf
2019-06-12T19:01:36.395Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:60396->10.0.0.1:53: i/o timeout
2019-06-12T19:01:39.397Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:56286->10.0.0.1:53: i/o timeout
2019-06-12T19:01:42.397Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:38791->10.0.0.1:53: i/o timeout
2019-06-12T19:01:45.399Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:39417->10.0.0.1:53: i/o timeout
2019-06-12T19:01:48.401Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:39276->10.0.0.1:53: i/o timeout
2019-06-12T19:01:51.401Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:36239->10.0.0.1:53: i/o timeout
2019-06-12T19:01:54.403Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:47541->10.0.0.1:53: i/o timeout
2019-06-12T19:01:57.404Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:39486->10.0.0.1:53: i/o timeout
2019-06-12T19:02:00.405Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:53211->10.0.0.1:53: i/o timeout
2019-06-12T19:02:03.405Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:53654->10.0.0.1:53: i/o timeout
2019-06-12T20:03:31.063Z [ERROR] plugin/errors: 2 update.containous.cloud. AAAA: unreachable backend: read udp 10.42.4.93:38504->10.0.0.1:53: i/o timeout
2019-06-12T20:03:36.064Z [ERROR] plugin/errors: 2 update.containous.cloud. AAAA: unreachable backend: read udp 10.42.4.93:38491->10.0.0.1:53: i/o timeout
2019-06-12T20:03:41.570Z [ERROR] plugin/errors: 2 api.github.com. AAAA: unreachable backend: read udp 10.42.4.93:56122->10.0.0.1:53: i/o timeout
2019-06-12T20:03:46.572Z [ERROR] plugin/errors: 2 api.github.com. AAAA: unreachable backend: read udp 10.42.4.93:39048->10.0.0.1:53: i/o timeout
2019-06-13T00:00:50.170Z [ERROR] plugin/errors: 2 stats.drone.ci. AAAA: unreachable backend: read udp 10.42.4.93:38093->10.0.0.1:53: i/o timeout
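(Note that the unreachable backend in these log lines, 10.0.0.1:53, is the upstream resolver CoreDNS picks up from /etc/resolv.conf. A quick way to check whether that upstream is reachable from the pod network at all, assuming a busybox test pod like the ones used below, is:)
$ kubectl exec -it busybox -- nslookup www.google.com 10.0.0.1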
Cert-manager pod has working DNS:
$ kubectl exec -it -n utils cert-manager-66bc958d96-b6b7k -- nslookup gitea.gitea
nslookup: can't resolve '(null)': Name does not resolve
Name: gitea.gitea
Address 1: 10.43.111.72 gitea.gitea.svc.cluster.local
[lennart@legolas ~]$ kubectl exec -it -n utils cert-manager-66bc958d96-b6b7k -- nslookup www.google.com
nslookup: can't resolve '(null)': Name does not resolve
Name: www.google.com
Address 1: 216.58.207.228 arn09s19-in-f4.1e100.net
Address 2: 2a00:1450:400f:80c::2004 arn09s19-in-x04.1e100.net
Debugging DNS with busybox pods:
[lennart@legolas ~]$ kubectl get pods -o wide
NAME           READY   STATUS    RESTARTS   AGE    IP            NODE     NOMINATED NODE   READINESS GATES
busybox        1/1     Running   47         2d     10.42.4.90    pippin   <none>           <none>
busybox-fili   1/1     Running   26         25h    10.42.0.132   fili     <none>           <none>
busybox-kili   1/1     Running   1          116m   10.42.2.167   kili     <none>           <none>
[lennart@legolas ~]$ kubectl exec -it busybox -- nslookup www.google.com
;; connection timed out; no servers could be reached
command terminated with exit code 1
[lennart@legolas ~]$ kubectl exec -it busybox -- nslookup gitea.gitea
;; connection timed out; no servers could be reached
command terminated with exit code 1
[lennart@legolas ~]$ kubectl exec -it busybox-fili -- nslookup www.google.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Non-authoritative answer:
Name: www.google.com
Address: 2a00:1450:400f:80a::2004
*** Can't find www.google.com: No answer
[lennart@legolas ~]$ kubectl exec -it busybox-fili -- nslookup gitea.gitea
;; connection timed out; no servers could be reached
command terminated with exit code 1
[lennart@legolas ~]$ kubectl exec -it busybox-kili -- nslookup www.google.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Non-authoritative answer:
Name: www.google.com
Address: 2a00:1450:400f:807::2004
*** Can't find www.google.com: No answer
[lennart@legolas ~]$ kubectl exec -it busybox-kili -- nslookup gitea.gitea
;; connection timed out; no servers could be reached
command terminated with exit code 1
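(One check that helps separate a CoreDNS problem from a ClusterIP/iptables problem is to query the CoreDNS pod IP directly and compare it with the service IP; 10.42.4.93 is the CoreDNS pod IP taken from the logs above, substitute your own:)
$ kubectl exec -it busybox -- cat /etc/resolv.conf
$ kubectl exec -it busybox -- nslookup www.google.com 10.43.0.10    # via the kube-dns service IP
$ kubectl exec -it busybox -- nslookup www.google.com 10.42.4.93    # via the CoreDNS pod IP directly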
Description of coredns ConfigMap:
Data
====
Corefile:
----
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    hosts /etc/coredns/NodeHosts {
        reload 1s
        fallthrough
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
NodeHosts:
----
10.0.0.13 fili
10.0.0.2 pippin
10.0.0.15 kili
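(Since the Corefile above enables the reload plugin, edits to this ConfigMap should be picked up automatically after a short delay; if they are not, deleting the CoreDNS pod forces a restart. The label selector below is the usual one from the bundled manifest and may differ:)
$ kubectl -n kube-system delete pod -l k8s-app=kube-dns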
Some IP-related output (ip route and iptables-save per node): fili-ip-route.txt fili-iptables-save.txt kili-ip-route.txt kili-iptables-save.txt pippin-ip-route.txt pippin-iptables-save.txt
If you made it through all that, kudos to you! Sorry for the long description.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 18
- Comments: 47 (12 by maintainers)
I have encountered this problem many times; it has been bothering me for a long, long time.
It seems to be caused by wrong iptables rules, but I didn’t find the root cause.
The direct symptom is that you cannot access other services via their cluster IP, so none of the pods running on the node can reach the kube-dns service. When I encounter this problem, the following method works:
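(The exact command is not included in this excerpt; a typical flush sequence along these lines is what is being described, assuming the default iptables backend:)
$ iptables -P INPUT ACCEPT && iptables -P FORWARD ACCEPT && iptables -P OUTPUT ACCEPT
$ iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
$ systemctl restart k3s        # k3s-agent on worker nodes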
The command above flushes the iptables rules; restarting k3s then recreates them and the problem is resolved. But I don’t know when it will happen again, because after running for hours or days it comes back.
The following is a snapshot of the iptables rules (the left is when the node is abnormal, the right is when it is normal):
@karakanb I’m running a 4-node K3s ARM cluster on Oracle cloud. On all my nodes I have to run this script on boot (systemd multi-user target). This survives reboots and my wireguard backplane:
The problem is that Oracle Cloud comes with garbage iptables rules that I don’t know how to get rid of, so I nuke the tables and let k3s build them back up.
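(The script itself is not included above; a minimal sketch of that kind of boot-time cleanup, as a hypothetical oneshot unit ordered before k3s, might look like this:)
# /etc/systemd/system/flush-iptables.service (hypothetical unit name)
[Unit]
Description=Flush provider iptables rules before k3s starts
Before=k3s.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/iptables -P INPUT ACCEPT
ExecStart=/usr/sbin/iptables -P FORWARD ACCEPT
ExecStart=/usr/sbin/iptables -F
ExecStart=/usr/sbin/iptables -t nat -F
ExecStart=/usr/sbin/iptables -t mangle -F
ExecStart=/usr/sbin/iptables -X

[Install]
WantedBy=multi-user.target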
I don’t know if this helps anyone else, but on my Raspberry Pi 4 cluster I installed the docker.io package and that’s when DNS inside the cluster stopped working. apt-get remove docker.io solved this particular issue for me.
I noticed that the original reporter was also running arm64. I’ve never encountered this problem running on official Raspbian (which I ran for over 6 months). It only started happening when I moved the worker nodes to Ubuntu 21.04 64-bit. I’d guess the relevant difference there is that on Raspbian I used legacy iptables?
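(For anyone comparing setups: on Debian/Ubuntu you can check which iptables backend is in use, and switch to the legacy one, roughly like this:)
$ iptables --version                 # reports "(legacy)" or "(nf_tables)"
$ sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
$ sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy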
I’m having the same issues, but restarting k3s (systemctl restart k3s) on master and agents fixes it.
Similar to @zackb, after removing docker.io from my master node, rebooting, and killing all pods, everything returned back to normal. Looks like it’s related to docker using the older iptables vs k3s using nftables; mixing both is a recipe for disaster, it seems. I should probably also mention that SELinux is set to permissive and Firewalld is disabled.
https://rancher.com/docs/k3s/latest/en/installation/network-options/#flannel-options
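(That page covers the flannel backend options; for example, switching away from the default vxlan backend at install time looks roughly like this:)
$ curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--flannel-backend=host-gw" sh -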
I don’t understand why, but if the master is on Hetzner and the worker is also on Hetzner, it’s OK; if the worker is on Vultr or Oracle Cloud, the DNS error occurs.
K3s (1.21.5+k3s2) (containerd 1.4.11-k3s1) on Ubuntu 20.04.3 LTS (virtual machines).
I tried all the things I could think of and others that were suggested here. Same problem: my pods would have regular issues resolving DNS entries from my BIND9 DNS server on my LAN. I tried setting the config map to my local DNS server and changed resolv.conf on each node to bypass Ubuntu 20.04’s system resolvconf config; nothing worked. I finally got tired of beating my head on my desk and thought I’d just try setting it in the deployment for each pod to use. Well, this works… what a freaking hack.
For those who need a solid workaround with minimal effort (set DNS policy to None and configure it yourself for each pod): https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
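(A sketch of what that per-pod override looks like in a Deployment’s pod template spec; the nameserver address and search domain here are made up, substitute your own:)
    spec:
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
        - 10.0.0.53          # hypothetical LAN BIND9 server
        searches:
        - home.example       # hypothetical local search domain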
Yeah, that was it. I disabled firewalld, much better now. I also fully stopped docker, and I had a leftover microk8s running too, which I fully uninstalled.
I still have some weirdness in nslookup output and all, but nothing that prevents me from using k3s anymore 👍
Sorry for the noise!
@ericchiang In my situation, no UDP packets are being dropped. Resolving an address by service name is OK; only resolving by pod name fails.
dig @XXX some-service.default.svc.cluster.local              -> OK
dig @XXX podname-n.some-service.default.svc.cluster.local    -> FAILED, no address returned