kubernetes: NodeLocal DNSCache breaks external DNS updates

What happened:

Set up NodeLocal DNSCache as documented. While this works and avoids DNS resolution errors when nodes or sites are lost, it also prevents external-dns from making RFC 2136 connections to cluster-external DNS servers.

time="2021-03-03T14:06:35Z" level=info msg="Instantiating new Kubernetes client"
time="2021-03-03T14:06:35Z" level=info msg="Using inCluster-config based on serviceaccount-token"
time="2021-03-03T14:06:35Z" level=info msg="Created Kubernetes client https://172.31.0.1:443"
time="2021-03-03T14:06:37Z" level=info msg="Configured RFC2136 with zone 'xxx.company.com.' and nameserver 'n0211.xxx.company.com:53'"
time="2021-03-03T14:31:58Z" level=error msg="failed to fetch records via AXFR: dial tcp: i/o timeout"
time="2021-03-03T14:32:59Z" level=error msg="failed to fetch records via AXFR: dial tcp: i/o timeout"
time="2021-03-03T14:34:00Z" level=error msg="failed to fetch records via AXFR: dial tcp: i/o timeout"
time="2021-03-03T14:35:00Z" level=error msg="failed to fetch records via AXFR: dial tcp: i/o timeout"
time="2021-03-03T14:36:00Z" level=error msg="failed to fetch records via AXFR: dial tcp: i/o timeout"
time="2021-03-03T14:37:00Z" level=error msg="failed to fetch records via AXFR: dial tcp: i/o timeout"
time="2021-03-03T14:38:01Z" level=error msg="failed to fetch records via AXFR: dial tcp: i/o timeout"

Looks like the netfilter rules accept DNS traffic only for the kube-dns service IP (172.31.0.10) and the node-local cache address (169.254.20.10), denying requests to all other DNS servers:

Chain INPUT (policy ACCEPT 201 packets, 177K bytes)
 pkts bytes target     prot opt in     out     source               destination         
6917K 4056M cali-INPUT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:Cz_u1IQiXIMmKD4c */
    0     0 ACCEPT     udp  --  *      *       0.0.0.0/0            172.31.0.10          udp dpt:53
    0     0 ACCEPT     tcp  --  *      *       0.0.0.0/0            172.31.0.10          tcp dpt:53
    2   174 ACCEPT     udp  --  *      *       0.0.0.0/0            169.254.20.10        udp dpt:53
    0     0 ACCEPT     tcp  --  *      *       0.0.0.0/0            169.254.20.10        tcp dpt:53
4002K 2803M KUBE-FIREWALL  all  --  *      *       0.0.0.0/0            0.0.0.0/0           

[…]
Chain OUTPUT (policy ACCEPT 391 packets, 229K bytes)
 pkts bytes target     prot opt in     out     source               destination         
7002K 1893M cali-OUTPUT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:tVnHkvAo15HuiPy0 */
    0     0 ACCEPT     udp  --  *      *       172.31.0.10          0.0.0.0/0            udp spt:53
    0     0 ACCEPT     tcp  --  *      *       172.31.0.10          0.0.0.0/0            tcp spt:53
 5852 1068K ACCEPT     udp  --  *      *       169.254.20.10        0.0.0.0/0            udp spt:53
    0     0 ACCEPT     tcp  --  *      *       169.254.20.10        0.0.0.0/0            tcp spt:53
7007K 1894M KUBE-FIREWALL  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
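
For reference, these counters can be checked directly on a node; a quick filter for the DNS-related rules shown above (plain iptables -L output, nothing cluster-specific assumed):

  sudo iptables -L INPUT -n -v | grep 'dpt:53'
  sudo iptables -L OUTPUT -n -v | grep 'spt:53'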

What you expected to happen:

NodeLocal DNSCache should allow connections to (selected) external DNS servers so applications and services can advertise themselves.

How to reproduce it (as minimally and precisely as possible):

  • Deploy NodeLocal DNSCache as described
  • Connect to an external DNS server (e.g., host 8.8.8.8 8.8.8.8; a Pod-based variant is sketched below)
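
To run that check from inside the cluster, a minimal sketch using a throwaway busybox Pod (the Pod name and image tag are illustrative; busybox's nslookup takes the DNS server as its second argument, so this queries 8.8.8.8 directly instead of the node-local cache):

  kubectl run dnstest --rm -it --restart=Never --image=busybox:1.28 -- nslookup 8.8.8.8 8.8.8.8

On an affected node this should show whether queries to the external server leave the Pod at all.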

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.19.8
  • Cloud provider or hardware configuration: On-premises, bare-metal
  • OS (e.g: cat /etc/os-release): Ubuntu 18.04.5 LTS
  • Kernel (e.g. uname -a): Linux n0214 5.4.0-60-generic #67~18.04.1-Ubuntu SMP Tue Jan 5 22:01:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug): Calico
  • Others: External DNS server is BIND9

(Possibly) related issues:

/sig network


Most upvoted comments

Some more testing showed that NodeLocal DNSCache does not actually block traffic to the external DNS server, but somehow changes the resolution of external FQDNs: specifying the external DNS server as an absolute FQDN (e.g., --rfc2136-host=my-dns-server.company.com. instead of my-dns-server.company.com, note the trailing dot) makes it work again.
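
This points to the Pod's resolver search path rather than a firewall. A sketch of a typical Pod /etc/resolv.conf in such a cluster (nameserver address and search domains are assumptions based on the service IPs above):

  nameserver 172.31.0.10
  search default.svc.cluster.local svc.cluster.local cluster.local
  options ndots:5

With ndots:5, a name like my-dns-server.company.com (three dots) is first tried with the search suffixes appended (e.g., my-dns-server.company.com.default.svc.cluster.local) before being queried verbatim; the trailing dot marks the name as absolute and skips that expansion entirely.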

For other Pods, I noticed occasional delays in FQDN resolution even when the same name had been resolved quickly a second before. With some Alpine Linux-based Pods, resolution of cluster-external hostnames sometimes fails completely and only absolute names work.

Weird. So far, I cannot observe any of these effects without NodeLocal DNSCache.