kubeadm: Support for more than one DNS Deployment/Service for faster failover

What keywords did you search in kubeadm issues before filing this one?

“dns”, also skimmed the kubeadm docs looking for a solution

Is this a BUG REPORT or FEATURE REQUEST?

FEATURE REQUEST

Problem description:

When a coredns pod goes down (for example, due to a node running out of disk space, or some other unpredictable event), it takes some time for the kubelet to declare the pod “not ready”, and then for the pod to be removed from the Endpoints and finally from the dataplane. During that time, DNS clients can see their DNS requests and all of their retries go to the failed pod (and thus get no response).

This is due to an interaction between flow-based load balancing for UDP (as used by kube-proxy) and the behaviour of (at least glibc’s) DNS resolver:

When the glibc resolver only has one nameserver and it retries a DNS query, it re-uses the same source port.
This results in all the retry packets being classified as part of one flow by conntrack.
Once a backend coredns pod has been chosen for the flow, the retry packets all go to the same pod.
If the backend pod has failed, the DNS resolution fails until the pod is marked as non-ready (plus extra time for kube-proxy to clean up the dataplane state).
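The flow pinning described above can be sketched with a toy model. This is not kube-proxy's or conntrack's actual code; the `FlowLoadBalancer` class, the pod names, the client IP and the source port are all hypothetical, chosen only to show why a reused source port keeps hitting the same backend.

```python
import random


class FlowLoadBalancer:
    """Toy model of conntrack-style flow pinning (illustrative only)."""

    def __init__(self, backends, seed=0):
        self.backends = list(backends)
        self.flows = {}  # flow key -> backend chosen for that flow
        self.rng = random.Random(seed)

    def route(self, src_ip, src_port, dst_ip, dst_port):
        # A UDP flow is identified by its addresses and ports.
        key = (src_ip, src_port, dst_ip, dst_port, "udp")
        if key not in self.flows:
            # The backend is picked once, when the first packet is seen.
            self.flows[key] = self.rng.choice(self.backends)
        # Later packets with the same key reuse the earlier choice.
        return self.flows[key]


lb = FlowLoadBalancer(["coredns-a", "coredns-b"])

# The resolver retries from the same source port, so every retry
# maps to the same flow key and therefore to the same pod.
retries = [lb.route("10.244.1.5", 40000, "10.96.0.10", 53) for _ in range(3)]
print(retries)
```

In this model all three entries in `retries` are identical: if the chosen pod has failed, every retry is lost until the flow entry is removed.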

The impact is that, with default configuration, DNS resolution times out instead of the retry going to a good pod. This can be fatal for many long-lived applications.

We saw the same behaviour in both iptables mode and IPVS mode, and it’s likely that third-party eBPF dataplanes suffer from the same problem.

Suggested enhancement

Since changing DNS resolver behaviour is infeasible (and none of glibc’s standard configuration options seem to be of much use here), I think the best solution is to:

Deploy two kube-dns deployments
Deploy two kube-dns services, say with well-known IPs 10.96.0.10 and 10.96.0.11
Configure both of those as nameservers for the pods.

Then, if any one coredns pod goes down, it will affect only one service and the resolver will naturally retry via the other service.
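A minimal sketch of why the second service helps, assuming a simplified resolver that tries each configured nameserver in order and repeats the list for a fixed number of attempts. The `resolve` function and the pod names are hypothetical; `pinned_backend` stands in for the backend that the dataplane has already chosen for the client's flow to each service IP.

```python
def resolve(nameservers, pinned_backend, healthy, attempts=2):
    """Toy resolver: try each nameserver in order, repeated `attempts` times.

    Returns (nameserver, pod) for the first query that reaches a healthy
    pod, or None if every try lands on a failed pod.
    """
    for _ in range(attempts):
        for ns in nameservers:
            pod = pinned_backend[ns]  # retries to one IP stay on one pod
            if pod in healthy:
                return ns, pod
    return None


healthy = {"coredns-b", "coredns-c"}  # coredns-a has failed

# One service: every retry is pinned to the failed pod.
one_service = resolve(["10.96.0.10"], {"10.96.0.10": "coredns-a"}, healthy)

# Two services: the retry to the second IP is a separate flow
# that reaches a pod from the other deployment.
two_services = resolve(
    ["10.96.0.10", "10.96.0.11"],
    {"10.96.0.10": "coredns-a", "10.96.0.11": "coredns-c"},
    healthy,
)
print(one_service, two_services)
```

With a single service the lookup fails outright, while with two services it succeeds through 10.96.0.11 after the first query times out.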

About this issue

State: closed
Created a year ago
Comments: 15 (9 by maintainers)

Most upvoted comments

https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/dns/nodelocaldns helps during dns upgrades or downtime as there is a local cache.

A low termination period helps during upgrades; for a node network issue, failover will wait until the node is marked not ready (details in https://github.com/kubernetes-sigs/kubespray/blob/master/docs/kubernetes-reliability.md#fast-update-and-fast-reaction).

/sig network /area dns

pacoxu on Apr 30, 2023