kubernetes: kube-dns: Intermittent DNS issue from pods to external server
When trying to connect to an external server from a pod, we intermittently get a “Could not resolve host” error. From all nodes, we can connect to the host and get a response (no intermittent issue there). From pods, however, we see the intermittent error shown below. Can anyone tell what the reason could be?
bash-4.2$ curl myhost:11102
curl: (52) Empty reply from server
bash-4.2$ curl myhost:11102
curl: (52) Empty reply from server
bash-4.2$ curl myhost.com:11102
curl: (6) Could not resolve host: myhost.com
bash-4.2$ curl myhost:11102
curl: (52) Empty reply from server
I tried most of the options mentioned in this ticket https://github.com/kubernetes/kubernetes/issues/22823 but none of them helped.
- scaled up kube-dns to 4 nodes
- rebooted the nodes
This still did not fix the issue.
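To put a number on "intermittent", the lookup can be repeated from inside a pod and from a node and the failure counts compared. The following is a minimal sketch, not something from the original report: the function name `probe_resolution` and its parameters are hypothetical, and it assumes Python 3 is available in the pod image.

```python
import socket
import time
from typing import Callable, Dict


def probe_resolution(
    host: str,
    attempts: int = 20,
    delay_s: float = 0.5,
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> Dict[str, int]:
    """Resolve `host` repeatedly and count successes and failures."""
    counts = {"ok": 0, "failed": 0}
    for i in range(attempts):
        try:
            resolve(host)
            counts["ok"] += 1
        except OSError:
            # socket.gaierror (name resolution failure) is a subclass of OSError.
            counts["failed"] += 1
        if i < attempts - 1:
            time.sleep(delay_s)
    return counts


# Example, run once inside the pod and once on the node
# ("myhost.com" is the placeholder name from the transcript):
#   print(probe_resolution("myhost.com"))
```

A non-zero `failed` count in the pod alongside a zero count on the node would match the behaviour described here. The `resolve` parameter exists only so the loop can be exercised without a network.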
/sig area/dns @kubernetes/sig-cluster-ops-misc
Kubectl Version:
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:33:11Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Environment
NAME="Amazon Linux AMI"
VERSION="2016.09"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2016.09"
PRETTY_NAME="Amazon Linux AMI 2016.09"
ANSI_COLOR="0;33"
CPE_NAME="cpe:/o:amazon:linux:2016.09:ga"
HOME_URL="http://aws.amazon.com/amazon-linux-ami/"
What happened?
kube-dns intermittently fails to resolve host names from pods, but resolution always works from nodes.
Expectation
Pods should always resolve external servers, just as the nodes do.
About this issue
- State: closed
- Created 7 years ago
- Reactions: 23
- Comments: 59 (21 by maintainers)
@jsravn: you can’t re-open an issue/PR unless you authored it or you are assigned to it.
@kachkaev we’ve had many kube-dns + flannel issues in the past that turned out to be caused by flannel config; you may want to search to see whether that is the case here…
I was experiencing a similar problem to this, with intermittent DNS issues and dnsmasq set up on the host system (for responding to queries for hostnames in /etc/hosts). We had both the local dnsmasq and upstream DNS servers in resolv.conf.
dnsmasq in the kube-dns pod sends the first query for a name out to all of the servers in resolv.conf, but then chooses a single server for further queries. If it chose dnsmasq on the host, everything worked. If it chose another server, the query would fail.
The solution was to remove the upstream servers from /etc/resolv.conf and add them to /etc/dnsmasq.conf, then restart dnsmasq on the host and the kube-dns pod.
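A quick way to check a node for this setup is to read the nameserver entries from its /etc/resolv.conf and see whether a local resolver is listed next to upstream servers. This is only a sketch under assumptions: the helper names are hypothetical, and it assumes the host dnsmasq listens on a loopback address such as 127.0.0.1, which the comment above does not state.

```python
import ipaddress
from typing import List


def parse_nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses listed in resolv.conf content."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", ";")):
            continue  # skip blank lines and comments
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def is_loopback(address: str) -> bool:
    """True for loopback addresses; unparsable entries count as non-loopback."""
    try:
        return ipaddress.ip_address(address).is_loopback
    except ValueError:
        return False


def mixes_local_and_upstream(servers: List[str]) -> bool:
    """True if the list holds both a loopback resolver and another server."""
    flags = [is_loopback(s) for s in servers]
    return any(flags) and not all(flags)


# Example on a node:
#   with open("/etc/resolv.conf") as f:
#       print(mixes_local_and_upstream(parse_nameservers(f.read())))
```

If this returns `True`, the node has the mixed configuration described above, where the result depends on which server dnsmasq in the kube-dns pod settles on.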
@notmaxx we ended up draining and restoring the masters one by one, which “fixed” the issue, but we’ll probably have to update our networking overlay; weave is just causing issues…
After trying CoreDNS in place of kube-dns, the behaviour is exactly the same, so it no longer looks to me like the problem is in dnsmasq. Does anybody have any idea what it might be?
@marceloboeira I found this interesting: https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/
@bowei Thanks for your response.
Could you please advise what’s the best way to get those counters?
I’m not sure what the QPS limit is. I set Google DNS as the upstream servers; I couldn’t find a published value, but I expect it to be pretty high. At least higher than my cluster should generate: it’s not serving any production traffic and only has two pods hosting web apps. It’s being used by maybe 2-5 testers only.
May I ask you to elaborate on how exceeding the QPS limit could explain the behaviour I observed: sometimes it tries to resolve using local zones only, which fails, and sometimes it tries to resolve externally only and works fine? I can’t quite connect all the dots and would very much appreciate it if you could point me in the right direction.
Thanks.
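One thing that may help when reading the "local zones" queries: a glibc-style stub resolver expands a name through the `search` list in the pod's /etc/resolv.conf before (or after) trying it as-is, depending on `ndots`. The sketch below lists the names that would be tried, in order. The function name is hypothetical, and the search list and `ndots:5` used in the example are the usual Kubernetes pod defaults, assumed here rather than taken from this cluster. It does not explain the intermittency on its own; it only shows why queries against cluster-local zones appear for an external name.

```python
from typing import List


def candidate_names(name: str, search: List[str], ndots: int = 1) -> List[str]:
    """List the names a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # a trailing dot marks the name as fully qualified
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]  # too few dots: search domains first


# Example with assumed pod defaults:
#   search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
#   for n in candidate_names("myhost.com", search, ndots=5):
#       print(n)
```

With those assumed defaults, `myhost.com` has fewer than five dots, so the cluster-local candidates are queried first and the external name is tried last. Querying `myhost.com.` with a trailing dot skips the search list entirely, which is a simple way to compare the two paths.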