dns: node-cache is not working in some cases.

I found that if a pod is configured with

  • dnsPolicy: ClusterFirstWithHostNet
  • hostNetwork: true

then its DNS queries go to kube-dns directly instead of node-cache. Here is an example manifest I tested with.

apiVersion: v1
kind: Pod
metadata:
  name: bug-test
  namespace: default
spec:
  containers:
  - command:
    - sh
    - -c
    - |
      watch -n 1 nslookup google.com
    image: library/alpine
    name: bug-test
    resources:
      requests:
        cpu: 10m
        memory: 10Mi
  dnsPolicy: ClusterFirstWithHostNet
  hostNetwork: true
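
One way to confirm where these queries actually end up is to check the conntrack table on the node running the pod: traffic handled by node-cache is NOTRACK’ed, so conntrack entries for DNS to the kube-dns service IP indicate the queries were connection tracked and DNAT’ed straight to a kube-dns pod. This is only a diagnostic sketch; it assumes the kube-dns service IP 10.96.3.45 used later in this thread.

# On the node running the pod above: any UDP/53 entries for the
# kube-dns service IP mean those queries bypassed node-cache.
conntrack -L -d 10.96.3.45 -p udp | grep dport=53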

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 20 (12 by maintainers)

Most upvoted comments

Thanks @pacoxu. What’s happening in hostNetwork pods is that the packet is locally generated and sent to the kube-dns service IP.

  1. The kube-dns service IP is local on the “nodelocaldns” interface. However, in the nat OUTPUT chain, the packet gets DNAT’ed to a kube-dns pod IP. The NOTRACK rules will not help here, since those are in the raw table’s PREROUTING chain.
  2. Once the packet gets DNAT’ed, if it picked a kube-dns pod on a different node, it is sent there with the kube-dns service IP as the source IP, leaving from the nodelocaldns interface.
  3. The kube-dns pod on the other node gets the packet and replies to the kube-dns service IP. That reply is delivered locally, since that node also has a nodelocaldns interface with the same IP, so the reply packet never leaves that node at all.

In step 2, if a kube-dns pod had been running on the same node and the packet had been DNAT’ed to that pod IP, it would have worked; otherwise it does not.
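
What makes step 3 possible is that node-cache assigns the kube-dns service IP to the nodelocaldns dummy interface on every node, so the kernel treats it as a local address. A quick way to see this on a node (illustrative only, again assuming the service IP 10.96.3.45):

# The kube-dns service IP is configured on the dummy interface...
ip addr show dev nodelocaldns
# ...so the kernel resolves it via the local routing table and never
# forwards packets addressed to it off the node.
ip route get 10.96.3.45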

There are 2 solutions to this:

  1. As @pacoxu suggested, add the rule:
iptables -t nat -I OUTPUT 1 -d <kube-dns ip> -o lo -j RETURN

This will ensure the packet is not DNAT’ed to the kube-dns pod IP and is therefore delivered locally. However, this rule has to be the first one in the OUTPUT chain, above the rules added by kube-proxy.

  2. Send the packet out, but use the node IP:
iptables -t nat -A POSTROUTING -s <kube-dns ip> -j MASQUERADE

Option 1 seems to be the better solution, since requests will continue to use the node-local cache in that case, and there will be fewer address-translation steps. I will add this to the node-cache code and try it out.

I met a similar issue, and my solution was to add the OUTPUT iptables rule below. @axot

iptables -t nat -I OUTPUT 1 -d 10.96.3.45 -o lo -j RETURN

https://github.com/coredns/coredns/issues/3097#issuecomment-520290887

The issue with the nat OUTPUT rule is that we have to constantly ensure that the rule is the first one in the chain; kube-proxy will also be installing its jump to the KUBE-SERVICES chain as the first rule. So we could still have temporary timeouts during the intervals when the kube-proxy rules are above the nodelocaldns rules. Did you observe this, @pacoxu?
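
For reference, the ordering can be checked on a node like this (illustrative; the RETURN rule for the kube-dns service IP has to stay above kube-proxy’s jump to KUBE-SERVICES):

# List the nat OUTPUT chain with rule positions.
iptables -t nat -L OUTPUT -n --line-numbers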

The best option here in my opinion is to use the NOTRACK action like we have done in all the other chains:

iptables -t raw -I OUTPUT -d <nodelocal listen ip> -j NOTRACK

This will ensure that locally generated packets meant for nodelocaldns are not connection tracked/NAT’ed. I just tried this, and it seems to work.
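
For completeness, a minimal sketch of the raw-table rules with this addition, next to the PREROUTING rule mentioned above. This is only a sketch, not the exact rule set node-cache installs; 10.96.3.45 stands in for the kube-dns service IP from the earlier comment.

# Existing behaviour: skip conntrack for DNS packets from pods on this
# node that arrive for the kube-dns service IP (raw PREROUTING).
iptables -t raw -I PREROUTING -d 10.96.3.45 -p udp --dport 53 -j NOTRACK
# New rule: also skip conntrack for locally generated DNS packets from
# hostNetwork pods (raw OUTPUT), so they are never DNAT'ed and are
# delivered to node-cache on the local interface.
iptables -t raw -I OUTPUT -d 10.96.3.45 -p udp --dport 53 -j NOTRACK

Because these live in the raw table, they do not compete with kube-proxy’s nat rules for position, which avoids the ordering problem described above.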