dns: node-cache is not working in some cases.

I found that if a pod is configured with

  • dnsPolicy: ClusterFirstWithHostNet
  • hostNetwork: true

then its DNS queries go to kube-dns directly instead of node-cache. Here is an example manifest I tested with.

apiVersion: v1
kind: Pod
metadata:
  name: bug-test
  namespace: default
spec:
  containers:
  - command:
    - sh
    - -c
    - |
      watch -n 1 nslookup google.com
    image: library/alpine
    name: bug-test
    resources:
      requests:
        cpu: 10m
        memory: 10Mi
  dnsPolicy: ClusterFirstWithHostNet
  hostNetwork: true
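
One way to confirm where these queries actually end up is to check the conntrack table on the node running the pod: traffic handled by node-cache is NOTRACK’ed, so conntrack entries for DNS to the kube-dns service IP indicate the queries were connection tracked and DNAT’ed straight to a kube-dns pod. This is only a diagnostic sketch; it assumes the kube-dns service IP 10.96.3.45 used later in this thread.

# On the node running the pod above: any UDP/53 entries for the
# kube-dns service IP mean those queries bypassed node-cache.
conntrack -L -d 10.96.3.45 -p udp | grep dport=53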

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 20 (12 by maintainers)

Most upvoted comments

Thanks @pacoxu. What’s happening in hostNetwork pods is that the packet is locally generated and sent to the kube-dns service IP.

  1. The kube-dns service IP is local on the “nodelocaldns” interface. However, in the nat OUTPUT chain, the packet gets DNAT’ed to a kube-dns pod IP. The NOTRACK rules will not help here, since those are in the raw table’s PREROUTING chain.
  2. Once the packet gets DNAT’ed, if it picked a kube-dns pod on a different node, it is sent there with the kube-dns service IP as the source IP, leaving from the nodelocaldns interface.
  3. The kube-dns pod on the other node gets the packet and replies to the kube-dns service IP. That reply is delivered locally, since that node also has a nodelocaldns interface with the same IP, so the reply packet never leaves that node at all.

In step 2, if a kube-dns pod had been running on the same node and the packet had been DNAT’ed to that pod IP, it would have worked; otherwise it does not.
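
What makes step 3 possible is that node-cache assigns the kube-dns service IP to the nodelocaldns dummy interface on every node, so the kernel treats it as a local address. A quick way to see this on a node (illustrative only, again assuming the service IP 10.96.3.45):

# The kube-dns service IP is configured on the dummy interface...
ip addr show dev nodelocaldns
# ...so the kernel resolves it via the local routing table and never
# forwards packets addressed to it off the node.
ip route get 10.96.3.45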

There are 2 solutions to this:

  1. As @pacoxu suggested, add the rule:
iptables -t nat -I OUTPUT 1 -d <kube-dns ip> -o lo -j RETURN

This will ensure the packet is not DNAT’ed to the kube-dns pod IP and is therefore delivered locally. However, this rule has to be the first one in the OUTPUT chain, above the rules added by kube-proxy.

  2. Send the packet out, but use the node IP:
iptables -t nat -A POSTROUTING -s <kube-dns ip> -j MASQUERADE

Option 1 seems to be the better solution, since requests will continue to use the node-local cache in that case, and there will be fewer address-translation steps. I will add this to the node-cache code and try it out.

I met a similar issue, and my solution was to add the OUTPUT iptables rule below. @axot

iptables -t nat -I OUTPUT 1 -d 10.96.3.45 -o lo -j RETURN

https://github.com/coredns/coredns/issues/3097#issuecomment-520290887

The issue with the nat OUTPUT rule is that we have to constantly ensure that the rule is the first one in the chain; kube-proxy will also be installing its jump to the KUBE-SERVICES chain as the first rule. So we could still have temporary timeouts during the intervals when the kube-proxy rules are above the nodelocaldns rules. Did you observe this, @pacoxu?
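
For reference, the ordering can be checked on a node like this (illustrative; the RETURN rule for the kube-dns service IP has to stay above kube-proxy’s jump to KUBE-SERVICES):

# List the nat OUTPUT chain with rule positions.
iptables -t nat -L OUTPUT -n --line-numbers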

The best option here in my opinion is to use the NOTRACK action like we have done in all the other chains:

iptables -t raw -I OUTPUT -d <nodelocal listen ip> -j NOTRACK

This will ensure that locally generated packets meant for nodelocaldns are not connection tracked/NAT’ed. I just tried this, and it seems to work.
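
For completeness, a minimal sketch of the raw-table rules with this addition, next to the PREROUTING rule mentioned above. This is only a sketch, not the exact rule set node-cache installs; 10.96.3.45 stands in for the kube-dns service IP from the earlier comment.

# Existing behaviour: skip conntrack for DNS packets from pods on this
# node that arrive for the kube-dns service IP (raw PREROUTING).
iptables -t raw -I PREROUTING -d 10.96.3.45 -p udp --dport 53 -j NOTRACK
# New rule: also skip conntrack for locally generated DNS packets from
# hostNetwork pods (raw OUTPUT), so they are never DNAT'ed and are
# delivered to node-cache on the local interface.
iptables -t raw -I OUTPUT -d 10.96.3.45 -p udp --dport 53 -j NOTRACK

Because these live in the raw table, they do not compete with kube-proxy’s nat rules for position, which avoids the ordering problem described above.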