kubernetes: dnsPolicy in hostNetwork not working as expected

What happened: In Kubernetes 1.17, pods running with hostNetwork: true are not able to get DNS answers from the CoreDNS Service, especially when using the strongly recommended dnsPolicy: ClusterFirstWithHostNet.

Also, I noticed that the CoreDNS Service is not always reachable from the host itself.

What you expected to happen: The CoreDNS Service is reachable from within a pod in the host network, especially when using dnsPolicy: ClusterFirstWithHostNet. Also, the CoreDNS Service is reachable from the host itself, as it is in Kubernetes 1.15.

How to reproduce it (as minimally and precisely as possible):

# kubectl -n kube-system get svc
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   97d
# dig @10.96.0.10 kubernetes.io

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @10.96.0.10 kubernetes.io
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
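
To narrow down whether only the Service VIP is affected or CoreDNS itself, one can query a CoreDNS pod IP directly (the pod IP below is just a placeholder and has to be taken from the endpoints output):

# kubectl -n kube-system get endpoints kube-dns
# dig @10.244.0.2 kubernetes.io

If the query against a CoreDNS pod IP works while the Service IP times out, the problem is in the Service VIP path (kube-proxy / overlay network) rather than in CoreDNS itself.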
# kubectl apply -f dns-pods-in-host-network.yaml
# cat dns-pods-in-host-network.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: cluster-first
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
  hostNetwork: true
  dnsPolicy: ClusterFirst
---
apiVersion: v1
kind: Pod
metadata:
  name: cluster-first-with-hostnet
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
  dnsPolicy: ClusterFirstWithHostNet
  hostNetwork: true
root@master:/tmp# kubectl exec -ti cluster-first -- nslookup kubernetes.io
Server:         1.1.1.1
Address:        1.1.1.1#53

Non-authoritative answer:
Name:   kubernetes.io
Address: 147.75.40.148

root@master:/tmp# kubectl exec -ti cluster-first-with-hostnet -- nslookup kubernetes.io
;; connection timed out; no servers could be reached

command terminated with exit code 1
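
Checking /etc/resolv.conf in the two pods should confirm what is happening: with hostNetwork: true and dnsPolicy: ClusterFirst the pod falls back to the node's resolver (1.1.1.1 above), while ClusterFirstWithHostNet points the pod at the cluster DNS Service IP 10.96.0.10, which is exactly the address that is not reachable:

# kubectl exec -ti cluster-first -- cat /etc/resolv.conf
# kubectl exec -ti cluster-first-with-hostnet -- cat /etc/resolv.conf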

Anything else we need to know?: I noticed this on three small clusters with Kubernetes 1.17, each running with 1 master and 2 or 3 nodes. Most of them were upgraded from lower Kubernetes versions (e.g. 1.13 -> 1.14 -> 1.15 -> 1.16 -> 1.17).

Environment:

  • Kubernetes version (use kubectl version): 1.17
  • Cloud provider or hardware configuration: BareMetal, mostly running on VMware
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a): Linux eins 4.15.0-74-generic #83~16.04.1-Ubuntu SMP Wed Dec 18 04:56:23 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug): flannel

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 22 (11 by maintainers)

Most upvoted comments

Did find a workaround: switching the flannel backend from vxlan to host-gw: https://github.com/coreos/flannel/issues/1245#issuecomment-582612891

kubectl edit cm -n kube-system kube-flannel-cfg
  • replace vxlan with host-gw (see the snippet below)
  • save
  • not sure if needed, but I did it anyway: kubectl delete pods -l app=flannel -n kube-system
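
For reference, a sketch of the relevant part of the kube-flannel-cfg ConfigMap after the change, assuming the default kube-flannel manifest layout (the Network CIDR is a placeholder for whatever the cluster uses):

  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "host-gw"
      }
    }

host-gw routes pod traffic directly over the node network instead of encapsulating it in vxlan, which is why the switch side-steps the problem. Note that host-gw requires all nodes to be on the same layer-2 network.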

Seeing the same thing when trying to run kiam on 1.17; I've seen the issue from at least rc.2 through 1.17.3, but wasn't sure at the time where the issue was.

Ticket I logged with the kiam folks: uswitch/kiam#378
Ticket I logged with the kops folks: kubernetes/kops#8562

As this seems networking-related, I will note that we are running the Canal CNI and kube-proxy in IPVS mode.

(Correction, we were on Canal, not Calico, which means the mentioned Flannel issue is likely at the root of it…)

I have kind of the same issue. … only the pod that is on the same host as the master (where the Service is pointing to) can reach the Service. … using the Service IP I can't curl it; using the pod IP I can curl it directly.

That sounds more like a connectivity problem, maybe related to the CNI, maybe related to iptables, i.e. that you are not able to access the Services from the nodes 🤷‍♂
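
One quick check is whether the Service VIP is programmed on the node at all, e.g. with kube-proxy in IPVS mode:

# ipvsadm -Ln | grep -A 3 10.96.0.10

or with kube-proxy in iptables mode:

# iptables-save | grep 10.96.0.10

If the rules are there but the dig against the Service IP still times out, the packets are being dropped on the data path (e.g. in the vxlan overlay) rather than by kube-proxy.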

If I do a kubectl run and the pod ends up on the same node: zero issues. I will try whether I can point the pod's DNS at the CoreDNS pod IP directly and see if that connects.
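
For that test, a minimal sketch of a pod spec that bypasses the Service VIP and talks to a CoreDNS pod directly (pod name and nameserver IP are placeholders; the nameserver has to be an actual CoreDNS pod IP, e.g. taken from kubectl -n kube-system get endpoints kube-dns):

apiVersion: v1
kind: Pod
metadata:
  name: dns-via-pod-ip
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    command:
      - sleep
      - "3600"
  hostNetwork: true
  dnsPolicy: None              # skip the cluster DNS Service entirely
  dnsConfig:
    nameservers:
      - 10.244.0.2             # placeholder: a CoreDNS pod IP
    searches:
      - default.svc.cluster.local
      - svc.cluster.local
      - cluster.local

If lookups work with the CoreDNS pod IP but not with the Service IP 10.96.0.10, that again points at the Service VIP / overlay path.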