kubernetes: [1.14-beta2] CoreDNS immediately crashes if Kubernetes API is unavailable

What happened:

In Kubernetes 1.14 beta2, if kube-apiserver is briefly unavailable (e.g. during a restart), all CoreDNS pods crash, causing a short DNS outage.

What you expected to happen:

As in Kubernetes 1.13, if kube-apiserver is down, CoreDNS should continue to run and provide DNS service.

How to reproduce it (as minimally and precisely as possible):

  • Create a cluster with kubeadm & Calico (any CNI should work).
  • Wait for CoreDNS to start.
  • Restart kube-apiserver.
  • CoreDNS crashes (it didn't crash in Kubernetes 1.13 using the same steps).
# POD_CIDR=10.244.0.0/16
# kubeadm init --pod-network-cidr $POD_CIDR --kubernetes-version v1.14.0-beta.2
[...]
# curl https://docs.projectcalico.org/v3.6/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml | sed -e "s?192.168.0.0/16?$POD_CIDR?g" | kubectl apply -f -
# kubectl get po -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
[...]
coredns-fb8b8dccf-k8hc2                    1/1     Running   0          26m
coredns-fb8b8dccf-wfbwg                    1/1     Running   0          26m
[...]
# docker rm -f k8s_kube-apiserver_kube-apiserver-node-3_kube-system_1e79449d50c9f3add3dd82d2706ed2f3_0
# kubectl get po -n kube-system
NAME                                       READY   STATUS             RESTARTS   AGE
coredns-fb8b8dccf-k8hc2                    0/1     Running            1          27m
coredns-fb8b8dccf-wfbwg                    0/1     CrashLoopBackOff   1          27m

Anything else we need to know?:

Log of crashed CoreDNS:

# docker logs k8s_coredns_coredns-fb8b8dccf-wfbwg_kube-system_d872299d-4745-11e9-9462-080027c5f494_1
E0315 17:41:50.400737       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: connection refused
E0315 17:41:50.400737       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: connection refused
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-fb8b8dccf-wfbwg.unknownuser.log.ERROR.20190315-174150.1: no such file or directory

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0-beta.2.66+846a82fecc6959", GitCommit:"846a82fecc69594712040f715d5447bcd445b9c2", GitTreeState:"clean", BuildDate:"2019-03-15T09:35:50Z", GoVersion:"go1.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0-beta.2", GitCommit:"b1e389e6f7bd798a8dd162f82b918f509ac5291b", GitTreeState:"clean", BuildDate:"2019-03-12T18:01:33Z", GoVersion:"go1.12", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Bare-metal (virtualbox)
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
Linux node-3 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
# docker run --rm -ti k8s.gcr.io/coredns:1.3.1 --version
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c

/sig network

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 54 (34 by maintainers)

Most upvoted comments

I had the same problem too:

[root@k8s-master01 ~]# kubectl -n kube-system get pod
NAME                                       READY   STATUS             RESTARTS   AGE
calico-kube-controllers-5cbcccc885-jk6jv   1/1     Running            3          72m
calico-node-2h9kq                          1/1     Running            1          68m
calico-node-68tgh                          1/1     Running            1          69m
calico-node-rk44m                          1/1     Running            1          67m
calico-node-rxvsh                          1/1     Running            1          72m
coredns-8686dcc4fd-jqrr5                   0/1     CrashLoopBackOff   15         80m
coredns-8686dcc4fd-tkc2k                   0/1     CrashLoopBackOff   15         80m
etcd-k8s-master01                          1/1     Running            1          80m
etcd-k8s-master02                          1/1     Running            1          69m
etcd-k8s-master03                          1/1     Running            1          68m
kube-apiserver-k8s-master01                1/1     Running            3          80m
kube-apiserver-k8s-master02                1/1     Running            2          69m
kube-apiserver-k8s-master03                1/1     Running            1          68m
kube-controller-manager-k8s-master01       1/1     Running            4          80m
kube-controller-manager-k8s-master02       1/1     Running            2          68m
kube-controller-manager-k8s-master03       1/1     Running            1          68m
kube-proxy-csgvp                           1/1     Running            1          67m
kube-proxy-jtqnj                           1/1     Running            1          68m
kube-proxy-w6j2t                           1/1     Running            1          80m
kube-proxy-wtd2t                           1/1     Running            1          69m
kube-scheduler-k8s-master01                1/1     Running            4          80m
kube-scheduler-k8s-master02                1/1     Running            1          67m
kube-scheduler-k8s-master03                1/1     Running            1          68m

the error logs

[root@k8s-master01 ~]# kubectl -n kube-system logs -f coredns-8686dcc4fd-jqrr5
.:53
2019-04-20T15:58:59.413Z [INFO] CoreDNS-1.3.1
2019-04-20T15:58:59.413Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
2019-04-20T15:58:59.413Z [INFO] plugin/reload: Running configuration MD5 = 599b9eb76b8c147408aed6a0bbe0f669
E0420 15:59:24.412593       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0420 15:59:24.412593       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-8686dcc4fd-jqrr5.unknownuser.log.ERROR.20190420-155924.1: no such file or directory
[root@k8s-master01 ~]# 

the env

[root@k8s-master01 ~]# docker images | grep coredns
registry.aliyuncs.com/google_containers/coredns                   1.3.1               eb516548c180        3 months ago        40.3MB
[root@k8s-master01 ~]# docker images | grep api
registry.aliyuncs.com/google_containers/kube-apiserver            v1.14.1             ecf910f40d6e        3 weeks ago         210MB
[root@k8s-master01 ~]# docker images | grep calico
calico/node                                                       v3.6.1              b4d7c4247c3a        3 weeks ago         73.2MB
calico/cni                                                        v3.6.1              c7d27197e298        3 weeks ago         84.3MB
calico/kube-controllers                                           v3.6.1              0bd1f99c7034        3 weeks ago         50.9MB
[root@k8s-master01 ~]# docker --version
Docker version 18.06.2-ce, build 6d37f41
[root@k8s-master01 ~]# 
[root@k8s-master01 ~]# uname -sr
Linux 3.10.0-957.10.1.el7.x86_64
[root@k8s-master01 ~]# cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 
[root@k8s-master01 ~]# 

I tried configuring an emptyDir, without effect:

https://www.reddit.com/r/kubernetes/comments/bbok8w/coredns_fails_on_node/

kubectl -n kube-system patch deployment coredns --patch '{"spec":{"template":{"spec":{"volumes":[{"name":"emptydir-tmp","emptyDir":{}}],"containers":[{"name":"coredns","volumeMounts":[{"name":"emptydir-tmp","mountPath":"/tmp"}]}]}}}}'

Updating CoreDNS to 1.5.0 resolved my problem:

https://github.com/coredns/deployment/blob/master/kubernetes/Upgrading_CoreDNS.md

Updating CoreDNS to 1.4.0 has no effect either; you may have to update it to 1.5.0.

the result

[root@k8s-master01 ~]# kubectl -n kube-system get pod 
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-5cbcccc885-926xm   1/1     Running   5          79m
calico-node-5dpgh                          1/1     Running   3          76m
calico-node-6wchl                          1/1     Running   3          74m
calico-node-mb88j                          1/1     Running   3          78m
calico-node-vv2jn                          1/1     Running   3          79m
coredns-66ff4bdb7d-8g6cb                   1/1     Running   1          28m
coredns-66ff4bdb7d-qv94m                   1/1     Running   1          28m
etcd-k8s-master01                          1/1     Running   3          80m
etcd-k8s-master02                          1/1     Running   3          78m
etcd-k8s-master03                          1/1     Running   3          76m
kube-apiserver-k8s-master01                1/1     Running   4          80m
kube-apiserver-k8s-master02                1/1     Running   3          78m
kube-apiserver-k8s-master03                1/1     Running   5          76m
kube-controller-manager-k8s-master01       1/1     Running   5          80m
kube-controller-manager-k8s-master02       1/1     Running   4          78m
kube-controller-manager-k8s-master03       1/1     Running   3          76m
kube-proxy-2bzw8                           1/1     Running   3          81m
kube-proxy-lmg4n                           1/1     Running   3          76m
kube-proxy-vgg24                           1/1     Running   3          78m
kube-proxy-vwstb                           1/1     Running   3          74m
kube-scheduler-k8s-master01                1/1     Running   5          80m
kube-scheduler-k8s-master02                1/1     Running   3          78m
kube-scheduler-k8s-master03                1/1     Running   3          76m

the test

[root@k8s-master01 ~]# kubectl -n kube-system get svc
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   81m


[root@k8s-master01 ~]# kubectl exec -it alpine nslookup nginx-svc
nslookup: can't resolve '(null)': Name does not resolve

Name:      nginx-svc
Address 1: 10.107.140.184
[root@k8s-master01 ~]# kubectl -n kube-system get svc   
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   86m

but the same error still appears in the log, and I don't know why

IMO, 1.5.0, barring unforeseen issues, whenever it arrives.

But until then, I think there are three equally stable options:

  1. Fall back to 1.3.0.
  2. Or use 1.4.0, with the reload plugin removed from the Corefile.
  3. Or use 1.3.1 with workaround of mounting an EmptyDir to /tmp.

Edit: Moving to 1.5.0 may require editing the Corefile, replacing proxy with forward.
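For reference, the Corefile edit mentioned above is a one-keyword swap in the kubeadm-generated Corefile (a sketch; the exact upstream resolver argument may differ in your cluster):

```
# Before (CoreDNS <= 1.4.x, proxy plugin):
proxy . /etc/resolv.conf

# After (CoreDNS 1.5.0, where the proxy plugin was removed):
forward . /etc/resolv.conf
```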

I mitigate the klog issue by mounting an EmptyDir volume to /tmp.
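The same workaround, expressed as a Deployment spec fragment rather than a `kubectl patch` (equivalent to the patch command posted earlier; the volume name `emptydir-tmp` is arbitrary):

```yaml
# Fragment of the coredns Deployment in kube-system
spec:
  template:
    spec:
      volumes:
        - name: emptydir-tmp    # arbitrary name, must match the mount below
          emptyDir: {}
      containers:
        - name: coredns
          volumeMounts:
            - name: emptydir-tmp
              mountPath: /tmp   # klog writes its fallback error logs here
```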

On Tue, Mar 19, 2019 at 9:22 PM, Richard Theis notifications@github.com wrote:

@chrisohaver https://github.com/chrisohaver well that certainly presents support problems for Kubernetes’ default cluster DNS. Is the CoreDNS project considering increasing its support window?


The reload bug fix will be included in CoreDNS 1.4.1 release. But I think the 1.4.1 release also removes an option that is used in the default CoreDNS configuration (which would render it invalid). So we need to be careful there. Probably not where we want to go in a patch release.

We’ll see where the discussion in coredns/coredns#2708 goes. Breaking new ground by releasing a 1.3.2 with the klog fix may be the best option.

Well that certainly presents support problems for Kubernetes’ default cluster DNS.

I agree. Just letting you know project’s history. While the history cannot change, things could change going forward if there is demand. IMO, the best place to help create demand for this would be to open an issue in CoreDNS.

There is no restart with kube-dns; the DNS service continues to answer while kube-apiserver is down.

# POD_CIDR=10.244.0.0/16
# cat > /etc/kubernetes/kubeadm-config.yaml << EOF
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: v1.14.0-beta.2
networking:
  podSubnet: "$POD_CIDR"
dns:
  type: "kube-dns"
EOF
# kubeadm init --config /etc/kubernetes/kubeadm-config.yaml
[...]
# curl https://docs.projectcalico.org/v3.6/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml | sed -e "s?192.168.0.0/16?$POD_CIDR?g" | kubectl apply -f -
[...]
# kubectl get po -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
[...]
kube-dns-77c76db4b4-wczfp                  3/3     Running   0          3m9s
# docker restart k8s_kube-apiserver_kube-apiserver-node-1_kube-system_d0e0ac765c33aab02537d27fdddd65fa_0
# kubectl get po -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
[...]
kube-dns-77c76db4b4-wczfp                  3/3     Running   0          4m7s

The DNS service continues to respond while kube-apiserver is down:

# mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp  # Force apiserver to stay down
# docker ps | grep kube-apiserver
[ nothing ]
# dig +short kubernetes.default.svc.cluster.local @10.96.0.10
10.96.0.1