coredns: Kubernetes - SERVFAIL if we cycle our apiserver certs

We cycle our apiserver certs on a regular basis. While this happens, CoreDNS is not able to answer DNS queries.

Hard facts:

    Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-10T23:35:51Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.5", GitCommit:"753b2dbc622f5cc417845f0ff8a77f539a4213ea", GitTreeState:"clean", BuildDate:"2018-11-26T14:31:35Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

    .:53
    2019-01-14T21:20:42.732Z [INFO] CoreDNS-1.3.1
    2019-01-14T21:20:42.732Z [INFO] linux/amd64, go1.11.4, 6b56a9c
    CoreDNS-1.3.1
    linux/amd64, go1.11.4, 6b56a9c

Config:

    .:53 {
        errors
        health
        autopath @kubernetes
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods verified
          upstream
          fallthrough in-addr.arpa ip6.arpa
          ttl 5
        }
        prometheus :9153
        proxy . /etc/resolv.conf {
          policy sequential
        }
        cache 300
        reload
    }

Below is a screenshot of a log visualization covering the last 24 hours. The green bars are apiserver restarts; the other colors are various failing DNS queries. The two restarts on the right are of the standby master 2; the restart of the active master is barely visible between the DNS errors.

screenshot from 2019-01-15 08-47-12

Two other people in the Slack channel have the same problem. Maybe they can post their findings here…

Greetings, Max

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 4
  • Comments: 29 (14 by maintainers)

Most upvoted comments

We are observing the same behaviour. We currently run Kubernetes 1.12.3 (with clusters on both AWS and GCP) and we restart kubelets daily to rotate certs, which includes restarting the kubelet systemd unit.

Our CoreDNS config looks like:

    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          endpoint_pod_names
          upstream
          fallthrough in-addr.arpa ip6.arpa
          ttl 5
        }

        prometheus :9153
        proxy . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }

The way to reproduce the problem is simply to restart the kubelets on the master nodes sequentially. Also note that just deleting the kube-apiserver pods doesn’t seem to cause an issue.

The only interesting log lines I could spot are:

    E0111 15:00:26.752836       1 reflector.go:322] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to watch *v1.Namespace: Get https://10.3.0.1:443/api/v1/namespaces?resourceVersion=140762277&timeoutSeconds=559&watch=true: dial tcp 10.3.0.1:443: connect: connection refused
    E0111 15:00:26.741853       1 reflector.go:322] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to watch *v1.Endpoints: Get https://10.3.0.1:443/api/v1/endpoints?resourceVersion=140765670&timeoutSeconds=347&watch=true: dial tcp 10.3.0.1:443: connect: connection refused
    ERROR: logging before flag.Parse: E0111 15:00:34.016793   12065 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=9, ErrCode=NO_ERROR, debug=""
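
The failed requests above come from the client-go reflectors that the kubernetes plugin uses to watch Namespaces and Endpoints. As a rough way to check whether such a watch recovers after an apiserver restart, here is a minimal client-go sketch (not CoreDNS code; it assumes in-cluster credentials and RBAC permission to list/watch Endpoints):

    // Minimal client-go sketch: establish an Endpoints watch the same way the
    // kubernetes plugin does, and log updates so you can see whether the watch
    // survives an apiserver restart.
    package main

    import (
        "log"
        "time"

        v1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/cache"
    )

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatalf("in-cluster config: %v", err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // Shared informer factory; reflector errors like the ones above are
        // produced by the watch loops this starts.
        factory := informers.NewSharedInformerFactory(client, 5*time.Minute)
        inf := factory.Core().V1().Endpoints().Informer()
        inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(_, obj interface{}) {
                ep := obj.(*v1.Endpoints)
                log.Printf("endpoints updated: %s/%s", ep.Namespace, ep.Name)
            },
        })

        stop := make(chan struct{})
        factory.Start(stop)
        if !cache.WaitForCacheSync(stop, inf.HasSynced) {
            log.Fatal("endpoints cache never synced")
        }
        log.Println("synced; now restart the apiserver and watch for gaps in updates")
        select {}
    }

Running something like this alongside CoreDNS while restarting the kubelet/apiserver should show whether updates merely pause and resume, or stop entirely.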

During the time of failure we cannot resolve anything; dig gives:

    dnstools# dig prometheus.sys-prom.svc.cluster.local

    ; <<>> DiG 9.11.3 <<>> prometheus.sys-prom.svc.cluster.local
    ;; global options: +cmd
    ;; connection timed out; no servers could be reached
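
For a timestamped view of the same check, e.g. to correlate failures with apiserver restarts as in the graph above, a small probe loop can query the cluster DNS directly. This is only a sketch: the CoreDNS service address 10.3.0.10:53 is an assumed placeholder, and the queried name is the one from the dig above.

    // Minimal DNS probe sketch: query the cluster DNS in a loop and log
    // failures with timestamps. Substitute your CoreDNS service IP and a
    // service that exists in your cluster.
    package main

    import (
        "context"
        "log"
        "net"
        "time"
    )

    func main() {
        const dnsAddr = "10.3.0.10:53"                       // assumed CoreDNS ClusterIP
        const name = "prometheus.sys-prom.svc.cluster.local" // name from the dig above

        r := &net.Resolver{
            PreferGo: true,
            Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
                d := net.Dialer{Timeout: 2 * time.Second}
                return d.DialContext(ctx, network, dnsAddr)
            },
        }

        for {
            ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
            ips, err := r.LookupHost(ctx, name)
            cancel()
            if err != nil {
                log.Printf("FAIL %s: %v", name, err)
            } else {
                log.Printf("OK   %s -> %v", name, ips)
            }
            time.Sleep(5 * time.Second)
        }
    }

During the failure window this should start logging timeouts, matching the dig output above.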

It looks like the stream watcher hits some kind of error it cannot handle, which results in CoreDNS failing to respond.

@ffilippopoulos I think it’s important to note that restarting our kubelet on a master also does an explicit docker restart of the api-server component: