coredns: CoreDNS pods panic with no probe failures

What happened:

Recently, we have seen CoreDNS pods panic without any probe failures. The pod logs show a panic and an error, yet there are no liveness or readiness probe failures. As a result, the pod stops working but is never restarted, leading to DNS failures within the cluster.

Pod liveness probe:

    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5

Pod readiness probe:

    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /ready
        port: 8181
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1

What you expected to happen:

The CoreDNS liveness probe detects that the CoreDNS pod has failed and the pod is restarted.
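
The symptoms above, together with the stack traces later in this thread, suggest why this doesn't happen: the panic appears to occur in the kubernetes plugin's background informer goroutine and to be recovered inside the informer machinery (a runtime.HandleCrash frame is visible at the top of the trace), so the CoreDNS process itself keeps running. The health plugin only reports on the process, not on the informers, which would explain why neither the liveness nor the readiness probe ever fails. As a rough, self-contained sketch of that failure mode in plain Go (not CoreDNS code; the types and handler are purely illustrative):

    // A recovered panic in a background worker goroutine does not affect an
    // HTTP health endpoint, so an httpGet liveness probe keeps succeeding
    // even though the worker is dead.
    package main

    import (
        "log"
        "net/http"
    )

    func worker() {
        defer func() {
            // Analogous to the crash handling seen in the trace: log the panic
            // and let only this goroutine die; the process keeps running.
            if r := recover(); r != nil {
                log.Printf("observed a panic: %v", r)
            }
        }()
        var endpoints *struct{ Namespace string }
        _ = endpoints.Namespace // nil pointer dereference, like GetNamespace on a nil *Endpoints
    }

    func main() {
        go worker()

        // The /health handler knows nothing about the worker goroutine,
        // so a liveness probe against :8080/health never fails.
        http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("OK"))
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

If something like this is what is happening, only a probe that actually exercises DNS (or the kubernetes plugin's watch state) would catch the failure.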

How to reproduce it (as minimally and precisely as possible):

We don’t have an easy way to recreate this. The problem may be triggered by flaky network connectivity to the master.

Anything else we need to know?:

No.

Environment:

  • the version of CoreDNS: 1.6.7 and 1.6.9
  • Corefile: See below
  • logs, if applicable: Not available
  • OS (e.g: cat /etc/os-release): Ubuntu 18.04.4 LTS
  • Others: Failures seen on IBM Cloud Kubernetes Service clusters running Kubernetes versions 1.16 and 1.17.

Corefile:

    .:53 {
        errors
        health {
            lameduck 10s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
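
For completeness, in the Corefile above the health plugin serves /health on its default port 8080 and the ready plugin serves /ready on its default port 8181, which is what the probes shown earlier point at. A small, hypothetical helper (not part of CoreDNS; the pod IP below is a placeholder) can poll both endpoints of a suspect pod directly, to confirm that they still return 200 even while DNS lookups through that pod are failing:

    // Hypothetical check: query a CoreDNS pod's health and ready endpoints
    // directly, bypassing the kubelet probes.
    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        podIP := "10.0.0.10" // placeholder: substitute the suspect CoreDNS pod's IP
        urls := []string{
            fmt.Sprintf("http://%s:8080/health", podIP), // health plugin (liveness probe port)
            fmt.Sprintf("http://%s:8181/ready", podIP),  // ready plugin (readiness probe port)
        }
        client := &http.Client{Timeout: 2 * time.Second}
        for _, u := range urls {
            resp, err := client.Get(u)
            if err != nil {
                fmt.Printf("%s: error: %v\n", u, err)
                continue
            }
            fmt.Printf("%s: %s\n", u, resp.Status)
            resp.Body.Close()
        }
    }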

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 34 (19 by maintainers)

Most upvoted comments

Hi,

I have been following this ticket for a few days now to figure out whether it is the same issue that we are seeing. After @chrisohaver posted the last stack trace I am pretty certain that we are observing the same thing, so I will add our observations. If you think this is a different issue, please let me know and I will create a new one.

We recently upgraded CoreDNS from 1.6.5 to 1.6.7, and while debugging some DNS issues in our test suite (not the conformance tests) we noticed that occasionally a new headless service never became resolvable via CoreDNS.

In the coredns pods we found the following, which looks a lot like the log from @chrisohaver to me:

2020-04-29T12:34:32.844045604Z W0429 12:34:32.843901       1 reflector.go:326] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: watch of *v1.Endpoints ended with: very short watch: pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Unexpected watch close - watch lasted less than a second and no items received
2020-04-29T12:34:32.845138797Z W0429 12:34:32.845082       1 reflector.go:326] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: watch of *v1.Namespace ended with: very short watch: pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Unexpected watch close - watch lasted less than a second and no items received
2020-04-29T12:34:32.848716771Z W0429 12:34:32.848631       1 reflector.go:326] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: watch of *v1.Service ended with: very short watch: pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Unexpected watch close - watch lasted less than a second and no items received
2020-04-29T12:34:35.864452626Z E0429 12:34:35.864235       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Endpoints: Get "https://10.240.16.1:443/api/v1/endpoints?limit=500&resourceVersion=0": EOF
2020-04-29T12:34:42.010802025Z E0429 12:34:42.010690       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2020-04-29T12:34:42.010819324Z goroutine 49 [running]:
2020-04-29T12:34:42.010823824Z k8s.io/apimachinery/pkg/util/runtime.logPanic(0x194e000, 0x2cdd240)
2020-04-29T12:34:42.010834724Z 	/tmp/tmp.BwnuHrciKQ/g/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/runtime/runtime.go:74 +0xa3
2020-04-29T12:34:42.010838424Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
2020-04-29T12:34:42.010841624Z 	/tmp/tmp.BwnuHrciKQ/g/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/runtime/runtime.go:48 +0x82
2020-04-29T12:34:42.010844624Z panic(0x194e000, 0x2cdd240)
2020-04-29T12:34:42.010847424Z 	/tmp/tmp.BwnuHrciKQ/go/src/runtime/panic.go:967 +0x166
2020-04-29T12:34:42.010850324Z github.com/coredns/coredns/plugin/kubernetes/object.(*Endpoints).GetNamespace(0x0, 0x0, 0x7fc255010780)
2020-04-29T12:34:42.010853424Z 	/tmp/tmp.BwnuHrciKQ/g/src/github.com/coredns/coredns/plugin/kubernetes/object/endpoint.go:146 +0x5
2020-04-29T12:34:42.010856324Z k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc(0x1b517a0, 0x0, 0x3, 0xc0001ac060, 0x1, 0x1)
2020-04-29T12:34:42.010859224Z 	/tmp/tmp.BwnuHrciKQ/g/pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/store.go:84 +0x105
2020-04-29T12:34:42.010862224Z k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc(0x1b517a0, 0x0, 0x1, 0x1, 0x203000, 0x203000)
2020-04-29T12:34:42.010865224Z 	/tmp/tmp.BwnuHrciKQ/g/pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/controller.go:264 +0x6a
2020-04-29T12:34:42.010868224Z k8s.io/client-go/tools/cache.(*cache).Delete(0xc00024e3e0, 0x1b517a0, 0x0, 0x0, 0x18)
2020-04-29T12:34:42.010871124Z 	/tmp/tmp.BwnuHrciKQ/g/pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/store.go:144 +0x42
2020-04-29T12:34:42.010874024Z github.com/coredns/coredns/plugin/kubernetes.newdnsController.func1.1(0x1970160, 0xc00024f9e0, 0x1, 0xc00024f9e0)
2020-04-29T12:34:42.010877024Z 	/tmp/tmp.BwnuHrciKQ/g/src/github.com/coredns/coredns/plugin/kubernetes/controller.go:167 +0x5af
2020-04-29T12:34:42.010880024Z k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc00053e2c0, 0xc00057ccf0, 0x0, 0x0, 0x0, 0x0)
2020-04-29T12:34:42.010882924Z 	/tmp/tmp.BwnuHrciKQ/g/pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/delta_fifo.go:422 +0x235
2020-04-29T12:34:42.010885824Z k8s.io/client-go/tools/cache.(*controller).processLoop(0xc000481080)
2020-04-29T12:34:42.010888924Z 	/tmp/tmp.BwnuHrciKQ/g/pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/controller.go:153 +0x40
2020-04-29T12:34:42.010891824Z k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000157f80)
2020-04-29T12:34:42.010895224Z 	/tmp/tmp.BwnuHrciKQ/g/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:152 +0x5f
2020-04-29T12:34:42.010898124Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000157f80, 0x3b9aca00, 0x0, 0xc00048f001, 0xc0004862a0)
2020-04-29T12:34:42.010901124Z 	/tmp/tmp.BwnuHrciKQ/g/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:153 +0xf8
2020-04-29T12:34:42.010904024Z k8s.io/apimachinery/pkg/util/wait.Until(...)
2020-04-29T12:34:42.010906924Z 	/tmp/tmp.BwnuHrciKQ/g/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:88
2020-04-29T12:34:42.010909724Z k8s.io/client-go/tools/cache.(*controller).Run(0xc000481080, 0xc0004862a0)
2020-04-29T12:34:42.010915724Z 	/tmp/tmp.BwnuHrciKQ/g/pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/controller.go:125 +0x2c1
2020-04-29T12:34:42.010918824Z created by github.com/coredns/coredns/plugin/kubernetes.(*dnsControl).Run
2020-04-29T12:34:42.010921724Z 	/tmp/tmp.BwnuHrciKQ/g/src/github.com/coredns/coredns/plugin/kubernetes/controller.go:304 +0x12b
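
The (*Endpoints).GetNamespace(0x0, …) frame in the trace looks like the classic Go typed-nil pitfall: a nil *Endpoints (perhaps produced when converting a deleted-object tombstone) ends up inside an interface value, still satisfies the object interface the key function expects, and panics the moment GetNamespace dereferences it. A self-contained sketch of that pattern, using illustrative stand-in types rather than the real CoreDNS/client-go ones:

    // Sketch of the typed-nil pitfall: a nil typed pointer stored in an
    // interface passes a nil check, and a method that dereferences its
    // receiver then panics with "invalid memory address or nil pointer
    // dereference".
    package main

    import "fmt"

    // Endpoints stands in for the real object type; the real accessor also
    // dereferences its receiver.
    type Endpoints struct {
        Namespace string
        Name      string
    }

    func (e *Endpoints) GetNamespace() string { return e.Namespace }

    // keyFunc is a simplified namespace/name key function: it only checks the
    // interface value for nil, which a typed nil pointer slips past.
    func keyFunc(obj interface{}) (string, error) {
        if obj == nil {
            return "", fmt.Errorf("nil object")
        }
        ep := obj.(*Endpoints)
        return ep.GetNamespace() + "/" + ep.Name, nil // panics: ep is a nil *Endpoints
    }

    func main() {
        var ep *Endpoints // typed nil, e.g. a delete event whose object was never converted
        key, err := keyFunc(ep)
        fmt.Println(key, err) // never reached: keyFunc panics on the nil dereference
    }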

Restarting the coredns pods immediately fixed it.

We downgraded back to 1.6.5, but to our surprise we have now seen the same thing happen there as well; it seems we simply had not noticed it before.

So far we have seen this only in combination with headless services.

In this thread the issue has been roughly correlated to a flaky network. I would say that this could be the cause/trigger in our case as well, because our underlying SDN has some known (but hard to solve) issues.

We also see a lot of messages like "2020-04-29T12:37:52.054149918Z [ERROR] plugin/errors: 2 . NS: dns: overflow unpacking uint32" in the log, but I have no idea whether they have anything to do with the rest.

We run ~300 CoreDNS instances, so I will try to collect some more logs to get a better understanding of how often this happens and to see whether I notice anything that could be of help.

If you can think of anything I should look for or I could do to help, let me know.

Cheers, Max

Thanks all. I think I’ve worked out the proper fix in #3887.

Would it be possible to get one more patch release of 1.6 with this fix?

@rtheis, this was discussed in #3925 … I understand that the CoreDNS release process currently supports a single release branch, and extending it to support multiple release branches is non-trivial.

1.7.0 is backward incompatible, but not heavily so …

The backward incompatibility is foremost the metrics renaming, which will require updating PromQL in reporting apps that reference the affected metrics. Unfortunately, the old metric names don’t exist alongside the new ones in any release; without a deprecation grace period, some CoreDNS reporting could be broken until the formulas are updated. IOW, the 1.7.0 metric name changes can break your reporting, but they shouldn’t break DNS function.

1.7.0 also removes some options from the kubernetes plugin. Those options had already been deprecated and ignored for several releases, but if they are still present in the Corefile when running 1.7.0, CoreDNS will not start.

@chrisohaver Do you know whether the fix for this issue will make it into a future 1.6.x patch release?

I think the plan is to release 1.7.0 “soon”, instead of another 1.6.x.

IMO, the known network flakes during the test probably caused an issue in the test cluster that prevented CoreDNS from receiving DNS queries. I don’t think there is enough to go on at this time. If this recurs, and we can show that the CoreDNS pods are receiving DNS requests during these test failures, then we should reopen.