coredns: CoreDNS pods panic with no probe failures
What happened:
Recently, we have seen CoreDNS pods panic with no probe failures. The pod logs show a panic and an error, yet there are no liveness or readiness probe failures. As a result, the pod stops working but is never restarted, leading to DNS failures within the cluster.
Pod liveness probe:
livenessProbe:
  failureThreshold: 5
  httpGet:
    path: /health
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
Pod readiness probe:
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /ready
    port: 8181
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
What you expected to happen:
The CoreDNS liveness probe detects that the pod has failed and the pod is restarted.
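For context on why a panic does not necessarily trip these probes: the health endpoint that the liveness probe hits is served by its own HTTP handler, which does not have to reflect the state of the goroutines that actually serve DNS. Below is a minimal, illustrative Go sketch (not CoreDNS source; the worker and its panic are invented for the example) of how a process can keep answering /health with 200 OK after a worker goroutine has died, so the kubelet never restarts the pod.

package main

import (
    "log"
    "net/http"
    "time"
)

func main() {
    // Invented worker standing in for a DNS-serving goroutine: it panics
    // after a while; the recover keeps the process alive, mimicking a
    // component that dies without taking the whole process down.
    go func() {
        defer func() {
            if r := recover(); r != nil {
                log.Printf("worker died: %v (process and /health keep running)", r)
            }
        }()
        time.Sleep(5 * time.Second)
        panic("simulated plugin panic")
    }()

    // Liveness endpoint: always 200 OK, with no knowledge of the worker
    // above, so the kubelet sees a healthy pod even after the worker is gone.
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("OK"))
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

In other words, a probe against /health only proves the health handler is alive, not that DNS queries are being served.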
How to reproduce it (as minimally and precisely as possible):
We don’t have an easy way to reproduce it. The problem may be triggered by master network connectivity flakes.
Anything else we need to know?:
No.
Environment:
- the version of CoreDNS: 1.6.7 and 1.6.9
- Corefile: See below
- logs, if applicable: Not available
- OS (e.g. cat /etc/os-release): Ubuntu 18.04.4 LTS
- Others: Failures seen on IBM Cloud Kubernetes Service clusters running Kubernetes versions 1.16 and 1.17.
.:53 {
    errors
    health {
        lameduck 10s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
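The ports in the probes above match the plugin defaults in this Corefile: the health plugin listens on :8080 and the ready plugin on :8181. The following sketch checks both endpoints by hand, assuming the pod's ports have been forwarded to localhost (the localhost URLs are that assumption, not something from the issue):

package main

import (
    "fmt"
    "net/http"
)

func main() {
    // Assumed setup: kubectl port-forward (or an in-cluster shell) makes
    // the pod's health and ready ports reachable on localhost.
    endpoints := []string{
        "http://localhost:8080/health", // health plugin, used by the liveness probe
        "http://localhost:8181/ready",  // ready plugin, used by the readiness probe
    }
    for _, url := range endpoints {
        resp, err := http.Get(url)
        if err != nil {
            fmt.Printf("%s: %v\n", url, err)
            continue
        }
        fmt.Printf("%s: %s\n", url, resp.Status)
        resp.Body.Close()
    }
}

A healthy pod returns 200 OK from both; the problem reported here is that /health can stay OK even when the pod no longer resolves names.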
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 34 (19 by maintainers)
Hi,
I have been following this ticket for a few days now to figure out if it is the same issue that we have as well. After @chrisohaver posted the last stacktrace I am pretty certain that we are observing the same thing, so I will add our observations. If you think that this is a different issue, please let me know and I will create a new one.
We recently upgraded coredns from 1.6.5 to 1.6.7, and while debugging some DNS issues in our test suite (not the conformance tests) we noticed that occasionally new headless services never became resolvable via coredns.
In the coredns pods we found the following, which looks a lot like the log from @chrisohaver to me:
Restarting the coredns pods immediately fixed it.
We downgraded back to 1.6.5, but to our surprise we have now seen the same happening there. It seems like we did not notice it before.
So far we also have seen this only in combination with headless services.
In this thread the issue has been roughly correlated to a flaky network. I would say that this could be the cause/trigger in our case as well, because our underlying SDN has some known (but hard to solve) issues.
We also have a lot of
2020-04-29T12:37:52.054149918Z [ERROR] plugin/errors: 2 . NS: dns: overflow unpacking uint32
messages in the log, but I have no idea if they have anything to do with the rest. We run ~300 coredns instances, so I will try to collect some more logs to get a better understanding of how often this happens and to see if I notice anything that could be of help.
If you can think of anything I should look for or I could do to help, let me know.
Cheers Max
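As an aside on the "overflow unpacking uint32" messages quoted above: the "dns:" prefix points at the wire-format parser in github.com/miekg/dns, the DNS library CoreDNS is built on, which returns such errors when a message is shorter than its header and record fields claim. The rough Go sketch below shows how truncated wire data produces that family of errors; which exact field overflows depends on where the truncation falls, so this is illustrative only, not a reconstruction of the reporter's traffic.

package main

import (
    "fmt"

    "github.com/miekg/dns"
)

func main() {
    // Build a small, valid response with one A record, then unpack
    // progressively truncated copies of its wire form to show the kind
    // of "overflow unpacking ..." errors the parser returns.
    m := new(dns.Msg)
    m.SetQuestion("example.org.", dns.TypeA)
    rr, err := dns.NewRR("example.org. 30 IN A 192.0.2.1")
    if err != nil {
        panic(err)
    }
    m.Answer = append(m.Answer, rr)

    wire, err := m.Pack()
    if err != nil {
        panic(err)
    }

    for cut := 1; cut <= 12; cut++ {
        truncated := wire[:len(wire)-cut]
        var r dns.Msg
        if err := r.Unpack(truncated); err != nil {
            fmt.Printf("truncated by %2d bytes: %v\n", cut, err)
        }
    }
}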
Thanks all. I think I’ve worked out the proper fix in #3887.
@rtheis, this was discussed in #3925 … I understand that the coredns release process currently supports a single branch, and extending it to support multiple release branches is non-trivial.
1.7.0 is backward incompatible, but not heavily so …
The backward incompatibility is foremost the metrics renaming, which will require updating PromQL in reporting apps that reference the affected metrics. Unfortunately, the old metric names don’t exist alongside the new ones in any release. Without a deprecation grace period, some coredns reporting could be broken until the formulas are updated. IOW, the 1.7.0 metric name changes can break your reporting, but they shouldn’t break DNS function.
1.7.0 also removes some options from the kubernetes plugin. Those options had already been deprecated and ignored for several releases, but if they are still present in a Corefile, CoreDNS 1.7.0 will not start.
I think the plan is to release 1.7.0 “soon”, instead of another 1.6.x.
IMO, the known network flakes during the test probably caused an issue in the test cluster that prevented CoreDNS from receiving DNS queries. I don’t think there is enough to go on at this time. If this re-occurs, and we can show that CoreDNS pods are receiving DNS requests during these test failures, then we should re-open.