coredns: CoreDNS degrades and eventually stops functioning - tls: DialWithDialer timed out
What happened:
I am running CoreDNS inside Docker on a Raspberry Pi. In this configuration, it functions as a plain-DNS to DNS-over-TLS forwarder. About every 12-24 hours, DNS resolution performance craters before it stops functioning entirely. Just prior to a process restart, the logs are full of errors like this:
[ERROR] plugin/errors: 2 clients4.google.com. A: tls: DialWithDialer timed out
It should be noted that the errors appear before DNS utterly breaks on my network. I am not yet certain whether resolution actually breaks earlier and cached queries simply mask it. In some cases, performance seems to bounce back, as I'll show in the attached screenshot.
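For context, "tls: DialWithDialer timed out" is the error Go's crypto/tls package returns when the TCP connect plus TLS handshake to an upstream does not finish within the dialer's timeout. Below is a minimal, standalone Go sketch of that call path; the two-second timeout and the Cloudflare address are only illustrative, this is not CoreDNS's actual forwarding code:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	// A short timeout stands in for whatever the forwarder allows when
	// dialing its upstream. If the TCP connect plus TLS handshake to
	// 1.1.1.1:853 takes longer than this, DialWithDialer returns the
	// exact "tls: DialWithDialer timed out" error seen in the logs above.
	dialer := &net.Dialer{Timeout: 2 * time.Second}

	conn, err := tls.DialWithDialer(dialer, "tcp", "1.1.1.1:853", &tls.Config{
		ServerName: "tls.cloudflare-dns.com",
	})
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("handshake OK, TLS version:", conn.ConnectionState().Version)
}
```

In other words, the error only says that the handshake to the upstream did not complete in time; it does not by itself say whether the upstream, the network, or the local host was the slow part.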
What you expected to happen:
I expect CoreDNS not to fall over at a regular interval, and to continue resolving DNS queries reliably.
How to reproduce it (as minimally and precisely as possible):
I have two docker containers that are relevant here:
- pihole
- coreDNS
The pihole container is bound to port 53 on the host's network. The coredns container is bound to port 53 on a private container network. Pi-hole receives all DNS queries and, after filtering, forwards them to the coredns container, which then uses Cloudflare's DNS-over-TLS service to resolve them.
Here is a simplified docker-compose file that should get you the same setup; a small query sketch for isolating which hop is failing follows it. Note: I have a similar setup (read: same docker-compose) on x86_64 running in AWS, and I have not noticed this issue there. I can't prove that you need armv7 to reproduce this issue, but it would certainly help to use the same Raspbian image that I am using.
# cat docker-compose.yaml
version: '3.5'
services:
  # The best damn DNS filtering tool out there; please support them!
  # See: https://pi-hole.net/donate/
  pihole:
    container_name: pihole
    # See: https://hub.docker.com/r/pihole/pihole/
    image: pihole/pihole
    # Always restart the container, unless operator has explicitly told us to stop
    restart: unless-stopped
    hostname: pi
    networks:
      # Because it's backend, no ports to open!
      backend:
        ipv4_address: 172.16.241.8
    ports:
      # We will need port 53 to go right into the DNS resolver on piHole
      - target: 53
        published: 53
        protocol: tcp
        mode: host
      - target: 53
        published: 53
        protocol: udp
        mode: host
    cap_add:
      # Needed for binding to ports lower than 1024
      - NET_ADMIN
    # Not a whole lot of logs to deal with, but no need to let them linger... forever
    logging:
      driver: "json-file"
      options:
        max-file: "5"
        max-size: "10m"

  # A wonderful little DNS server; used here to terminate DoT
  coredns:
    container_name: coredns
    # See: https://hub.docker.com/r/coredns/coredns/
    image: coredns/coredns:1.6.5
    # Always restart the container, unless operator has explicitly told us to stop
    restart: unless-stopped
    networks:
      # TLS connections ingress over this guy
      frontend:
        ipv4_address: 172.16.240.9
      # This is how we'll talk to the resolver in piHole
      backend:
        ipv4_address: 172.16.241.9
    hostname: coredns
    volumes:
      # Config
      - /opt/pihole/docker/coredns/vol/config/Corefile:/Corefile:ro
    logging:
      driver: "json-file"
      options:
        max-file: "5"
        max-size: "10m"

# We create two networks; one for inbound and one for backend
networks:
  frontend:
    # use "pretty names" otherwise, docker-compose will generate pseudo random prefixes on the net names
    name: frontend
    # IPAM is IP Address Mgmt
    ipam:
      config:
        - subnet: 172.16.240.0/24
  backend:
    name: backend
    # IPAM is IP Address Mgmt
    ipam:
      config:
        - subnet: 172.16.241.0/24
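As mentioned above, a quick way to tell which hop is failing when resolution degrades is to query the coredns container directly, bypassing Pi-hole. Here is a small sketch using the github.com/miekg/dns library, run from the Pi itself; the 172.16.241.9 address comes from the compose file above and the query name is just the one from the log excerpt:

```go
package main

import (
	"fmt"
	"time"

	"github.com/miekg/dns"
)

func main() {
	// Plain-UDP query straight to the coredns container, bypassing pihole.
	// 172.16.241.9 is the backend address assigned in the compose file above.
	c := &dns.Client{Timeout: 5 * time.Second}

	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn("clients4.google.com"), dns.TypeA)

	r, rtt, err := c.Exchange(m, "172.16.241.9:53")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Printf("rcode=%s rtt=%s answers=%d\n", dns.RcodeToString[r.Rcode], rtt, len(r.Answer))
}
```

If this direct query still answers quickly while clients time out, the problem is likely on the pihole → coredns hop (or masked by caching); if it hangs, the coredns → Cloudflare leg is the place to look.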
Anything else we need to know?:
Environment:
Running bog-standard Raspbian:
# cat /etc/os-release
PRETTY_NAME="Raspbian GNU/Linux 9 (stretch)"
NAME="Raspbian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
VERSION_CODENAME=stretch
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"
On a previous-generation Raspberry Pi (Pi 3 Model B):
# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 38.40
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 1
model name : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 38.40
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 2
model name : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 38.40
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 3
model name : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 38.40
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
Hardware : BCM2835
Revision : a02082
Serial : <Omitted>
Running a reasonably new kernel:
# uname -a
Linux raspberrypi 4.19.66-v7+ #1253 SMP Thu Aug 15 11:49:46 BST 2019 armv7l GNU/Linux
With a relatively low uptime:
# uptime
17:55:34 up 2 days, 12:33, 1 user, load average: 0.13, 0.52, 0.66
And a pretty new version of docker:
# docker version
Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea
 Built:             Wed Nov 13 07:36:04 2019
 OS/Arch:           linux/arm
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea
  Built:            Wed Nov 13 07:30:06 2019
  OS/Arch:          linux/arm
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
With this version of CoreDNS:
Latest/1.6.5
docker container inspect coredns -f '{{.Image}}'
9ebb7d1652753257d2bfd9d1ac35629154438d7208834d398d38bd5ff97f46ec
docker image inspect sha256:9ebb7d1652753257d2bfd9d1ac35629154438d7208834d398d38bd5ff97f46ec -f '{{.RepoTags}}'
[coredns/coredns:1.6.5 coredns/coredns:latest]
docker image inspect sha256:9ebb7d1652753257d2bfd9d1ac35629154438d7208834d398d38bd5ff97f46ec -f '{{.Created}}'
2019-11-05T13:59:28.89969812Z
And a Corefile that's really simple:
config# cat Corefile
.:53 {
    # Note: Add log/health for diagnostics for this ticket; in production only "errors" is present
    log
    errors
    health

    # Forward off to cloudflare, over TLS
    # See: https://developers.cloudflare.com/1.1.1.1/setting-up-1.1.1.1/
    ##
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        # For IPv4: 1.1.1.1 and 1.0.0.1
        # For IPv6: 2606:4700:4700::1111 and 2606:4700:4700::1001
        tls_servername tls.cloudflare-dns.com
    }
}
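Since CoreDNS is built on the github.com/miekg/dns library, the upstream DoT leg can also be exercised in isolation with a few lines that mirror the forward stanza above, i.e. the same upstream address and tls_servername. This is only a diagnostic sketch, not CoreDNS code:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"time"

	"github.com/miekg/dns"
)

func main() {
	// Mirror the Corefile's forward stanza: DNS over TLS to 1.1.1.1:853,
	// verifying the certificate against tls.cloudflare-dns.com.
	c := &dns.Client{
		Net:       "tcp-tls",
		Timeout:   5 * time.Second,
		TLSConfig: &tls.Config{ServerName: "tls.cloudflare-dns.com"},
	}

	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn("clients4.google.com"), dns.TypeA)

	r, rtt, err := c.Exchange(m, "1.1.1.1:853")
	if err != nil {
		// A slow host or upstream shows up here as a timeout, analogous to
		// the "DialWithDialer timed out" errors in the CoreDNS log.
		fmt.Println("DoT query failed:", err)
		return
	}
	fmt.Printf("rcode=%s rtt=%s answers=%d\n", dns.RcodeToString[r.Rcode], rtt, len(r.Answer))
}
```

Running this in a loop on the Pi during an incident would show whether the TLS exchange with 1.1.1.1:853 itself is slow, or whether only CoreDNS's own dials are timing out.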
- logs, if applicable:
See the attached screenshot w/ some more observations.
- Others:
There appears to be one relevant ticket: #2265.
I drafted this ticket on Dec 7 and decided to wait to see if I could capture more details before posting. I don't have anything conclusive, but on the morning of Dec 8 I did notice DNS queries slowing down and failing. I immediately opened an SSH session to the host in question and observed that all 4 cores were pegged and coredns had spewed hundreds of "timed out" errors. I managed to quickly grab a screenshot, although not quite in time to show that the coredns process was what was pegging the CPU.
I should also add that when I have found the resolver to be "broken", the fix was a process restart, and the CPU cores were not pegged at that point. That is, the system's load had returned to lower "normal" levels, but coredns still was not resolving any queries. However, in the incident from the morning of Dec 8, coredns managed to return to a functional state, and system load returned to normal, without having to restart the coredns container.
See my notes from this morning:
(in the event that markdown won't render for you: https://i.imgur.com/7mgGBYe.png or https://imgur.com/a/htIDyIu)
Update: the server's clock had drifted more than 100 seconds ahead of the rest of the world (because k3os, as it turns out, does not configure any NTP servers by default); this may also have had an impact on queries "timing out" prematurely.
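For what it's worth, the offset described in the update can also be measured programmatically; this sketch assumes the third-party github.com/beevik/ntp package (checking with timedatectl or ntpdate -q from a shell works just as well):

```go
package main

import (
	"fmt"
	"log"

	"github.com/beevik/ntp" // third-party NTP client; assumed available
)

func main() {
	// Ask a public pool server how far the local clock is off. An offset of
	// ~100 seconds, as described in the update, is far beyond normal drift.
	resp, err := ntp.Query("pool.ntp.org")
	if err != nil {
		log.Fatalf("NTP query failed: %v", err)
	}
	fmt.Printf("local clock offset: %v\n", resp.ClockOffset)
}
```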