coredns: CoreDNS degrades and eventually stops functioning - tls: DialWithDialer timed out
What happened:
I am running CoreDNS inside Docker on a Raspberry Pi. In this configuration, it functions as a plain-DNS to DNS-over-TLS forwarder. About every 12-24 hours, DNS resolution performance craters before it stops functioning entirely. Just prior to a process restart, the logs are full of errors like this:
[ERROR] plugin/errors: 2 clients4.google.com. A: tls: DialWithDialer timed out
It should be noted that the errors appear before DNS utterly breaks on my network. I am not yet certain whether resolution actually breaks earlier and cached queries simply mask it. In some cases, performance seems to bounce back, as I'll show in the attached screenshot.
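For context, "tls: DialWithDialer timed out" is the error Go's crypto/tls package returns when the TCP connect plus TLS handshake to an upstream does not finish within the dialer's timeout. Below is a minimal, standalone Go sketch of that call path; the two-second timeout and the Cloudflare address are only illustrative, this is not CoreDNS's actual forwarding code:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	// A short timeout stands in for whatever the forwarder allows when
	// dialing its upstream. If the TCP connect plus TLS handshake to
	// 1.1.1.1:853 takes longer than this, DialWithDialer returns the
	// exact "tls: DialWithDialer timed out" error seen in the logs above.
	dialer := &net.Dialer{Timeout: 2 * time.Second}

	conn, err := tls.DialWithDialer(dialer, "tcp", "1.1.1.1:853", &tls.Config{
		ServerName: "tls.cloudflare-dns.com",
	})
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("handshake OK, TLS version:", conn.ConnectionState().Version)
}
```

In other words, the error only says that the handshake to the upstream did not complete in time; it does not by itself say whether the upstream, the network, or the local host was the slow part.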
What you expected to happen:
I expect CoreDNS not to fall over at a regular interval, and to continue resolving DNS queries reliably.
How to reproduce it (as minimally and precisely as possible):
I have two docker containers that are relevant here:
- pihole
- coreDNS
The pihole container is bound to port 53 on the host's network. The coredns container is bound to port 53 on a private container network. Pi-hole receives all DNS queries and, after filtering, forwards them to the coredns container, which then uses Cloudflare's DNS-over-TLS service to resolve them.
Here is a simplified docker-compose file that should get you the same setup; a small query sketch for isolating which hop is failing follows it. Note: I have a similar setup (read: same docker-compose) on x86_64 running in AWS, and I have not noticed this issue there. I can't prove that you need armv7 to reproduce this issue, but it would certainly help to use the same Raspbian image that I am using.
# cat docker-compose.yaml
version: '3.5'
services:
  # The best damn DNS filtering tool out there; please support them!
  # See: https://pi-hole.net/donate/
  pihole:
    container_name: pihole
    # See: https://hub.docker.com/r/pihole/pihole/
    image: pihole/pihole
    # Always restart the container, unless operator has explicitly told us to stop
    restart: unless-stopped
    hostname: pi
    networks:
      # Because it's backend, no ports to open!
      backend:
        ipv4_address: 172.16.241.8
    ports:
      # We will need port 53 to go right into the DNS resolver on piHole
      - target: 53
        published: 53
        protocol: tcp
        mode: host
      - target: 53
        published: 53
        protocol: udp
        mode: host
    cap_add:
      # Needed for binding to ports lower than 1024
      - NET_ADMIN
    # Not a whole lot of logs to deal with, but no need to let them linger... forever
    logging:
      driver: "json-file"
      options:
        max-file: "5"
        max-size: "10m"

  # A wonderful little DNS server; used here to terminate DoT
  coredns:
    container_name: coredns
    # See: https://hub.docker.com/r/coredns/coredns/
    image: coredns/coredns:1.6.5
    # Always restart the container, unless operator has explicitly told us to stop
    restart: unless-stopped
    networks:
      # TLS connections ingress over this guy
      frontend:
        ipv4_address: 172.16.240.9
      # This is how we'll talk to the resolver in piHole
      backend:
        ipv4_address: 172.16.241.9
    hostname: coredns
    volumes:
      # Config
      - /opt/pihole/docker/coredns/vol/config/Corefile:/Corefile:ro
    logging:
      driver: "json-file"
      options:
        max-file: "5"
        max-size: "10m"

# We create two networks; one for inbound and one for backend
networks:
  frontend:
    # use "pretty names" otherwise, docker-compose will generate pseudo random prefixes on the net names
    name: frontend
    # IPAM is IP Address Mgmt
    ipam:
      config:
        - subnet: 172.16.240.0/24
  backend:
    name: backend
    # IPAM is IP Address Mgmt
    ipam:
      config:
        - subnet: 172.16.241.0/24
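As mentioned above, a quick way to tell which hop is failing when resolution degrades is to query the coredns container directly, bypassing Pi-hole. Here is a small sketch using the github.com/miekg/dns library, run from the Pi itself; the 172.16.241.9 address comes from the compose file above and the query name is just the one from the log excerpt:

```go
package main

import (
	"fmt"
	"time"

	"github.com/miekg/dns"
)

func main() {
	// Plain-UDP query straight to the coredns container, bypassing pihole.
	// 172.16.241.9 is the backend address assigned in the compose file above.
	c := &dns.Client{Timeout: 5 * time.Second}

	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn("clients4.google.com"), dns.TypeA)

	r, rtt, err := c.Exchange(m, "172.16.241.9:53")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Printf("rcode=%s rtt=%s answers=%d\n", dns.RcodeToString[r.Rcode], rtt, len(r.Answer))
}
```

If this direct query still answers quickly while clients time out, the problem is likely on the pihole → coredns hop (or masked by caching); if it hangs, the coredns → Cloudflare leg is the place to look.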
Anything else we need to know?:
Environment:
Running bog-standard Raspbian:
# cat /etc/os-release
PRETTY_NAME="Raspbian GNU/Linux 9 (stretch)"
NAME="Raspbian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
VERSION_CODENAME=stretch
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"
On a previous-generation Raspberry Pi (Pi 3 Model B):
# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 38.40
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 1
model name : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 38.40
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 2
model name : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 38.40
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 3
model name : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 38.40
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
Hardware : BCM2835
Revision : a02082
Serial : <Omitted>
Running a reasonably new kernel:
# uname -a
Linux raspberrypi 4.19.66-v7+ #1253 SMP Thu Aug 15 11:49:46 BST 2019 armv7l GNU/Linux
With a relatively low uptime:
# uptime
17:55:34 up 2 days, 12:33, 1 user, load average: 0.13, 0.52, 0.66
And a pretty new version of docker:
# docker version
Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea
 Built:             Wed Nov 13 07:36:04 2019
 OS/Arch:           linux/arm
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea
  Built:            Wed Nov 13 07:30:06 2019
  OS/Arch:          linux/arm
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
With this version of CoreDNS:
Latest/1.6.5
docker container inspect coredns -f '{{.Image}}'
9ebb7d1652753257d2bfd9d1ac35629154438d7208834d398d38bd5ff97f46ec
docker image inspect sha256:9ebb7d1652753257d2bfd9d1ac35629154438d7208834d398d38bd5ff97f46ec -f '{{.RepoTags}}'
[coredns/coredns:1.6.5 coredns/coredns:latest]
docker image inspect sha256:9ebb7d1652753257d2bfd9d1ac35629154438d7208834d398d38bd5ff97f46ec -f '{{.Created}}'
2019-11-05T13:59:28.89969812Z
And a Corefile that's really simple:
config# cat Corefile
.:53 {
    # Note: Add log/health for diagnostics for this ticket; in production only "errors" is present
    log
    errors
    health

    # Forward off to cloudflare, over TLS
    # See: https://developers.cloudflare.com/1.1.1.1/setting-up-1.1.1.1/
    ##
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        # For IPv4: 1.1.1.1 and 1.0.0.1
        # For IPv6: 2606:4700:4700::1111 and 2606:4700:4700::1001
        tls_servername tls.cloudflare-dns.com
    }
}
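Since CoreDNS is built on the github.com/miekg/dns library, the upstream DoT leg can also be exercised in isolation with a few lines that mirror the forward stanza above, i.e. the same upstream address and tls_servername. This is only a diagnostic sketch, not CoreDNS code:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"time"

	"github.com/miekg/dns"
)

func main() {
	// Mirror the Corefile's forward stanza: DNS over TLS to 1.1.1.1:853,
	// verifying the certificate against tls.cloudflare-dns.com.
	c := &dns.Client{
		Net:       "tcp-tls",
		Timeout:   5 * time.Second,
		TLSConfig: &tls.Config{ServerName: "tls.cloudflare-dns.com"},
	}

	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn("clients4.google.com"), dns.TypeA)

	r, rtt, err := c.Exchange(m, "1.1.1.1:853")
	if err != nil {
		// A slow host or upstream shows up here as a timeout, analogous to
		// the "DialWithDialer timed out" errors in the CoreDNS log.
		fmt.Println("DoT query failed:", err)
		return
	}
	fmt.Printf("rcode=%s rtt=%s answers=%d\n", dns.RcodeToString[r.Rcode], rtt, len(r.Answer))
}
```

Running this in a loop on the Pi during an incident would show whether the TLS exchange with 1.1.1.1:853 itself is slow, or whether only CoreDNS's own dials are timing out.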
- logs, if applicable:
See the attached screenshot w/ some more observations.
- Others:
There appears to be one relevant ticket: #2265.
I drafted this ticket on Dec 7 and decided to wait to see if I could capture more details before posting. I don't have anything conclusive, but on the morning of Dec 8 I did notice DNS queries slowing down and failing. I immediately opened an SSH session to the host in question and observed that all 4 cores were pegged and coredns had spewed hundreds of "timed out" errors. I managed to quickly grab a screenshot, although not quite in time to show that the coredns process was what was pegging the CPU.
I should also add that when I have found the resolver to be "broken", the fix was a process restart, and the CPU cores were not pegged at that point. That is, the system's load had returned to lower "normal" levels, but coredns still was not resolving any queries. However, in the incident from the morning of Dec 8, coredns managed to return to a functional state, and system load returned to normal, without having to restart the coredns container.
See my notes from this morning:
(in the event that markdown won't render for you: https://i.imgur.com/7mgGBYe.png or https://imgur.com/a/htIDyIu)
Update: the server's clock had drifted more than 100 seconds ahead of the rest of the world (because k3os, as it turns out, does not configure any NTP servers by default); this may also have had an impact on queries "timing out" prematurely.
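For what it's worth, the offset described in the update can also be measured programmatically; this sketch assumes the third-party github.com/beevik/ntp package (checking with timedatectl or ntpdate -q from a shell works just as well):

```go
package main

import (
	"fmt"
	"log"

	"github.com/beevik/ntp" // third-party NTP client; assumed available
)

func main() {
	// Ask a public pool server how far the local clock is off. An offset of
	// ~100 seconds, as described in the update, is far beyond normal drift.
	resp, err := ntp.Query("pool.ntp.org")
	if err != nil {
		log.Fatalf("NTP query failed: %v", err)
	}
	fmt.Printf("local clock offset: %v\n", resp.ClockOffset)
}
```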