dns: DNS resolution failed until restart of container

Hi! First of all, thank you for your amazing work, I’m a big fan of your project.

I have a question for you: does the LRU cache save empty responses? I’m asking because something strange happened yesterday.

FYI I upgraded to v2.0.0-beta about two weeks ago. Everything was working fine until yesterday.

For context, some of my services resolve a specific hostname (let’s call it acme.com) very often, about once per second. So they are basically hitting the LRU cache most of the time, if not all the time.

Yesterday, for some reason, my services could no longer resolve that hostname. After I noticed the issue I tried:

> ping acme.com
ping: acme.com: Name or service not known

But at the same time I could still resolve google.com:

> ping google.com
PING google.com(fra24s04-in-x0e.1e100.net (2a00:1450:4001:827::200e)) 56 data bytes
64 bytes from fra24s04-in-x0e.1e100.net (2a00:1450:4001:827::200e): icmp_seq=1 ttl=118 time=5.11 ms
64 bytes from fra24s04-in-x0e.1e100.net (2a00:1450:4001:827::200e): icmp_seq=2 ttl=118 time=5.25 ms
64 bytes from fra24s04-in-x0e.1e100.net (2a00:1450:4001:827::200e): icmp_seq=3 ttl=118 time=5.33 ms

Once I restarted the DNS container, the hostname resolved again right away.

My config is very simple:

version: "3.7"

services:
  dns-server:
    image: qmcgaw/dns:v2.0.0-beta
    container_name: dns-server
    restart: always
    environment:
      - BLOCK_MALICIOUS=on
      - BLOCK_SURVEILLANCE=on
      - BLOCK_ADS=on
      - METRICS_TYPE=prometheus
    ports:
      - 127.0.0.1:53:53/udp
      - 172.17.0.1:53:53/udp
      - 172.17.0.1:9892:9090/tcp
    network_mode: bridge

These are the last logs written before I restarted my DNS container. There are some warnings, but I don’t think they are important. FYI the issue started on the 28th around 17:30.

2021/10/27 21:46:39 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:35246->1.1.1.1:853: i/o timeout
2021/10/27 23:00:01 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:53620->1.1.1.1:853: i/o timeout
2021/10/28 01:54:45 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:60710->1.1.1.1:853: i/o timeout
2021/10/28 03:42:38 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:50438->1.1.1.1:853: i/o timeout
2021/10/28 03:54:29 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:54982->1.0.0.1:853: i/o timeout
2021/10/28 09:49:23 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:36820->1.0.0.1:853: i/o timeout
2021/10/28 11:09:58 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:36598->1.1.1.1:853: i/o timeout
2021/10/28 12:46:54 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:33552->1.0.0.1:853: i/o timeout
2021/10/28 13:26:33 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:43944->1.0.0.1:853: i/o timeout
2021/10/28 21:17:11 WARN cannot exchange over DoT connection: read tcp 172.17.0.2:52100->1.1.1.1:853: i/o timeout
2021/10/28 21:29:20 INFO planned periodic restart of DNS server
2021/10/28 21:29:20 INFO downloading and building DNS block lists
2021/10/28 21:29:21 INFO 1220920 hostnames blocked overall
2021/10/28 21:29:21 INFO 29956 IP addresses blocked overall
2021/10/28 21:29:21 INFO 2482 IP networks blocked overall
2021/10/28 21:29:21 INFO starting DNS server
2021/10/28 21:29:21 INFO DNS server listening on :53

Could it be that the DNS server saved an empty resolution in the LRU cache during one of those timeouts and then kept serving that empty resolution to my services forever? What do you think?
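
To make the question more concrete, here is roughly the behaviour I imagine could cause this. It is only an illustration of my suspicion with made-up names (cache, exchanger, resolve), not your actual code; it just assumes the server checks the cache before exchanging with the upstream DoT server.

package example

import "github.com/miekg/dns"

// cache and exchanger are stand-ins for whatever the server uses internally.
type cache interface {
	Get(request *dns.Msg) *dns.Msg
	Add(request, response *dns.Msg)
}

type exchanger interface {
	Exchange(request *dns.Msg) (*dns.Msg, error)
}

func resolve(c cache, upstream exchanger, request *dns.Msg) (*dns.Msg, error) {
	if cached := c.Get(request); cached != nil {
		// If an empty response was ever stored for acme.com, it would be
		// returned from here on every lookup, which matches what I saw.
		return cached, nil
	}
	response, err := upstream.Exchange(request)
	if err != nil {
		return nil, err // e.g. the DoT i/o timeouts in the logs above
	}
	// If this Add accepts empty responses, a single bad exchange around one
	// of those timeouts could poison the cache for that hostname.
	c.Add(request, response)
	return response, nil
}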

Thank you very much for your help.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 21 (11 by maintainers)

Most upvoted comments

Empty responses (or responses with N empty records) are no longer inserted into the cache as of commit 70229f5b1644416ee99310e653e110b00a33c9bc.
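
Roughly, the new behaviour amounts to a guard like this before inserting into the LRU cache. This is a simplified sketch rather than the exact code from that commit, and isCacheable is just an illustrative name:

package cache

import "github.com/miekg/dns"

// isCacheable reports whether a response is worth storing: responses that are
// nil or carry no answer records (for example after an upstream timeout) are
// skipped, so they can no longer be served back from the cache indefinitely.
func isCacheable(response *dns.Msg) bool {
	return response != nil && len(response.Answer) > 0
}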

Let me know if you encounter the problem again. Enjoy! 😉

OK, so I broke a few things when doing #111 recently; they are fixed in this build in progress:

  • listening address is fixed in 63d2081ec76234d9c8eb57f9f0fba1d91c2f2ca7
  • update period is fixed in 1eeac399af6f99611d51566410ce61cbdb6bd4fe to allow 0
  • Cache Prometheus metrics (subsystem empty) fixed in 073bb9034f84d17a597fea9ac0a87745e2cfce20
  • Filter Prometheus metrics fixed in 500764e1120f9db9208addfa6fcd32dc4392b570

I’ll monitor my own dashboard to see if I can spot any oddities as well.

Well, heh, through my make commands I just pulled your latest updates to v2.0.0-beta and now I no longer get any response from the server whatsoever, unfortunately 😦

Recent images from today were broken (oops, my bad, but it’s beta). Try re-pulling; the latest v2.0.0-beta image should be working 😉

I’m aiming to release a stable v2 this weekend-ish (after a year lol), so I’m definitely interested if it’s unstable.

@Kampe maybe it loses connectivity for another reason? It has been running on amd64 for weeks without issue here 🤔 Maybe try setting up the Prometheus + Grafana dashboard https://github.com/qdm12/dns/tree/v2.0.0-beta/readme/metrics to see why/how it fails?

A bit too fast: commit a750d22112a333afae04b702933741f9fed18556 overwrites it, since one of my tests was failing 😄 The image should be built soon.

Hey there! I’m glad you’re having fun with it.

I had a similar issue where it would not manage to resolve my own DNS record (empty cached response) and my DDNS container would update the record like a madman… Not sure if that’s related.

The code adds to the lru cache here: https://github.com/qdm12/dns/blob/master/pkg/cache/lru/lru.go#L34

And it doesn’t check whether the response is empty. I will add that. The cache Get function, however, respects the TTL of the record. So if an empty response got cached and the record got updated, the empty entry will still be evicted from the cache once its set TTL is over (usually 1 hour).
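
To illustrate the TTL part, here is a simplified sketch of a TTL-aware Get. The field and function names are illustrative only; the real implementation is in pkg/cache/lru/lru.go linked above:

package lru

import (
	"time"

	"github.com/miekg/dns"
)

type entry struct {
	response *dns.Msg
	storedAt time.Time
	ttl      time.Duration // typically the smallest TTL among the answer records
}

type LRU struct {
	entries map[string]*entry // keyed by question name and type
}

// Get returns the cached response for key, evicting it first if its TTL has
// elapsed, so even a bad (e.g. empty) entry ages out eventually.
func (l *LRU) Get(key string) *dns.Msg {
	e, ok := l.entries[key]
	if !ok {
		return nil
	}
	if time.Since(e.storedAt) > e.ttl {
		delete(l.entries, key) // TTL expired: evict instead of serving stale data
		return nil
	}
	return e.response
}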