moby: DNS queries sometimes get stuck since upgrading to 1.11.0

Output of docker version:

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.6.1
 Git commit:   9e83765
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.6.1
 Git commit:   9e83765
 Built:
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 27
 Running: 26
 Paused: 0
 Stopped: 1
Images: 8
Server Version: 1.11.0
Storage Driver: devicemapper
 Pool Name: docker-254:4-537178368-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/data/dockerdata
 Metadata file: /dev/data/dockermetadata
 Data Space Used: 17.63 GB
 Data Space Total: 107.4 GB
 Data Space Available: 89.74 GB
 Metadata Space Used: 33.01 MB
 Metadata Space Total: 5.369 GB
 Metadata Space Available: 5.336 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.03.01 (2011-10-15)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 3.16.7-35-default
Operating System: openSUSE 13.2 (Harlequin) (x86_64)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.35 GiB
Name: shorty
ID: NOTV:7M7T:6HMW:I6DG:4MPD:M7XM:TJKR:R6F4:EXVS:4UV5:BIJZ:KWI5
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): true
 File Descriptors: 158
 Goroutines: 273
 System Time: 2016-04-20T08:28:04.398341864+02:00
 EventsListeners: 0
Username: cupracer
Registry: https://index.docker.io/v1/
WARNING: No kernel memory limit support

Additional environment details (AWS, VirtualBox, physical, etc.): This is a single Docker instance on physical hardware. Containers are usually run using Docker Compose. Docker server-side debugging is enabled!

DOCKER_OPTS="-D --dns 8.8.8.8 --storage-opt dm.datadev=/dev/data/dockerdata --storage-opt dm.metadatadev=/dev/data/dockermetadata --storage-opt dm.fs=xfs"
DOCKER_NETWORK_OPTIONS=""

The issue: One of my containers needs to do a lot of DNS queries (~100 per second, if not more), roughly 50% internal (looking up other containers by name) and 50% external (looking up various domain names).

This normally works fine and performance is great, but after a while (~1-4 hrs) this container’s DNS queries slow down massively, even though the query times reported by dig in the output below stay fast. Since the container provides an Apache HTTP server with PHP enabled, it appears unresponsive to the “outside world” because of those hanging DNS queries.

# cat /etc/resolv.conf
nameserver 127.0.0.11
options ndots:0
root@d04348b20152:/# time dig google.de

;; ANSWER SECTION:
google.de.              231     IN      A       216.58.201.227

;; Query time: 16 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Wed Apr 20 06:52:35 UTC 2016
;; MSG SIZE  rcvd: 54

real    0m0.021s
user    0m0.004s
sys     0m0.000s

root@d04348b20152:/# time dig google.de

;; ANSWER SECTION:
google.de.              224     IN      A       216.58.201.227

;; Query time: 15 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Wed Apr 20 06:52:42 UTC 2016
;; MSG SIZE  rcvd: 54


real    0m5.021s
user    0m0.004s
sys     0m0.000s

root@d04348b20152:/# time dig google.de

; <<>> DiG 9.9.5-9+deb8u6-Debian <<>> google.de
;; global options: +cmd
;; connection timed out; no servers could be reached

real    0m15.005s
user    0m0.004s
sys     0m0.000s
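
The slowdown is easy to watch for with a small timing loop inside the container (a rough sketch, assuming bash and dig are available in the image; google.de is just an example name):

# Repeatedly time lookups through the embedded resolver (127.0.0.11).
# Wall-clock times jumping from ~0.02s to 5s or a 15s timeout match the behaviour above.
while true; do
  ( time dig +tries=1 +time=5 @127.0.0.11 google.de > /dev/null ) 2>&1 | grep real
  sleep 0.1
done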

Internal DNS lookups (resolving the MongoDB and Redis containers’ IPs) work fine all the time, but queries for external DNS names seem to get stuck most of the time. Restarting the affected container clears the situation again (for a while).

Furthermore, my system log contains a lot of the following. I found 104 of them today between 08:20:05 and 08:52:00; yesterday, 478 were logged.

2016-04-20T08:20:05.669869+02:00 shorty docker[11372]: time="2016-04-20T08:20:05.669802699+02:00" level=error msg="More than 50 concurrent queries from 127.0.0.11:34244"
...
2016-04-20T08:52:00.377952+02:00 shorty docker[11372]: time="2016-04-20T08:52:00.377857662+02:00" level=error msg="More than 50 concurrent queries from 127.0.0.11:34244"

Pings from within this container to the DNS server 8.8.8.8 and other external addresses work fine. Digs that query 8.8.8.8 directly instead of 127.0.0.11 work fine, too. The Docker host does not seem to have any problems with DNS resolution or reachability, and the container itself is reachable. It’s just the DNS issue that breaks my application.
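
Roughly, those checks look like this (a sketch; the hostname is just an example):

# Inside the affected container:
ping -c 3 8.8.8.8              # upstream resolver reachable: OK
dig @8.8.8.8 google.de         # bypassing the embedded DNS: fast answer
dig @127.0.0.11 google.de      # through the embedded DNS: hangs or times out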

Nothing special gets logged (even in debug mode) regarding a failing DNS query, just:

2016-04-20T09:32:08.257300+02:00 shorty docker[11372]: time="2016-04-20T09:32:08.257281875+02:00" level=debug msg="To resolve: www.google.de in "
2016-04-20T09:32:08.257334+02:00 shorty docker[11372]: time="2016-04-20T09:32:08.257295204+02:00" level=debug msg="To resolve: www.google in de"
2016-04-20T09:32:08.257347+02:00 shorty docker[11372]: time="2016-04-20T09:32:08.257301203+02:00" level=debug msg="To resolve: www in google.de"
2016-04-20T09:32:08.257368+02:00 shorty docker[11372]: time="2016-04-20T09:32:08.257309488+02:00" level=debug msg="Query www.google.de.[28] from 172.18.0.4:40514, forwarding to udp:8.8.8.8"
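
When a query hangs, it can also be worth checking on the Docker host whether the forwarded UDP packets actually go out to the upstream server and whether replies come back (a sketch, run on the host; assumes tcpdump is installed):

# Watch DNS traffic between the host and the upstream resolver
tcpdump -ni any 'udp port 53 and host 8.8.8.8'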

I believe that #22144 might be related to my described issue.

Before upgrading to Docker 1.11.0 I didn’t limit Docker to “--dns 8.8.8.8”, but since I also had problems with issue #22081, I’m now using a single Google DNS server as a workaround.

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 8
  • Comments: 103 (17 by maintainers)

Most upvoted comments

@sanimej I think that I just found a better workaround!

I picked a container that had the DNS issue and added the option “use-vc” to its “/etc/resolv.conf”. This option forces DNS resolution over TCP instead of UDP. Lookups immediately worked fine again. Removing “use-vc” made the issue reappear.

I added this option to my Docker host’s resolv.conf as well and restarted my containers. Afterwards all containers got the following resolv.conf content:

# cat /etc/resolv.conf
nameserver 127.0.0.11
options use-vc ndots:0

All containers seem to run like a charm now. I’ll monitor their behavior for a few hours and will report the results later on.
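
For containers started individually, the same resolver option can presumably also be set per container instead of via the host’s resolv.conf, e.g. with --dns-opt (a sketch; check that your Docker version supports the flag):

# Force this container's resolver to use TCP (use-vc) towards 127.0.0.11
docker run --dns 8.8.8.8 --dns-opt use-vc <image>

Compose has an equivalent dns_opt key in file formats that support it.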

Sorry for creating more noise, but I’m wondering about two things:

  1. Am I the only one left who still has this problem? Can someone else by chance confirm that the issue still exists? Maybe @sanmai-NL ?
  2. Should we re-open this issue, since it’s currently closed? @sanimej @vdemeester

Thanks!

Thanks to @cupracer and @sanimej for the detailed analysis reports. We are seeing the “more than 100 concurrent queries” errors in /var/log/syslog. Our web servers are under high load, around 1000 qps in total and roughly 100 qps per container.

Dec  7 00:56:12 cs1-wimt dockerd[11913]: time="2017-12-07T00:56:12.344055493+05:30" level=error msg="[resolver] more than 100 concurrent queries from 172.20.0.12:40477"
Dec  7 00:56:16 cs1-wimt dockerd[11913]: time="2017-12-07T00:56:16.189711191+05:30" level=error msg="[resolver] more than 100 concurrent queries from 172.20.0.4:53545"

Our Docker version is as follows:

Client:
 Version:      17.11.0-ce
 API version:  1.34
 Go version:   go1.8.3
 Git commit:   1caf76c
 Built:        Mon Nov 20 18:36:37 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.11.0-ce
 API version:  1.34 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   1caf76c
 Built:        Mon Nov 20 18:35:09 2017
 OS/Arch:      linux/amd64
 Experimental: false

In /etc/default/docker, we have DOCKER_OPTS="--dns 8.8.8.8 --dns 8.8.4.4"

From various discussions on the web around these errors, we are considering two approaches:

  1. Is there a way to raise the 100-query concurrency limit via some Docker config?
  2. Does this issue happen mainly because of rate limiting by Google’s DNS? We are facing it mostly at peak load times. Would hosting an internal DNS cache server and replacing 8.8.8.8 help in this situation (see the sketch below)?

Any help is much appreciated!
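
On the second point, one common approach is to run a caching resolver on the Docker host and point dockerd at it instead of at Google directly. A rough sketch, assuming dnsmasq and the default docker0 bridge address 172.17.0.1 (both assumptions; adjust to your distribution and network setup):

# On the Docker host: run a caching forwarder listening on the bridge address
apt-get install dnsmasq
echo 'listen-address=172.17.0.1' >> /etc/dnsmasq.conf
echo 'server=8.8.8.8' >> /etc/dnsmasq.conf
systemctl restart dnsmasq

# Then hand the host-local cache to containers instead of 8.8.8.8:
# DOCKER_OPTS="--dns 172.17.0.1 --dns 8.8.4.4"

This mainly reduces latency and upstream rate limiting; whether it also avoids the embedded resolver’s concurrent-query limit is a separate question.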

Just to confirm, I’m seeing a 10-second wait for DNS responses inside containers (not all, just one):

Time Before: Fri Apr 22 07:57:58 UTC 2016

; <<>> DiG 9.9.5-3ubuntu0.8-Ubuntu <<>> api-db.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23071
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;api-db.service.consul. IN  A

;; ANSWER SECTION:
sendify-api-db.service.consul. 0 IN A   172.28.0.22

;; Query time: 2 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Fri Apr 22 07:58:08 UTC 2016
;; MSG SIZE  rcvd: 63

Time After: Fri Apr 22 07:58:08 UTC 2016

Doing a dig against the Consul agent directly, from outside the culprit container, yields an answer instantly.

From my testing, 1.12 works well.

On Aug 8, 2016, at 3:26 AM, Quentin Varquet notifications@github.com wrote:

Hello, I am using docker 1.11 and I have the same problem.

After one week, my container (a monitoring tool) can no longer resolve any hostnames (DNS problem).

Can you tell me if the problem is solved in version 1.12, before I decide to upgrade my production server?

Thank you


@narel @jquacinella @estehnet Earlier in this thread it was reported that switching DNS resolution to TCP fixed the problem as a workaround. Did you try that before switching to an external DNS?

I’m asking because we have also been hit by this problem: we tried to containerize a legacy VoIP server application that does a lot of DNS resolution every 5 minutes, and after a few days it started to have problems resolving both external and container names, so we rolled back. Since then we have not been able to reproduce it in our lab.

I’m also considering using a service like Consul to completely bypass Docker’s embedded DNS, while still being able to resolve container names.

2 days in production - rc4 works fine - anyone else?

@cpuguy83 I am working on a libnetwork PR to fix this. Let’s close the issue when it’s vendored in.

Yes, this will be fixed for 1.12