moby: DNS queries sometimes get stuck since upgrading to 1.11.0
Output of docker version:
Client:
Version: 1.11.0
API version: 1.23
Go version: go1.6.1
Git commit: 9e83765
Built:
OS/Arch: linux/amd64
Server:
Version: 1.11.0
API version: 1.23
Go version: go1.6.1
Git commit: 9e83765
Built:
OS/Arch: linux/amd64
Output of docker info:
Containers: 27
Running: 26
Paused: 0
Stopped: 1
Images: 8
Server Version: 1.11.0
Storage Driver: devicemapper
Pool Name: docker-254:4-537178368-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/data/dockerdata
Metadata file: /dev/data/dockermetadata
Data Space Used: 17.63 GB
Data Space Total: 107.4 GB
Data Space Available: 89.74 GB
Metadata Space Used: 33.01 MB
Metadata Space Total: 5.369 GB
Metadata Space Available: 5.336 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.03.01 (2011-10-15)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: null host bridge
Kernel Version: 3.16.7-35-default
Operating System: openSUSE 13.2 (Harlequin) (x86_64)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.35 GiB
Name: shorty
ID: NOTV:7M7T:6HMW:I6DG:4MPD:M7XM:TJKR:R6F4:EXVS:4UV5:BIJZ:KWI5
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): true
File Descriptors: 158
Goroutines: 273
System Time: 2016-04-20T08:28:04.398341864+02:00
EventsListeners: 0
Username: cupracer
Registry: https://index.docker.io/v1/
WARNING: No kernel memory limit support
Additional environment details (AWS, VirtualBox, physical, etc.): This is a single Docker instance on physical hardware. Containers are usually run using Docker Compose. Docker server-side debugging is enabled!
DOCKER_OPTS="-D --dns 8.8.8.8 --storage-opt dm.datadev=/dev/data/dockerdata --storage-opt dm.metadatadev=/dev/data/dockermetadata --storage-opt dm.fs=xfs"
DOCKER_NETWORK_OPTIONS=""
The issue: One of my containers needs to do a lot of DNS queries (~100 per second, if not more); roughly 50% internal (looking up other containers by name) and 50% external (looking up various domain names).
This normally works fine and the performance is great, but after a while (~1-4 hrs) this container’s DNS queries slow down massively, although the query times reported in the output below stay fast. Since the container provides an Apache HTTP server with PHP enabled, it appears unresponsive to the “outside world” because of those hanging DNS queries.
# cat /etc/resolv.conf
nameserver 127.0.0.11
options ndots:0
root@d04348b20152:/# time dig google.de
;; ANSWER SECTION:
google.de. 231 IN A 216.58.201.227
;; Query time: 16 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Wed Apr 20 06:52:35 UTC 2016
;; MSG SIZE rcvd: 54
real 0m0.021s
user 0m0.004s
sys 0m0.000s
root@d04348b20152:/# time dig google.de
;; ANSWER SECTION:
google.de. 224 IN A 216.58.201.227
;; Query time: 15 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Wed Apr 20 06:52:42 UTC 2016
;; MSG SIZE rcvd: 54
real 0m5.021s
user 0m0.004s
sys 0m0.000s
root@d04348b20152:/# time dig google.de
; <<>> DiG 9.9.5-9+deb8u6-Debian <<>> google.de
;; global options: +cmd
;; connection timed out; no servers could be reached
real 0m15.005s
user 0m0.004s
sys 0m0.000s
Internal DNS lookups (resolving the MongoDB and Redis container IPs) work fine all the time, but queries for external DNS names seem to get stuck most of the time. Restarting the affected container clears the situation again (for a while).
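For comparison, the internal lookups can be checked from inside the container like this (the service names mongodb and redis are illustrative; the actual Compose service names are not shown in this report):
root@d04348b20152:/# time dig mongodb +short
root@d04348b20152:/# time dig redis +short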
Furthermore, my system log contains a lot of the following. I found 104 of them today between 08:20:05 and 08:52:00; yesterday, 478 of them were logged.
2016-04-20T08:20:05.669869+02:00 shorty docker[11372]: time="2016-04-20T08:20:05.669802699+02:00" level=error msg="More than 50 concurrent queries from 127.0.0.11:34244"
...
2016-04-20T08:52:00.377952+02:00 shorty docker[11372]: time="2016-04-20T08:52:00.377857662+02:00" level=error msg="More than 50 concurrent queries from 127.0.0.11:34244"
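Counts like the ones above can be obtained with a simple grep against the system log (the exact command used is not given here, and the log path is an assumption; adjust it for your distribution):
# count the "concurrent queries" errors (log path is an assumption)
grep -c 'More than 50 concurrent queries' /var/log/messages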
Pings from within this container to the DNS server 8.8.8.8 and other external addresses work fine. dig queries that use 8.8.8.8 directly instead of 127.0.0.11 work fine, too. The Docker host does not seem to have any problems with DNS resolution or reachability. The container is reachable, too. It’s just the DNS issue that breaks my application.
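For reference, the two query paths can be compared like this (same domain as in the traces above):
root@d04348b20152:/# time dig google.de          # via the embedded DNS server at 127.0.0.11
root@d04348b20152:/# time dig @8.8.8.8 google.de  # bypassing it, straight to Google DNS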
Nothing special gets logged (even in debug mode) regarding a failing DNS query. Just:
2016-04-20T09:32:08.257300+02:00 shorty docker[11372]: time="2016-04-20T09:32:08.257281875+02:00" level=debug msg="To resolve: www.google.de in "
2016-04-20T09:32:08.257334+02:00 shorty docker[11372]: time="2016-04-20T09:32:08.257295204+02:00" level=debug msg="To resolve: www.google in de"
2016-04-20T09:32:08.257347+02:00 shorty docker[11372]: time="2016-04-20T09:32:08.257301203+02:00" level=debug msg="To resolve: www in google.de"
2016-04-20T09:32:08.257368+02:00 shorty docker[11372]: time="2016-04-20T09:32:08.257309488+02:00" level=debug msg="Query www.google.de.[28] from 172.18.0.4:40514, forwarding to udp:8.8.8.8"
I believe that #22144 might be related to my described issue.
Before upgrading to Docker 1.11.0 I didn’t limit Docker to "--dns 8.8.8.8", but I also had problems with issue #22081, so I’m using a single Google DNS server as a workaround.
About this issue
- State: closed
- Created 8 years ago
- Reactions: 8
- Comments: 103 (17 by maintainers)
@sanimej I think that I just found a better workaround!
I picked a container which had that DNS issue and added the option “use-vc” to its “/etc/resolv.conf”. This option forces DNS resolution over TCP instead of UDP. Lookups were immediately working fine again. Removing “use-vc” made the issue appear again.
I added this option to my Docker host’s resolv.conf as well and restarted my containers. Afterwards, all containers got the following resolv.conf content:
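(The pasted file content did not survive in this capture; based on the earlier excerpt and the use-vc option described above, it presumably looked roughly like this:)
# reconstructed example, not the original paste
nameserver 127.0.0.11
options ndots:0 use-vc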
All containers seem to run like a charm now. I’ll monitor their behavior for a few hours and will report the results later on.
Sorry for creating more noise, but I’m wondering about two things:
Thanks!
Thanks to @cupracer and @sanimej for the detailed analysis reports. We are seeing the "more than 100 concurrent queries" errors in /var/log/syslog. Our web servers have a high load of around 1000 qps in total, with roughly 100 qps on each container.
Our Docker version is as follows:
In /etc/default/docker, we have:
DOCKER_OPTS="--dns 8.8.8.8 --dns 8.8.4.4"
From various discussions on the web around these errors, we were considering these two solutions.
Any help is much appreciated!
Just to confirm, I’m seeing a 10-second wait for DNS responses inside containers (not all, just one):
Doing a dig against the consul agent directly from outside the culprit container yields an answer instantly.
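Such a check would look something like this (assuming the Consul agent answers DNS on the default port 8600 and using an illustrative service name; the actual command was not included in the comment):
dig @127.0.0.1 -p 8600 myservice.service.consul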
From my testing, 1.12 works well.
@narel @jquacinella @estehnet Earlier in this thread it was reported that switching to TCP DNS worked around the problem. Did you try that before switching to external DNS?
I’m asking because we have also been hit by this problem: we tried to containerize a legacy VoIP server application which does a bunch of DNS resolving every 5 minutes, and after a few days it started having problems resolving both external and container names, so we rolled back. Since then we’ve not been able to reproduce it in our lab.
I’m also considering using a service like Consul to bypass the Docker embedded DNS entirely, while still being able to resolve container names.
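A minimal sketch of that approach, assuming the default bridge network (where the embedded resolver is not in play), a Consul agent or forwarder answering DNS on the docker0 address 172.17.0.1, and myimage standing in for the real image:
# point the container straight at the Consul-backed resolver (addresses and image name are assumptions)
docker run --dns 172.17.0.1 --dns-search service.consul myimage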
2 days in production - rc4 works fine - anyone else?
@cpuguy83 I am working on a libnetwork PR to fix this. Let’s close the issue when it’s vendored in.
Yes, this will be fixed for 1.12