moby: Docker swarm DNS periodically fails
In a 3-master swarm, every few days one of the nodes stops resolving service names through the internal swarm DNS. The swarm operates fine for several days, then DNS simply stops working on that node. I don't yet know whether a specific change on our side triggers the issue; our system does automated deploys, and we haven't yet correlated any aspect of those deploys with when the issue is triggered.
Steps to reproduce the issue:
- Run a 3-master docker swarm in EC2 for several days
- Sometimes after 2-4 days, one of the nodes can no longer resolve DNS names for services running elsewhere in the swarm, while services running locally on that node still resolve.
- There is no step 3.
Describe the results you received: We run nginx inside our swarm as a reverse HTTP proxy to our various services. We know that DNS isn't working because we have disabled nginx DNS caching, and nginx suddenly stops resolving the IP of the service where our application is running. When I run nslookup inside containers on the affected node, they fail to resolve any of the services running on other nodes in the swarm, but they do resolve services running on the same node.
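For reference, this is roughly the check we run from a container attached to the overlay network; the container and service names are placeholders, not our actual names:

# from a container on the affected node, look up a service running on another node
docker exec -it <container-on-affected-node> nslookup <service-name>

# compare: a service that also has a task on the same node resolves fine
docker exec -it <container-on-affected-node> nslookup <local-service-name>

# tasks.<service-name> resolves the individual task IPs instead of the service VIP,
# which helps distinguish a missing VIP entry from missing task entries
docker exec -it <container-on-affected-node> nslookup tasks.<service-name>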
Describe the results you expected: For name resolution to continue working.
Additional information you deem important (e.g. issue happens only occasionally): The issue only happens occasionally. It is resolved by restarting the docker daemon on the node that can no longer resolve DNS.
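For completeness, this is the restart we do on the affected node (our hosts are systemd-based Ubuntu 16.04; since Live Restore is disabled, this also restarts the containers on that node):

# restart the Docker daemon on the node that lost DNS resolution
sudo systemctl restart docker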
Output of docker version:
Client:
Version: 17.05.0-ce
API version: 1.29
Go version: go1.7.5
Git commit: 89658be
Built: Thu May 4 22:10:54 2017
OS/Arch: linux/amd64
Server:
Version: 17.05.0-ce
API version: 1.29 (minimum version 1.12)
Go version: go1.7.5
Git commit: 89658be
Built: Thu May 4 22:10:54 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 19
Running: 12
Paused: 0
Stopped: 7
Images: 24
Server Version: 17.05.0-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 195
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: 6mugebxyus7dgoip9i165mj64
Is Manager: true
ClusterID: mn6l9qnshdxzzxfoxwpsa18xe
Managers: 3
Nodes: 3
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 10.0.46.77
Manager Addresses:
10.0.101.134:2377
10.0.109.151:2377
10.0.46.77:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-57-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.674GiB
Name: ip-10-0-46-77
ID: XMND:RWXY:RGVB:F4BK:TDA3:LQ3W:URQH:T6ES:HLQE:74A7:FQG4:RIYY
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: authentiseautomation
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional environment details (AWS, VirtualBox, physical, etc.): AWS EC2 instances
About this issue
- State: closed
- Created 7 years ago
- Comments: 23 (5 by maintainers)
@EliRibble In swarm mode, service discovery information is maintained at a per-network, per-service level. A swarm overlay network is created on the manager, but the actual kernel data-path instantiation happens at the node level, when a task is scheduled on that node and attached to the network. That network is cleaned up from the kernel data path when the last task on the node goes down (the network still remains at the swarm manager level). The same happens for the service information, i.e. when there are no more tasks on a given network, all the service info associated with that network is cleaned up.
In a scenario where tasks are going up and down on a network on a given node, there can be races between the creation and deletion of the service and network structures, depending on how quickly that happens. Some of these races weren't handled correctly, leading to inconsistent state. This glosses over the gory details, but that's the gist of it.
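One way to see what a given node currently knows is to inspect the overlay network on that node with the verbose flag, which dumps the node-local service discovery entries. Note the --verbose flag is fairly new and may require 17.06+, and the network name below is just an example:

# on the affected node, find the overlay network the services are attached to
docker network ls --filter driver=overlay

# dump this node's view of the services/tasks on that network
docker network inspect --verbose my_overlay_network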
These are the fixes that went in to address this:
- https://github.com/docker/libnetwork/pull/1796
- https://github.com/docker/libnetwork/pull/1792
- https://github.com/docker/libnetwork/pull/1808
Please note that 17.06 is not GA yet, and there can be additional changes to the RC before GA. It's OK to try it out in a test environment, though.
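If you want to try the RC on Ubuntu, pre-GA builds are normally published in the test channel of the Docker apt repository (the channel layout here is from memory, so double-check the install docs for 17.06):

# add the 'test' channel for Ubuntu 16.04 (xenial), then install the RC build
sudo add-apt-repository \
  "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial test"
sudo apt-get update && sudo apt-get install docker-ce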