moby: Docker swarm DNS periodically fails
In a 3-master swarm, every few days one of the nodes stops resolving service names through the internal swarm DNS. The swarm operates fine for several days, then DNS simply stops working on that node. I don't yet know whether a specific change on our side triggers the issue; our system does automated deploys, and we haven't yet correlated any aspect of those deploys with when the issue is triggered.
Steps to reproduce the issue:
- Run a 3-master docker swarm in EC2 for several days
- Sometimes after 2-4 days, one of the nodes can no longer resolve DNS names for services running elsewhere in the swarm, while services running locally on that node still resolve.
- There is no step 3.
Describe the results you received: We run nginx inside our swarm as a reverse HTTP proxy to our various services. We know that DNS isn't working because we have disabled nginx DNS caching, and nginx suddenly stops resolving the IP of the service where our application is running. When I run nslookup inside containers on the affected node, they fail to resolve any of the services running on other nodes in the swarm, but they do resolve services running on the same node.
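For reference, this is roughly the check we run from a container attached to the overlay network; the container and service names are placeholders, not our actual names:

# from a container on the affected node, look up a service running on another node
docker exec -it <container-on-affected-node> nslookup <service-name>

# compare: a service that also has a task on the same node resolves fine
docker exec -it <container-on-affected-node> nslookup <local-service-name>

# tasks.<service-name> resolves the individual task IPs instead of the service VIP,
# which helps distinguish a missing VIP entry from missing task entries
docker exec -it <container-on-affected-node> nslookup tasks.<service-name>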
Describe the results you expected: For name resolution to continue working.
Additional information you deem important (e.g. issue happens only occasionally): The issue only happens occasionally. It is resolved by restarting the docker daemon on the node that can no longer resolve DNS.
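For completeness, this is the restart we do on the affected node (our hosts are systemd-based Ubuntu 16.04; since Live Restore is disabled, this also restarts the containers on that node):

# restart the Docker daemon on the node that lost DNS resolution
sudo systemctl restart docker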
Output of docker version:
Client:
Version: 17.05.0-ce
API version: 1.29
Go version: go1.7.5
Git commit: 89658be
Built: Thu May 4 22:10:54 2017
OS/Arch: linux/amd64
Server:
Version: 17.05.0-ce
API version: 1.29 (minimum version 1.12)
Go version: go1.7.5
Git commit: 89658be
Built: Thu May 4 22:10:54 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 19
Running: 12
Paused: 0
Stopped: 7
Images: 24
Server Version: 17.05.0-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 195
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: 6mugebxyus7dgoip9i165mj64
Is Manager: true
ClusterID: mn6l9qnshdxzzxfoxwpsa18xe
Managers: 3
Nodes: 3
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 10.0.46.77
Manager Addresses:
10.0.101.134:2377
10.0.109.151:2377
10.0.46.77:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-57-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.674GiB
Name: ip-10-0-46-77
ID: XMND:RWXY:RGVB:F4BK:TDA3:LQ3W:URQH:T6ES:HLQE:74A7:FQG4:RIYY
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: authentiseautomation
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional environment details (AWS, VirtualBox, physical, etc.): AWS EC2 instances
About this issue
- State: closed
- Created 7 years ago
- Comments: 23 (5 by maintainers)
@EliRibble In swarm mode, service discovery information is maintained at a per-network, per-service level. A swarm overlay network is created on the manager, but the actual kernel data-path instantiation happens at the node level, when a task is scheduled on that node and attached to the network. That network is cleaned up from the kernel data path when the last task on the node goes down (the network still remains at the swarm manager level). The same happens for the service information, i.e. when there are no more tasks on a given network, all the service info associated with that network is cleaned up.
In a scenario where tasks are going up and down on a network on a given node, there can be races between the creation and deletion of the service and network structures, depending on how quickly that happens. Some of these races weren't handled correctly, leading to inconsistent state. This glosses over the gory details, but that's the gist of it.
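One way to see what a given node currently knows is to inspect the overlay network on that node with the verbose flag, which dumps the node-local service discovery entries. Note the --verbose flag is fairly new and may require 17.06+, and the network name below is just an example:

# on the affected node, find the overlay network the services are attached to
docker network ls --filter driver=overlay

# dump this node's view of the services/tasks on that network
docker network inspect --verbose my_overlay_network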
These are the fixes that went in to address this:
- https://github.com/docker/libnetwork/pull/1796
- https://github.com/docker/libnetwork/pull/1792
- https://github.com/docker/libnetwork/pull/1808
Please note that 17.06 is not GA yet, and there can be additional changes to the RC before GA. It's OK to try it out in a test environment, though.
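If you want to try the RC on Ubuntu, pre-GA builds are normally published in the test channel of the Docker apt repository (the channel layout here is from memory, so double-check the install docs for 17.06):

# add the 'test' channel for Ubuntu 16.04 (xenial), then install the RC build
sudo add-apt-repository \
  "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial test"
sudo apt-get update && sudo apt-get install docker-ce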