moby: Unable to get DNS resolution on swarm overlay network
Description
A swarm cluster on our internal network uses an internal DNS server at 10.0.0.20. NOTE: the DNS server is not on the same network as the swarm cluster.
A swarm overlay network is using 10.0.0.0/24, a range that includes the DNS server's address. Containers attached to this network can't get DNS resolution.
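For reference, the overlap can be confirmed by inspecting the overlay network's IPAM configuration (the network name my_overlay and the gateway shown here are placeholders, the output is illustrative):
# docker network inspect my_overlay --format '{{json .IPAM.Config}}'
[{"Subnet":"10.0.0.0/24","Gateway":"10.0.0.1"}]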
/etc/resolv.conf on the HOST:
# Generated by NetworkManager
nameserver 10.0.0.20
/etc/resolv.conf in the container:
nameserver 127.0.0.11
options ndots:0
ip addr in the container:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
119: eth0@if120: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
link/ether 02:42:0a:00:00:05 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.5/24 scope global eth0
valid_lft forever preferred_lft forever
inet 10.0.0.4/32 scope global eth0
valid_lft forever preferred_lft forever
121: eth1@if122: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
link/ether 02:42:ac:12:00:07 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.7/16 scope global eth1
valid_lft forever preferred_lft forever
123: eth2@if124: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
link/ether 02:42:0a:ff:00:0d brd ff:ff:ff:ff:ff:ff
inet 10.255.0.13/16 scope global eth2
valid_lft forever preferred_lft forever
inet 10.255.0.12/32 scope global eth2
valid_lft forever preferred_lft forever
Anything that tries to resolve a hostname fails:
# ping google.com
ping: bad address 'google.com'
I’m seeing errors from dockerd like this:
Sep 13 16:52:21 swarm-demo1 dockerd: time="2017-09-13T16:52:21.473125855-04:00" level=error msg="Reprogramming on L3 miss failed for 10.0.0.20, no peer entry"
When I switch to a different internal DNS server that is not on the 10.0.0.0/24 network, resolution works normally:
# ping google.com
PING google.com (172.217.7.238): 56 data bytes
64 bytes from 172.217.7.238: seq=0 ttl=55 time=22.758 ms
64 bytes from 172.217.7.238: seq=1 ttl=55 time=9.352 ms
64 bytes from 172.217.7.238: seq=2 ttl=55 time=15.719 ms
^C
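For completeness, a service can also be pointed explicitly at a DNS server outside the overlay subnet with the --dns option on docker service create (the address, network name, and image below are placeholders):
# docker service create --name dns-test --network my_overlay --dns 10.1.0.53 alpine sleep 1d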
Steps to reproduce the issue: I've done this on several internal swarms running different versions of Docker.
Describe the results you received: No DNS resolution
Describe the results you expected: Working DNS resolution from containers attached to the overlay network.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Client:
Version: 17.06.1-ce
API version: 1.30
Go version: go1.8.3
Git commit: 874a737
Built: Thu Aug 17 22:53:49 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.1-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: 874a737
Built: Thu Aug 17 23:01:50 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
[root@swarm-demo1 compose-files]# docker info
Containers: 7
Running: 7
Paused: 0
Stopped: 0
Images: 6
Server Version: 17.06.1-ce
Storage Driver: devicemapper
Pool Name: docker-thinpool
Pool Blocksize: 524.3kB
Base Device Size: 10.74GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 4.098GB
Data Space Total: 102GB
Data Space Available: 97.91GB
Metadata Space Used: 1.552MB
Metadata Space Total: 1.074GB
Metadata Space Available: 1.072GB
Thin Pool Minimum Free Space: 10.2GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: dej1htlbh6dultjpa5wt1ao9r
Is Manager: true
ClusterID: j666ldxnl401vc8et51c5ti60
Managers: 3
Nodes: 6
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Root Rotation In Progress: false
Node Address: 10.2.252.121
Manager Addresses:
10.2.252.121:2377
10.2.252.122:2377
10.2.252.123:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-514.26.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 6
Total Memory: 7.781GiB
Name: swarm-demo1
ID: KHSR:U7MT:TRP2:U3EZ:OFD5:VTVD:NKUV:SOYD:3LS5:KCKF:5ZNC:3ANG
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
xxx.local:5000
127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.):
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 6
- Comments: 17 (4 by maintainers)
Yup, that was indeed the case for me.
My internal VPC has a CIDR of 10.0.0.0/16, whereas my default ingress network has a subnet of 10.0.0.0/24. Creating a new ingress network with a non-conflicting subnet resolved it. The following are the steps I took:
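(A rough sketch of the procedure; the 10.11.0.0/24 subnet is a placeholder for any range that doesn't conflict with your environment, and services publishing ports on the old ingress network need to be removed or updated before it can be deleted:)
# docker network rm ingress
# docker network create --driver overlay --ingress --subnet 10.11.0.0/24 --gateway 10.11.0.1 ingress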
Thanks for the save @thaJeztah! 🐳
I suspect this may be because the overlay network's IP range overlaps with the physical network in your environment. On older kernels (such as those used on RHEL/CentOS distributions), networks are not namespaced, which causes issues; can you try creating the overlay network and specifying the ip-range for the network when creating it (as explained here: https://docs.docker.com/engine/swarm/networking/#customize-an-overlay-network)?
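For example, something along these lines (the subnet and ip-range values are placeholders; pick ranges that don't overlap your physical network):
# docker network create --driver overlay --subnet 172.28.0.0/24 --ip-range 172.28.0.0/25 my-overlay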
Faced a similar problem: DNS resolution works well, but connections timed out. Temporarily solved it with docker swarm leave and docker swarm join --token xxxxxx. I think it's time for me to migrate to Kubernetes.

@thaJeztah I experience a similar issue, however not when I'm starting the containers. The cluster starts as expected with all containers, then all of a sudden has an outage.
This is the log:
The cluster can be stable for a couple of days, then fail for no apparent reason.
I'm running an nginx container in host mode; for some reason it stops abnormally and can't start any more. The only way to fix it is to restart the Docker daemon.