swarmkit: Service is not DNS resolvable from another one if containers run on different nodes
I have two services running a single container each, on different nodes, using same “overlay” network. When I try to ping one container from inside the other via service name, it fails:
ping akka-test
ping: bad address 'akka-test'
After I scaled the akka-test service so that one of its containers runs on the node where the other service's container is running, everything suddenly started to work.
So my question is: is my assumption valid that services should be discoverable across the entire Swarm? That is, the name of a service should be DNS-resolvable from any other container in this Swarm, no matter where the containers are running.
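For reference, that expectation matches how overlay networking is meant to work: each service on an overlay network gets a virtual IP (VIP), and Docker's embedded DNS should resolve the service name to that VIP from any container on the same network, on any node. A minimal sketch of the check follows; the docker commands need a live multi-node swarm, so they are left as comments, and the image and container-id are placeholders:

```shell
# Sketch only: the docker commands below require a live multi-node swarm,
# so they are shown as comments. Names mirror the services above.
#
#   docker network create --driver overlay net2
#   docker service create --name akka-test   --replicas 8 --network net2 <image>
#   docker service create --name akka-test-2 --replicas 1 --network net2 <image>
#
# From inside any task of akka-test-2, on any node, the peer service name
# should resolve:
#
#   docker exec -it <container_id> nslookup akka-test
#
# Resolution is served by Docker's embedded DNS server, which always
# listens inside the container at a fixed address:
embedded_dns="127.0.0.11"
echo "container resolver: ${embedded_dns}:53"
```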
$ docker network ls
NETWORK ID     NAME              DRIVER    SCOPE
255fedab2fc4   bridge            bridge    local
9a450f033c48   docker_gwbridge   bridge    local
6e76844033f8   host              host      local
dzwgdein8cxa   ingress           overlay   swarm
54uqc60vx1i5   net2              overlay   swarm
d632a42ef140   none              null      local
$ docker service ls
ID             NAME          REPLICAS   IMAGE                      COMMAND
0wyv4gq14mnu   akka-test     8/8        xxxx:5000/akkahttp1:1.20
cg7r4ius7xfm   akka-test-2   1/1        xxxx:5000/akkahttp1:1.20
$ docker service inspect --pretty akka-test
ID:           0wyv4gq14mnuj8kfolizh1h23
Name:         akka-test
Mode:         Replicated
 Replicas:    8
Placement:
UpdateConfig:
 Parallelism: 1
 On failure:  pause
ContainerSpec:
 Image:       xxxx:5000/akkahttp1:1.20
Resources:
Networks: 54uqc60vx1i57d3qnmhza82c4
$ docker service inspect --pretty akka-test-2
ID:           cg7r4ius7xfmgvazmptvarn2k
Name:         akka-test-2
Mode:         Replicated
 Replicas:    1
Placement:
UpdateConfig:
 Parallelism: 1
 On failure:  pause
ContainerSpec:
 Image:       xxxx:5000/akkahttp1:1.20
Resources:
Networks: 54uqc60vx1i57d3qnmhza82c4
$ docker info
Containers: 75
Running: 11
Paused: 0
Stopped: 64
Images: 42
Server Version: 1.12.1-rc1
Storage Driver: devicemapper
Pool Name: docker-253:0-135409124-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 8.291 GB
Data Space Total: 107.4 GB
Data Space Available: 40.86 GB
Metadata Space Used: 19.61 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.128 GB
Thin Pool Minimum Free Space: 10.74 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.107-RHEL7 (2016-06-09)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: null overlay host bridge
Swarm: active
NodeID: ao1wz862t6n4yog4hpi4yqm20
Is Manager: true
ClusterID: 3hpbbe2jtdoqe1zvxs41cycoq
Managers: 3
Nodes: 4
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: xxxx
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.28.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 56
Total Memory: 188.6 GiB
Name: xxxx
ID: OWEH:OIIR:7NZ6:IKZV:RFJ4:NXAZ:NH7H:WPLC:D457:DKGN:CH2C:E2UE
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
127.0.0.0/8
About this issue
- Original URL
- State: open
- Created 8 years ago
- Reactions: 26
- Comments: 44
I got it working now. Here are some insights that may help others:
My main mistake was running on Windows and not specifying --advertise-addr, since I thought the master's IP address was already set correctly by the generated join-token command. But it's important to specify the worker node's IP on join as well!
I hope that helps someone. Most of the stuff is mentioned in the documentation and here in the comments. But only the combination of the mentioned points worked for me.
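In other words, the fix amounts to pinning the advertise address explicitly on both init and join. A sketch with placeholder addresses (192.0.2.x is documentation address space; real engines are required, so the commands are only printed):

```shell
# Placeholders: 192.0.2.10 = manager, 192.0.2.20 = worker.
manager_ip="192.0.2.10"
worker_ip="192.0.2.20"
echo "manager: docker swarm init --advertise-addr ${manager_ip}"
echo "worker:  docker swarm join --advertise-addr ${worker_ip} --token <worker-token> ${manager_ip}:2377"
```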
BTW: I've tested this with docker-compose v3.3 syntax and deployed it via docker stack deploy with the default overlay network. The kernel was Ubuntu 16.04 LTS, 4.4.0-93-generic.

So I'm still stumped by this…
Given three AWS nodes running in a private VPC subnet, with the security group set to allow all traffic in and out on all ports, both UDP and TCP, on the subnet 11.0.0.0/8, I still cannot obtain the IP addresses of services running on other nodes in the docker swarm. Any service running on a node in the swarm can get the IP address of services running on the same node.
How to reproduce…
1 - Create an attachable network (docker-compose version 3 files still don't support attachable overlay networks).
2 - Start the following docker-compose file with docker stack deploy --compose-file=docker-compose.yml alpha. This is a stripped-down sample that creates a consul cluster; I've left out some of the other services from the compose file.
The above fails, since the container running consul1 cannot resolve consul2 and consul3 into IP addresses.
And if I manually attach to the container for consul1 with docker exec -it <container_id> /bin/sh, I can nslookup services running on the same node, but not services running on a different node. (userdb in the list above is another service in the compose file, left out for brevity's sake.)
I can reach the name server at 127.0.0.11 just fine inside the container for consul1, but it seems as if IP addresses for services running on other nodes aren't getting synchronized across the swarm network.

Do you have TCP port 7946 open on your hosts? Gossip needs that port open for networking to work correctly.
In my case, it turned out to be related to ports. It worked once the following ports were accessible between the nodes:
Reference: https://docs.docker.com/engine/swarm/swarm-tutorial/#open-protocols-and-ports-between-the-hosts
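Per that tutorial, the ports in question are TCP 2377 (cluster management), TCP and UDP 7946 (container network discovery/gossip), and UDP 4789 (overlay/VXLAN data traffic). A hedged sketch for probing a peer node follows; PEER is a placeholder, and the nc commands are printed rather than executed so the sketch works without a live cluster (UDP probes with nc are best-effort anyway):

```shell
# Swarm ports, per the tutorial linked above.
tcp_ports="2377 7946"
udp_ports="7946 4789"
for p in $tcp_ports; do
  echo "nc -z  -w2 \$PEER $p   # TCP"
done
for p in $udp_ports; do
  echo "nc -zu -w2 \$PEER $p   # UDP (best-effort check)"
done
```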
Looking at the same issue in my setup, but I’m noticing something odd. I have 1 manager and 8 worker nodes. 5 of my 8 worker nodes fail to resolve the service name over the overlay network. The other 3 have no issue in doing so. No matter what service I launch or how I launch it, so long as it’s connected to the correct overlay network, the same 3 have no issue resolving by service name.
I have absolutely no idea why the other 5 nodes in my swarm continue to have problems. I've tried the quick fixes listed in this thread to no avail. Each of my worker nodes is an identical copy of the others.
--advertise-addr was the silver bullet for us. The documentation says you can use this switch with a NIC name, like --advertise-addr eth0:2377, where eth0 is address-independent and fits our requirement of nodes with dynamic IP addresses. See the --advertise-addr value in the docs.
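For example (a sketch; the interface name and port mirror the comment above):

```shell
# Advertise by interface name, so a node with a dynamic IP keeps
# advertising whatever address eth0 currently holds:
iface_spec="eth0:2377"
echo "docker swarm init --advertise-addr ${iface_spec}"
```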
Almost googled my ass off to finally find this valuable suggestion. This Windows platform issue really bugs me.
I’m seeing this too. I’m using Docker for AWS and this has happened both on beta4 and now on beta5. Service names are sometimes unresolvable, sometimes resolvable but no route to host. It also works sometimes. I’ve been so far unable to reliably reproduce from scratch.