moby: Service network alias not working for services created using stack deploy on 1.13
Description
When creating a stack using docker stack deploy --compose-file=./docker-compose.yml
I am not able to resolve the alias as a hostname from within the network.
Docker compose file:
version: "3"
services:
elasticsearch:
image: elasticsearch:2
command: elasticsearch -Des.network.host=0.0.0.0
networks:
default:
aliases:
- elasticsearch.service.acme
deploy:
replicas: 1
placement:
constraints:
- node.labels.role.tools==true
healthcheck:
test: "/usr/bin/curl --silent --fail -o /dev/null -w %{http_code} http://localhost:9200 || exit 1"
interval: 30s
timeout: 10s
retries: 3
volumes:
- elasticsearch:/usr/share/elasticsearch/data
logstash:
image: logstash:2
command: logstash -f /etc/logstash/conf.d/logstash.conf
networks:
default:
aliases:
- logstash.service.acme
depends_on:
- elasticsearch
ports:
- "12001:12001/udp"
deploy:
mode: global
placement:
constraints:
- node.labels.role.tools==true
healthcheck:
test: "/bin/nc -z -w3 localhost 5000 || exit 1"
interval: 30s
timeout: 10s
retries: 3
networks:
default:
driver: overlay
volumes:
elasticsearch:
driver: local
Service inspect output:
[
    {
        "ID": "1wrccq7kgzn6ztjgwonuyirfk",
        "Version": {
            "Index": 5621
        },
        "CreatedAt": "2017-01-20T13:00:09.007260649Z",
        "UpdatedAt": "2017-01-20T13:00:09.008047248Z",
        "Spec": {
            "Name": "test_elasticsearch",
            "Labels": {
                "com.docker.stack.namespace": "test"
            },
            "TaskTemplate": {
                "ContainerSpec": {
                    "Image": "elasticsearch:2@sha256:0b0b493c0ad7c0af88bcecc2060f8e16feea1713c908570ab9872fe8ada919ca",
                    "Labels": {
                        "com.docker.stack.namespace": "test"
                    },
                    "Args": [
                        "elasticsearch",
                        "-Des.network.host=0.0.0.0"
                    ],
                    "Mounts": [
                        {
                            "Type": "volume",
                            "Source": "test_elasticsearch",
                            "Target": "/usr/share/elasticsearch/data",
                            "VolumeOptions": {
                                "Labels": {
                                    "com.docker.stack.namespace": "test"
                                },
                                "DriverConfig": {
                                    "Name": "local"
                                }
                            }
                        }
                    ],
                    "Healthcheck": {
                        "Test": [
                            "CMD-SHELL",
                            "/usr/bin/curl --silent --fail -o /dev/null -w %{http_code} http://localhost:9200 || exit 1"
                        ],
                        "Interval": 30000000000,
                        "Timeout": 10000000000,
                        "Retries": 3
                    }
                },
                "Resources": {},
                "Placement": {
                    "Constraints": [
                        "node.labels.role.tools==true"
                    ]
                },
                "ForceUpdate": 0
            },
            "Mode": {
                "Replicated": {
                    "Replicas": 1
                }
            },
            "Networks": [
                {
                    "Target": "wmto6bxovgfv3fuxz5r2ql862",
                    "Aliases": [
                        "elasticsearch.service.acme",
                        "elasticsearch"
                    ]
                }
            ],
            "EndpointSpec": {
                "Mode": "vip"
            }
        },
        "Endpoint": {
            "Spec": {
                "Mode": "vip"
            },
            "VirtualIPs": [
                {
                    "NetworkID": "wmto6bxovgfv3fuxz5r2ql862",
                    "Addr": "10.0.1.13/24"
                }
            ]
        },
        "UpdateStatus": {
            "StartedAt": "0001-01-01T00:00:00Z",
            "CompletedAt": "0001-01-01T00:00:00Z"
        }
    }
]
Steps to reproduce the issue:
- Run the stack from a docker-compose file as above (assigning a network alias)
- Use ping to check hostnames
Describe the results you received:
root@logstash:/# ping test_elasticsearch
PING test_elasticsearch (10.0.1.15): 56 data bytes
64 bytes from 10.0.1.15: icmp_seq=0 ttl=64 time=0.142 ms
64 bytes from 10.0.1.15: icmp_seq=1 ttl=64 time=0.189 ms
64 bytes from 10.0.1.15: icmp_seq=2 ttl=64 time=0.105 ms
root@logstash:/# ping elasticsearch
ping: unknown host
root@logstash:/# ping elasticsearch.service.acme
ping: unknown host
Describe the results you expected:
All three of the above DNS names to resolve.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Client:
 Version: 1.13.0
 API version: 1.25
 Go version: go1.7.3
 Git commit: 49bf474
 Built: Tue Jan 17 10:05:19 2017
 OS/Arch: linux/amd64

Server:
 Version: 1.13.0
 API version: 1.25 (minimum version 1.12)
 Go version: go1.7.3
 Git commit: 49bf474
 Built: Wed Jan 18 16:20:26 2017
 OS/Arch: linux/amd64
 Experimental: false
Output of docker info (from the manager):
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 10
Server Version: 1.13.0
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 10
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: nf60k2co676qt6s8rykklvqv5
 Is Manager: true
 ClusterID: mae52778wpstynpij553uji54
 Managers: 1
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 192.168.99.100
 Manager Addresses:
  192.168.99.100:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.4.43-boot2docker
Operating System: Boot2Docker 1.13.0 (TCL 7.2); HEAD : 5b8d9cb - Wed Jan 18 18:50:40 UTC 2017
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 995.8 MiB
Name: swnode1
ID: S5C6:YG6T:SBP3:HS3K:5T7P:HHT4:DABN:3QVO:JPXH:6Z6G:NKKA:DC2Q
Docker Root Dir: /mnt/sda1/var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 43
 Goroutines: 187
 System Time: 2017-01-20T13:14:04.088855925Z
 EventsListeners: 0
Username: byrnedo
Registry: https://index.docker.io/v1/
Labels:
 provider=virtualbox
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.):
Using the docker-machine VirtualBox driver and swarm mode.
About this issue
- Original URL
- State: open
- Created 7 years ago
- Reactions: 17
- Comments: 36 (6 by maintainers)
Not sure if this is the right place to mention my problem, because there are a lot of related issues (https://github.com/docker/swarmkit/issues/1242, https://github.com/docker/docker/issues/27080, https://github.com/docker/docker/issues/30447, and so on). But it would be nice if there were some way of reaching a container (task) from another service on the same network ON THE SAME NODE. Use case: running a log aggregator on each node as a global service; all other services would then not need to use the routing mesh to talk to it, which would avoid substantial network traffic.
I think it would be very elegant and powerful to allow templates in alias names:
then in other services we could provide an environment variable pointing to the DNS name of that service in exactly the same form (“{{.Node.ID}}-serviceName”). This would also solve a lot of other requirements regarding DNS. Everybody has slightly different preferences, but this way most of them would be fulfilled.
I understand that it makes it harder to distinguish whether an alias refers to a task or a service, but it would be powerful :)
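A rough sketch of what the proposal could look like on the CLI (hypothetical: current releases do not expand templates in network aliases, although swarm does accept similar Go-template placeholders for --hostname; the network, service, and image names here are made up):

# Hypothetical syntax only: alias templating is the feature being proposed;
# it is NOT supported by current Docker releases (the alias would be stored literally).
docker service create \
  --name log-aggregator \
  --mode global \
  --network name=monitoring,alias="{{.Node.ID}}-log-aggregator" \
  example/aggregator:latest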
Still not working on version 19.03.5
Still seeing this on 18.09 when adding a network with an alias via docker service update
- It’s not an issue on docker service create
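For comparison, a minimal sketch of the two paths being compared (the network, alias, service, and image names are placeholders); both the extended --network syntax on create and --network-add on update are standard CLI forms:

# Alias attached at creation time (reported to resolve fine):
docker service create --name web \
  --network name=my-overlay,alias=web.service.acme \
  nginx:alpine

# Alias attached afterwards via update (where the resolution problem was reported):
docker service update \
  --network-add name=my-overlay,alias=web.service.acme \
  web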
I can confirm that the following format works and has resolved a big DNS pain point for me.
So, in case you are deploying the stack as follows:
docker stack deploy -c network-alias.yml webstack
your service should then be resolvable with the following DNS names:
In case it is useful: what I am currently seeing, even on a swarm with a single manager node (Windows swarm, running on Server 2016 1709 build), is that:
This is frustrating in my current development environment, but is absolutely unworkable for a production environment, which is where I am hoping to be heading soon.
hey @theecodedragon @byrnedo, just tried here and having
and issuing
you can inspect:
docker inspect teststack_webService
where you can see:
you can then exec into one of the containers and check:
does that fit what you expect?
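If it helps to reproduce that check, a minimal sketch (the container ID is a placeholder, and nslookup or getent must be available in the image) of testing resolution from inside one of the tasks:

# Find a task container of the service on this node, then test DNS from inside it:
docker ps --filter name=teststack_webService --format '{{.ID}}'
docker exec -it <container-id> nslookup teststack_webService
docker exec -it <container-id> nslookup tasks.teststack_webService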
I had the same problem where services were not able to communicate when deployed across nodes.
It was due to a firewall blocking the communication between nodes.
Make sure you enable the following ports when running multi-node Docker services:
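The commenter's port list is not shown above; for reference, the ports the Docker swarm-mode documentation calls out are 2377/tcp (cluster management), 7946/tcp and 7946/udp (node-to-node communication), and 4789/udp (overlay network traffic). A sketch assuming a ufw-managed host:

# Open the swarm-mode ports between nodes (ufw shown only as an example firewall):
ufw allow 2377/tcp    # cluster management traffic to managers
ufw allow 7946/tcp    # container network discovery
ufw allow 7946/udp
ufw allow 4789/udp    # overlay network (VXLAN) data path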
Joined a swarm and left a swarm and since then the hostnames aren’t working. Version: 20.10.2
I figured out what it is… the ym_default network isn’t cleaned up on my swarm nodes after I disband the swarm. So, my journey:
I upgraded to 17.12.1-CE and nothing… the problem was still happening.
Then I got mad. 😃
Without elaborating how, because I have a fair bit of automation controlling my stack deploy, I added code to “remove the ingress network from all the nodes of the swarm (and the manager) when the swarm is disbanded”.
Now at this point I’m running my automation and I see the swarm get disbanded and then eventually the stack is rm’d… but since I’m making sure the ingress gets removed, I basically started watching all the nodes in the swarm, and what I noticed is that removing the ingress network wasn’t enough: the nodes that didn’t make it into the peer list on the manager were the ones that still had a “ym_default” network on them, EVEN THOUGH I had PREVIOUSLY run “docker swarm leave --force” and “docker stack rm ym” (and had NOT YET issued the next “docker stack deploy --with-registry-auth -c /tmp/docker-compose-stack.yml ym”).
So I updated my automation to include the removal of “ym_default”, and so far I’ve been cycling through deployments of 2 different builds and things haven’t been failing. Removing the “ym_default” and “ingress” networks on all the nodes/manager before the next deploy seems to get the peer list generated correctly when I eventually issue the next stack deploy.
Obviously I will do more testing, because there are a few cases here (deploying to a new set of VMs that never had a swarm on them vs. deploying to the same set of VMs that I removed a swarm from). But that’s what I see when things aren’t working: the old “default” network from the last stack deploy isn’t being cleaned up, even though I forced swarm removal on every node and manager in the swarm and did a stack rm. If a swarm existed, I have to clean up the previous stack’s default network off the swarm nodes and manager, even though “docker swarm leave --force” and “docker stack rm” were run on all of them BEFORE issuing the next “stack deploy ym” command.
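A rough sketch of the cleanup sequence described above, assuming a stack named ym (as in the comment) and that the network removals run on every node as well as the manager; the exact automation is not shown in the comment:

# On the manager: remove the stack, then have every node (and finally the manager) leave:
docker stack rm ym
docker swarm leave --force

# On each node and the manager, make sure the stale networks are gone
# before the next swarm init / stack deploy:
docker network rm ym_default
docker network rm ingress    # prompts for confirmation; pipe "y" to it in automation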
I’m noticing that the list of Peers isn’t always established correctly when pushing my stack from a compose file. Let me walk you through it:
After creating a swarm I run this on the manager:
docker stack deploy --with-registry-auth -c /tmp/docker-compose-stack.yml ym (this file contains a service called “pipeline-app”)
and then on that same manager I run:
docker network inspect ym_default
This shows a list of Peers, and when things go correctly all my swarm nodes (the manager and my workers) show up in that peer list. When my service (pipeline-app) has trouble, I notice that the address of the swarm node trying to run pipeline-app is NOT in that peer list!
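A quick way to watch that peer list, assuming the stack network is ym_default as above (the Peers field is only populated on nodes where the overlay network is actually in use):

# Print just the peers this node's copy of the overlay network knows about:
docker network inspect ym_default --format '{{json .Peers}}'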
I’m currently updating my compose file to include ->
for every service in it.
Strange… it all came up the first time in 1m:11s, and then the next time (and I did a “docker stack rm ym” before calling “stack deploy”) the peer list didn’t include the swarm node for 5m:29s (and while it didn’t, my pipeline-app service wasn’t too happy).
And now I noticed this issue in Docker: https://github.com/moby/moby/pull/36003, and I’m adding “docker network rm ingress” on each node in the swarm (to be issued after each node has left it).
Yeah, I can’t put my finger on it, but I added the ingress cleanup. Now it’s at the point where my swarm and services are coming up every time, ALTHOUGH sometimes the peer list is created correctly within 2 minutes and sometimes it takes up to 6 minutes for the swarm to sort things out.
I think I’m running into the same issue. I have a stack with a swarm manager and two more swarm workers (so 3 VMs, each running Docker 17.12). The first docker host/swarm worker has a label such that my service gets put on that named host, and the other docker host/swarm worker has a label of mongo and runs mongo:latest. The last VM is the manager that I perform the stack deploy on. My point in explaining my setup is that every piece of my swarm is on a different worker/VM/docker host.
Now to my point about the weirdness I’m seeing with networking: sometimes the container where my service is running can ping the container running mongo (which, again, is in the same swarm but on a different docker host/VM/swarm worker), AND THEN SOMETIMES, after swarm destruction and re-setup of the swarm, the service can’t ping mongo. So something clearly happens with the network on redeploys of the stack. The behavior isn’t consistent; sometimes the ‘recreate’ works. I’m using 17.12.0-ce, build c97c6d6 on all the nodes of my swarm.
I’m off to compare the network differences as we speak, but clearly sometimes the network doesn’t get initialized correctly.
Here is another test that shows that DNS resolution just seems to be very unreliable.