moby: Overlay / ingress network routing breaks on service restart

Description

Overlay / ingress network routing breaks on service restart. I was able to reproduce this in Docker swarm mode on a 5-node swarm and on a separate 3-node swarm, both running Ubuntu 16.04 and Docker 1.12.1.

Steps to reproduce the issue:

When a service is deployed on a port for the first time, it is reachable from every node in the swarm:

docker service create -p 84:80 --name httpd4 httpd

root@swarm2 ~ # curl http://swarm1:84
<html><body><h1>It works!</h1></body></html>

root@swarm2 ~ # curl http://swarm5:84
<html><body><h1>It works!</h1></body></html>

(works as expected)
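Because the ingress routing mesh should answer on every node, not only on the node running the task, it can help to probe all nodes in one pass instead of running curl by hand against each one. The following is a minimal bash sketch. The function name `probe_published_port`, the 5-second timeout and the OK/FAIL output format are illustrative assumptions, not part of Docker.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Docker): probe a published port on each
# swarm node and report which nodes answer over HTTP.
#
# Usage: probe_published_port PORT HOST [HOST...]
# Prints "HOST:PORT OK" or "HOST:PORT FAIL" per host.
# Returns 0 if every host answered, 1 otherwise.
probe_published_port() {
  local port="$1"
  shift
  local host status=0
  for host in "$@"; do
    # --max-time bounds the wait so a silently dropped connection
    # (the "Connection timed out" case) does not hang the loop.
    # --noproxy '*' forces a direct connection to each node.
    # --fail makes HTTP error responses count as failures.
    if curl --silent --fail --noproxy '*' --output /dev/null \
        --max-time 5 "http://${host}:${port}/"; then
      echo "${host}:${port} OK"
    else
      echo "${host}:${port} FAIL"
      status=1
    fi
  done
  return "$status"
}
```

For example, `probe_published_port 84 swarm1 swarm2 swarm3 swarm4 swarm5` should print OK for all five nodes right after the first deployment.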

However, if I then remove the service and re-create it on the same port:

root@swarm2 ~ # docker service rm httpd4
root@swarm2 ~ # docker service create -p 84:80 --name httpd4 httpd

root@swarm2 ~ # docker service ps httpd4
ID                         NAME      IMAGE  NODE              DESIRED STATE  CURRENT STATE               ERROR
6avk26h1ks9xs7ovywas0klta  httpd4.1  httpd  swarm5            Running        Running about a minute ago

The new task runs on swarm5, and port 84 now answers only on that node:

root@swarm2 ~ # curl http://swarm1:84
curl: (7) Failed to connect to swarm1 port 84: Connection timed out

root@swarm2 ~ # curl http://swarm5:84
<html><body><h1>It works!</h1></body></html>
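The remove/re-create cycle above can be replayed end to end with a sketch like the one below. It only uses the `docker service create`, `docker service rm` and `docker service ps` commands from this report; the function names `check_nodes` and `reproduce_restart_cycle` and the `SETTLE_SECONDS` variable are hypothetical, and the fixed sleep is a crude stand-in for properly waiting until the task is running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print OK/FAIL for HTTP on PORT at each HOST.
check_nodes() {
  local port="$1" host
  shift
  for host in "$@"; do
    if curl --silent --fail --noproxy '*' --output /dev/null \
        --max-time 5 "http://${host}:${port}/"; then
      echo "${host}:${port} OK"
    else
      echo "${host}:${port} FAIL"
    fi
  done
}

# Hypothetical driver replaying the report: create, probe, remove,
# re-create, probe again.
# Usage: reproduce_restart_cycle HOST [HOST...]
reproduce_restart_cycle() {
  local port=84 name=httpd4
  # SETTLE_SECONDS is an assumed, crude wait for the task to start and the
  # published port to be set up; adjust it for your cluster.
  local settle="${SETTLE_SECONDS:-30}"

  docker service create -p "${port}:80" --name "$name" httpd >/dev/null || return 1
  sleep "$settle"
  echo "== first deployment =="
  check_nodes "$port" "$@"

  docker service rm "$name" >/dev/null || return 1
  sleep "$settle"
  docker service create -p "${port}:80" --name "$name" httpd >/dev/null || return 1
  sleep "$settle"
  echo "== after rm + re-create =="
  check_nodes "$port" "$@"

  # Show which node the new task landed on.
  docker service ps "$name"
}
```

For example, `SETTLE_SECONDS=20 reproduce_restart_cycle swarm1 swarm2 swarm3 swarm4 swarm5`. On an affected cluster the second round would be expected to show FAIL for every node except the one running the task.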

Describe the results you received:

root@swarm2 ~ # curl http://swarm1:84
curl: (7) Failed to connect to swarm1 port 84: Connection timed out

Describe the results you expected:

root@swarm2 ~ # curl http://swarm1:84
<html><body><h1>It works!</h1></body></html>

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 3
 Running: 2
 Paused: 0
 Stopped: 1
Images: 5
Server Version: 1.12.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 43
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: overlay bridge null host
Swarm: active
 NodeID: dvj6xuikc39a6it98sywz7ckk
 Is Manager: true
 ClusterID: 389htp3t8ruwg4e1cjzg0pg3b
 Managers: 5
 Nodes: 5
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 138.201.138.24
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-36-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 62.75 GiB
Name: ...
ID: UPSO:UJ57:XQVJ:VU7M:GP7Q:XP5O:VPLR:FEYH:AXHX:ZNWC:WVAJ:UDOV
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.): 5 physical nodes (all managers) in Docker swarm mode. I also reproduced this on a second swarm of 3 other physical nodes. All nodes run Ubuntu 16.04 LTS.

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 10
  • Comments: 55 (25 by maintainers)

Most upvoted comments

@dangra #25962 brought the fix to master, but it may not be easy to backport that particular change onto 1.12.

@niau Yes, I know, but that is exactly what I want: httpd should be reachable on port 84 from the outside world. Even if I run curl from another PC using the fully qualified domain name, I get the same results, and the same happens in a web browser.

For example, http://swarm5.domain.com:84 works, but http://swarm1.domain.com:84 does not.

However, some nodes did reboot a while ago, so I assume this problem is related to #24496.