moby: Port publishing is broken in 17.05 Swarm Mode

Description

I have always used a command like this:

docker service create --name logstash --hostname logstash --mode replicated --endpoint-mode vip --with-registry-auth --log-driver=json-file --stop-grace-period=20s --restart-delay=20s  --network onenet --publish 12203:12203/tcp --publish 12203:12203/udp docker.elastic.co/logstash/logstash:5.3.0

… to start one logstash container and make it listen on port 12203 (tcp/udp) on all nodes. It always worked as expected: the dockerd process started listening on these ports on every node in the cluster, and I was able to send messages to localhost:12203.
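
(For reference, this is how I check for the listener on a node; a quick sketch, assuming netstat from net-tools or ss from iproute2 is available:)

# with ingress publishing, the listener should belong to dockerd, not docker-proxy
netstat -ltnp | grep 12203
ss -ltnp | grep 12203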

But after upgrading from 17.04 to 17.05, this feature is totally broken. After running this command, port 12203 stays closed, and I don’t see any dockerd process listening on it. I’ve tried different variations:

  • 12203:12203
  • 12203:12203/tcp
  • 12203:12203/udp
  • mode=ingress,target=12203,published=12203,protocol=tcp
  • mode=ingress,target=12203,published=12203,protocol=udp

I’ve also tried renaming the container and changing the port number (earlier I had a problem where a port stayed closed because of a stale record in the Docker key-value store that made it think the port was already in use) - no success. I don’t even see any error messages in the logs.

docker service ls still reports that everything is fine:

# docker service ls | grep logs
aibwagzfoq8k        logstash                             replicated          1/1                 docker.elastic.co/logstash/logstash:5.3.0                                               *:12203->12203/tcp,*:12203->12203/udp
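
The endpoint record can be cross-checked as well; a small sketch, assuming the service name from above:

# prints the published/target ports Swarm believes are active for the service
docker service inspect logstash --format '{{json .Endpoint.Ports}}'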

Also, when I change the command to mode=host,target=12203,published=12203,protocol=tcp, I see docker-proxy start and listen on the port:

# netstat -ltnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 :::12203                :::*                    LISTEN      26462/docker-proxy

It looks like I now need to create a global service and run it on all nodes in the cluster. But I want the old behavior back, where one container received packets from all nodes via dockerd and the ingress network.
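
For the record, that fallback would look roughly like this (an untested sketch, combining --mode global with the host-mode publish that does still work):

# runs one task per node; each task binds 12203 directly on its own node
docker service create --name logstash --mode global \
  --network onenet \
  --publish mode=host,target=12203,published=12203,protocol=tcp \
  --publish mode=host,target=12203,published=12203,protocol=udp \
  docker.elastic.co/logstash/logstash:5.3.0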

Output of docker version:

Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:06 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:06 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info on master:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 17.05.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 41
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: rus10aj9e62s5kdiqpu8rdp6t
 Is Manager: true
 ClusterID: hqvohft3etj4ajnkgubbnjwzp
 Managers: 4
 Nodes: 18
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: <ip1-here>
 Manager Addresses:
  <ip2-here>:2377
  <ip3-here>:2377
  <ip1-here>:2377
  <ip4-here>:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Kernel Version: 3.16.0-4-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.963GiB
Name: dmgr-01
ID: VUNH:H6FP:CO4N:O6VB:CFCG:4T32:JQQV:SIS3:UGT7:V2FK:46VS:PRKZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: filiatixbot
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support

Output of docker info on worker:

Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 17
Server Version: 17.05.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 137
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: qcwrqbby4jce906gjm1h1f3ts
 Is Manager: false
 Node Address: <ip1-here>
 Manager Addresses:
  <ip2-here>:2377
  <ip3-here>:2377
  <ip4-here>:2377
  <ip5-here>:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 apparmor
Kernel Version: 4.4.0-47-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.67GiB
Name: dwrk-service-elk-01
ID: ZEOA:4TGE:ULAT:MZMP:SDYG:I3XT:ZHWC:OLWR:LSXF:6A73:ERN6:5QIB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: filiatixbot
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

Docker Swarm mode running on hybrid infrastructure with 4 master nodes and ~10 workers. Upgraded from 17.04.

Most upvoted comments

I’m having a similar issue with a 4-node (Ubuntu) swarm running about 60 services (mainly logstash, influx, grafana, nginx in vip mode) after upgrading from 17.04 to 17.05: for some services the published port is not always available on some nodes. Redeploying the service or constraining it to a node doesn’t solve it; a rolling restart of the swarm makes it right again. [later edit] I had to downgrade to 17.04 and re-create the swarm, as the issue cascaded to the point where most services were unreachable.

I had the same problem after upgrading from 17.03 to 17.05. Like @one1zero1one, I had to completely destroy the swarm and re-create it from scratch. While it was still broken, I could see that the DOCKER-INGRESS filter chain in iptables was actually empty. I tried manually inserting suitable rules, but that wasn’t enough to get ingress working again.
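
For anyone else debugging this, the chain can be checked on each node like so (a quick sketch; on a healthy node DOCKER-INGRESS exists in both the filter and nat tables and should carry rules for each published port):

iptables -L DOCKER-INGRESS -n
iptables -t nat -L DOCKER-INGRESS -n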

(FYI, many of the services in my swarm had been created manually, and I didn’t know of a way to export the service definitions. So I wrote a tool that dumps the services of a running swarm into docker-compose v3.2 files, ready for deployment into a new swarm via docker stack deploy. That helped enormously in getting back up and running. I will publish the tool soon!)
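
(Until it’s published: even a raw dump of the specs helps. This sketch produces plain service specs rather than compose files, so it is not a substitute for the tool:)

# dump every service's spec as JSON, one file per service ID
for s in $(docker service ls -q); do
  docker service inspect "$s" --format '{{json .Spec}}' > "$s.json"
done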

I have issues with ingress too.

docker network create -d overlay web3

docker service create \
--network web3 \
--publish published=80,target=80 \
nginx

telnet 127.0.0.1 80
Trying 127.0.0.1...
telnet: Unable to connect to remote host: Connection refused

If I use mode=host, it works:

docker service rm {service_name}
docker service create \
--network web3 \
--publish published=80,target=80,mode=host \
nginx

Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.

I use firewalld but it is disabled.

Server Version: 17.12.1-ce

EDIT: I am unable to reproduce this on 2 KVM virtual machines… I think the issue is related to the hosting platform (Scaleway, non-standard kernel).

Same here with 18.01 and Ubuntu 16.04; an update would be much appreciated.

Last night I dropped my cluster, re-created it, and re-deployed all services with the same playbooks - port publishing works as expected again.
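
(Roughly this sequence, for anyone in the same spot; a sketch only, adapt the address and the re-deployment step to your own setup:)

# on every node: tear down the old swarm state
docker swarm leave --force
# on the first manager: initialise a fresh swarm
docker swarm init --advertise-addr <ip1-here>
# print the worker join command, then run it on each remaining node
docker swarm join-token worker
# finally, re-run the service creation commands (or playbooks)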

I think this was my last upgrade. Each new release brings more problems to the system.