moby: Docker swarm randomly stops taking connections

Description

I have two swarm managers and one worker node in a testing/dev environment, and every so often, at random times, the node just stops accepting connections on the published ports. The Docker service is still running and is not logging any errors. If I restart the Docker service on the node, everything starts back up and works again, but only for a few days before it stops again. I thought it might be the firewall, so I turned it off, but the problem still happens even with the firewall disabled. Is anyone else having this issue?

Steps to reproduce the issue: I am unable to reproduce this; it happens at random times.

Describe the results you received: I cannot connect to any of the published ports or any of the services.

Describe the results you expected: For the service to be running properly.

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 8
 Running: 8
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 17.03.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: 53cwvu71y8c2cso3cg6ojb4fz
 Is Manager: false
 Node Address: 10.0.1.0
 Manager Addresses:
  10.0.0.2:2377
  10.0.0.3:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-62-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 488.4 MiB
Name: sms-swarm-01
ID: EGUA:D6CP:ZERE:BHEU:YUBE:GTK3:VGEU:R3Z4:VVVC:B7IZ:UCIY:3ATE
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.): Running Ubuntu 16.04 on Digital Ocean on all nodes.

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 7
  • Comments: 20 (5 by maintainers)

Most upvoted comments

Just had a similar problem on 17.05.0-ce, build 89658be

In production, one of our applications started having connection timeout issues. The replicas were sometimes flapping because the application healthcheck wasn't passing, yet all of the application's dependencies (MySQL, Redis, etc.) were running fine, and the healthcheck itself worked when run locally inside the containers.

We were able to reproduce the timeout from the docker node itself with curl 0.0.0.0:{servicePort}
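The check above can be scripted so the node probes its own published port repeatedly. This is a sketch of that idea, not from the original report; probe_service and the host/port arguments are hypothetical names, and it uses bash's /dev/tcp pseudo-device rather than curl:

```shell
# Probe a TCP port from the node itself, as the commenter did with curl.
# The timeout guards against the silent-hang symptom described above
# (the connection stalls rather than being refused).
probe_service() {
  local host=${1:-0.0.0.0} port=${2:-80}
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}
```

Run in a loop (e.g. from cron) and logged, this would at least pin down when the ports stop answering, which the thread notes is otherwise invisible in the logs.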

Since it was impossible at that point to know which container might be failing, we scaled the service down and back up to recreate all the containers. Same problem.

So we suspected one of the docker nodes had a network issue. We drained the nodes one at a time, but the problem persisted through all of them.

So we finally ran docker service rm on the application service from the swarm manager and redeployed it; problem solved.
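The remove-and-redeploy workaround could look roughly like the following. This is a sketch only: the service name, image, replica count, and flags are placeholders, since the original deployment flags aren't given in the thread, and any extra options the service was created with would need to be repeated:

```shell
# Workaround from the comment above: remove the service and recreate it,
# which forces swarm to rebuild its ingress load-balancing state.
redeploy_service() {
  local service=$1 image=$2 port=$3
  docker service rm "${service}"
  # Recreate with the same published port; replica count and any other
  # flags must match whatever the original deployment used.
  docker service create --name "${service}" \
    --replicas 2 \
    --publish "${port}:${port}" \
    "${image}"
}
```

Note this causes downtime for the service while it is removed and recreated, unlike a rolling docker service update.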

Some kind of network glitch in the swarm load-balancing layer? No clue. First time it has occurred in 2 years.

Unfortunately, I have no logs to provide at all. There was nothing of interest in our whole logging stack regarding this issue. All I can tell is that the container healthcheck was timing out, and it was the only application in the cluster with that problem.

¯\_(ツ)_/¯

I am having the same issue with Docker 17.03 on Ubuntu on Azure.

I don’t have swarm mode enabled. Just a single docker node.

An nginx container is running with ports bound 80->80 and 443->443:

45456d68f7a7        nginx:latest    "/bin/sh -c 'nginx..."   3 hours ago         Up 3 hours          0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp   nginx

At some point (not sure exactly when; I have not looked into it much) I can't reach the container through the eth0 interface.

nc -vz 10.1.4.223 80
nc: connect to 10.1.4.223 port 80 (tcp) failed: Connection refused

Localhost works fine:

nc -vz localhost 80
Connection to localhost 80 port [tcp/http] succeeded!
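When localhost answers but the external interface does not, the port-forwarding path (iptables NAT rules and/or docker-proxy) is the natural suspect. A couple of read-only checks, sketched here as an assumption rather than anything the commenter ran (they require root and an iptables-based setup):

```shell
# Inspect the forwarding path when the external interface stops answering.
check_port_forwarding() {
  # The DOCKER chain in the nat table should contain DNAT rules for the
  # published ports (80 and 443 in the nginx example above).
  sudo iptables -t nat -nL DOCKER
  # With the default userland proxy enabled, docker-proxy should be
  # listening on the published ports.
  sudo ss -lntp | grep docker-proxy
}
```

If the DNAT rules are missing while the container is still running, that would point at the daemon's iptables state being lost (e.g. after a firewall restart), which matches the original reporter's suspicion about the firewall.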

I actually forgot about this issue. I have not experienced it anymore on newer versions of Docker (though I have hit other issues, especially with Docker 18.09 on boot2docker), so I am closing this.

Same issue here with Docker version 17.11.0-ce, build 1caf76c.