moby: Swarm Mode: Overlay networks intermittently stop working

Description

I’m running a Docker swarm cluster on CentOS 7 with the LTS kernel, on brand-new machines. I’ve been deploying and updating services for about an hour (iterating on testing changes). I created a single overlay network that I called app, and everything communicates over it, so the config for all my containers has, for example, someotherapplocation=someother.app.
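The setup described above can be sketched roughly as follows. Only the network name app and the someotherapplocation variable come from the report; the service and image names are hypothetical, and this requires an initialized swarm:

```shell
# On the manager: create the swarm-scoped overlay network
docker network create --driver overlay app

# Deploy services attached to it; swarm mode's embedded DNS lets
# tasks on the same overlay network discover each other by service name
docker service create --name someother --network app someorg/someother:latest
docker service create --name web --network app \
  --env someotherapplocation=someother.app \
  someorg/web:latest
```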

I have three nodes in the swarm: two workers and one manager. After about an hour, the two workers started giving me the error “network app doesn’t exist”. However, docker network ls still shows the overlay network app on all nodes. Meanwhile, no DNS names resolve inside the existing containers on the worker nodes, and eventually those containers all crash; swarm restarts them, and the restarts fail with the above error.

Steps to reproduce the issue:

  1. Create a swarm
  2. Start services, make sure you have services on each node all using the same overlay network, update them. Wait and hope you get lucky
  3. Observe that the workers can no longer use the app network
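A rough way to observe the failure on an affected worker (container and service names are placeholders):

```shell
# The network is still listed even though tasks report it missing
docker network ls --filter name=app
docker network inspect app

# Inside a still-running container on that node, service-name DNS fails
docker exec -it <container_id> nslookup someother

# On the manager, failed tasks show the "network app" error
docker service ps <service_name> --no-trunc
```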

Describe the results you received: See above

Describe the results you expected: I expect overlay networks to work in Swarm Mode

Additional information you deem important (e.g. issue happens only occasionally): Happens intermittently. I am running on EC2, all on the same VPC, subnet, and security group, but I have also seen this happen locally with KVM machines on my dev workstation.

Output of docker version:

Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 17
 Running: 1
 Paused: 0
 Stopped: 16
Images: 4
Server Version: 1.12.3
Storage Driver: overlay
 Backing Filesystem: xfs
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge overlay null host
Swarm: active
 NodeID: 70ycgf7413u2l39ts9f69dxxg
 Is Manager: false
 Node Address: 10.33.11.19
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 4.4.31-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.858 GiB
Name: ip-10-33-11-19.blah.us.example.com
ID: UJQH:47E5:BBZX:MZDZ:RS7D:MSFG:G4RJ:P56T:M3P2:RZXF:5GJR:UZA2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Labels:
 com.example.region=ca
 com.example.country=us
 com.example.hostname=dockerswarmtest-11-19-joeljohnson
 com.example.domain=dockerswarmtest-11-19-joeljohnson.ca.us.example.com
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

AWS, also have seen using Vagrant + libvert + kvm

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Reactions: 11
  • Comments: 18 (4 by maintainers)

Most upvoted comments

+1 on this. I’ve been experiencing it for a while but was unable to reliably recreate it (should have just posted anyway)… I have found that I hit this issue in two different cases:

  1. While using an overlay network (i.e., adding the --network proxy option on service creation, after creating a proxy overlay network), it happens to me quite often. So often, in fact, that I couldn’t use the system in production without a script to check for failures, burn down the cluster, and restart it.

  2. I have also tried my services without the --network option, and my containers can access each other just fine even without an overlay network. (Not sure if this is expected behavior; if an overlay network is designed to let containers in a swarm communicate, but they can do so without it, am I missing something?)

I’ve been dealing with both of these cases for a few weeks in a cluster on AWS, set up with docker-machine. I can provide more details; just let me know what you would like to see. I’m using Docker 1.12.3.

The same here, Docker 1.12.6 on Ubuntu 16.04 in AWS.

Services access each other by service name as the host address. For example, NGINX accesses the uWSGI Python servers with uwsgi_pass api_staging:3031;, and the Python server accesses PostgreSQL via SQLALCHEMY_DATABASE_URI=postgresql://xxx:xxx@pg_staging/xxx, where docker service ls looks like:

ID            NAME            REPLICAS  IMAGE                                                                                                     COMMAND
2q3s53ahuxcp  pg_staging      1/1       ...
891qfjfxjwpo  web_staging     2/2       ...
d83kxfzdlozv  api_staging     8/8       ...

All services are attached to this overlay network:

NETWORK ID          NAME                DRIVER              SCOPE
...
duz304pwarm9        staging             overlay             swarm               

Logs use the awslogs driver; some typical errors look like this:

## NGINX to Python
10.255.0.3 - xxx [16/Jan/2017:08:25:25 +0000] "GET /xxx/xxx/ HTTP/1.1" 502 575 "https://xxx.xxx.xxx/xxx/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36" "xxx.xxx.xxx.xxx"

## Python to PostgreSQL
psycopg2.OperationalError: server closed the connection unexpectedly

Meanwhile, AWS load balancer loses the NGINX services from time to time too, with 504 Gateway Time-out.

I have the same issue. It occurred on 1.12.3, and after upgrading to 1.12.4 it is all the same. Containers in the overlay network see each other for 2-3 hours and then randomly lose connections.

I experience the same problem. I try Docker Swarm now and then, but I always have to backtrack because the networking simply isn’t reliable.

@mrjana I have checked the logs of the Docker daemon that is running the unreachable container, and there really is no useful info. I can enter a bash shell in one of the containers on the same network as the unreachable container, and when I curl it, the request just hangs and times out after a while. The logs show nothing at all regarding this access. My guess is there are more logs covering the routing mesh where we might find more useful information; do you know where those are?
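For reference, the checks described in this comment might look like the following (the container name, service address, and port are placeholders):

```shell
# From a healthy container on the same overlay network: the request
# to the unreachable service hangs, then times out
docker exec -it <working_container> curl -m 10 http://<unreachable_service>:8080/

# Daemon logs on the node running the unreachable container
# (this host uses the journald logging driver, per docker info above)
journalctl -u docker.service --since "1 hour ago"
```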