moby: Swarm Mode: Overlay networks intermittently stop working
Description
I’m running a Docker swarm cluster on CentOS 7 with the LTS kernel. They’re brand-new machines. I’ve been deploying and updating services for about an hour (iterating on testing changes). I have one overlay network I created, called `app`, and everything communicates over the `app` network. So the config for all my containers has, for example, `someotherapplocation=someother.app`.
I have three nodes in the swarm: two workers and one master. After about an hour, the two workers started giving me the error “network `app` doesn’t exist”. However, `docker network ls` shows the overlay network `app` on all nodes. Meanwhile, no DNS names resolve in existing containers on the working nodes. Eventually they all crash, swarm restarts them, and the restarts fail with the above error.
Steps to reproduce the issue:
- Create a swarm
- Start services on each node, all using the same overlay network, and update them. Wait and hope you get lucky
- Observe that the workers can no longer use the `app` network
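For reference, a minimal reproduction along the lines of the steps above might look like the following sketch (the `web`/`api` service names and images are placeholders, not from the original report):

```shell
# On the manager: initialize the swarm and create the overlay network.
docker swarm init --advertise-addr 10.33.11.10
docker network create --driver overlay app

# On each worker: join using the token printed by `docker swarm init`.
# docker swarm join --token <worker-token> 10.33.11.10:2377

# Back on the manager: start replicated services on the overlay so that
# tasks land on every node, then update them repeatedly.
docker service create --name web --network app --replicas 3 nginx
docker service create --name api --network app --replicas 3 myorg/api
docker service update --image myorg/api:latest api
```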
Describe the results you received: See above
Describe the results you expected: I expect overlay networks to work in Swarm Mode
Additional information you deem important (e.g. issue happens only occasionally): Happens intermittently. I am running on EC2, all in the same VPC, subnet, and security group. But I have also seen this happen locally with KVM machines on my dev workstation.
Output of `docker version`:

```
Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:
 OS/Arch:      linux/amd64
```
Output of `docker info`:

```
Containers: 17
 Running: 1
 Paused: 0
 Stopped: 16
Images: 4
Server Version: 1.12.3
Storage Driver: overlay
 Backing Filesystem: xfs
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge overlay null host
Swarm: active
 NodeID: 70ycgf7413u2l39ts9f69dxxg
 Is Manager: false
 Node Address: 10.33.11.19
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 4.4.31-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.858 GiB
Name: ip-10-33-11-19.blah.us.example.com
ID: UJQH:47E5:BBZX:MZDZ:RS7D:MSFG:G4RJ:P56T:M3P2:RZXF:5GJR:UZA2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Labels:
 com.example.region=ca
 com.example.country=us
 com.example.hostname=dockerswarmtest-11-19-joeljohnson
 com.example.domain=dockerswarmtest-11-19-joeljohnson.ca.us.example.com
Insecure Registries:
 127.0.0.0/8
```
Additional environment details (AWS, VirtualBox, physical, etc.):
AWS; have also seen it using Vagrant + libvirt + KVM
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 11
- Comments: 18 (4 by maintainers)
+1 on this, I’ve been experiencing it for a while but was unable to reliably recreate it (should have just posted anyway)… I have found that I experience this issue in two different cases:

- While using an overlay network (i.e. adding the `--network proxy` option on service creation, after creating a `proxy` overlay), I find that it actually happens quite often, so much so that I couldn’t use the system in production without a script to check for failures, burn down the cluster, and restart it.
- I have also tried my services without the `--network` option, and my containers can access each other just fine even without an overlay network (not sure if this is expected behavior; if an overlay is designed to let containers in a swarm communicate, but they can do so without it, then am I missing something?).

In both cases I’ve been dealing with this for a few weeks in a cluster on AWS, using `docker-machine` to set up the cluster. I can provide more details, just let me know what you would like to see. I’m using Docker 1.12.3.

The same here, Docker 1.12.6 on Ubuntu 16.04 in AWS.
Services access each other by service name as the host address. For example, NGINX reaches the uWSGI Python servers via `uwsgi_pass api_staging:3031;`, and the Python server reaches PostgreSQL via `SQLALCHEMY_DATABASE_URI=postgresql://xxx:xxx@pg_staging/xxx`, where `docker service ls` looks like:

All services are attached to this overlay network:

Logs are using the `awslogs` driver; some typical errors look like this:

Meanwhile, the AWS load balancer loses the NGINX services from time to time too, with `504 Gateway Time-out`.
I have the same issue. It happened on 1.12.3, and after upgrading to 1.12.4 it’s all the same. Containers in the overlay network see each other for 2–3 hours and then randomly lose connections.
I experience the same problem. I try Docker swarm now and then, but I always have to backtrack because networking simply isn’t working reliably.
@mrjana I have checked the logs of the Docker daemon that is running the unreachable container, and there really is no useful info. I can open a bash shell in one of the containers on the same network as the unreachable container, and when I curl it, the request just hangs and times out after some time. The logs show nothing at all about this access. My guess is there are additional logs covering the routing mesh where we might find more useful information; do you know where those are?
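For debugging hangs like the one described above, one common first step is to check whether the swarm’s embedded DNS and the service VIP are still consistent from inside an affected container. A hedged sketch, assuming a service named `api` on the shared overlay (names are placeholders, and the tools available inside the container depend on its image):

```shell
# Open a shell in a container that should be able to reach the service.
docker exec -it <container-id> sh

# Inside the container: the service name should resolve to its VIP via
# the embedded DNS server at 127.0.0.11. (On later engine releases,
# tasks.<service> additionally resolves to the individual task IPs.)
nslookup api

# On a manager node: compare with the VIP swarm assigned to the
# service on each network it is attached to.
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' api
```

If the name resolves but connections still hang, the VXLAN data path between nodes (UDP port 4789, plus the gossip ports 7946 tcp/udp) is the next thing to check.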