moby: Docker host stops forwarding requests from a service's published external port into its overlay network
I have a swarm cluster running 17.10.0-ce with a lot of overlay networks configured on it (docker network ls):
NETWORK ID NAME DRIVER SCOPE
vfy20700nst4 ashburn_default overlay swarm
ofbwxri81ksu automationhub_default overlay swarm
89ua249d1vyk bambrain_bam overlay swarm
xecb7jjdpxgk bambrain_default overlay swarm
1sjorpddaull bamsite_bam overlay swarm
vhugnem8ji74 bamsite_default overlay swarm
i6lqqsiand8b bamweb_bam overlay swarm
ttkudolbfqay bamweb_default overlay swarm
119408fb04ca bridge bridge local
mtc1mw0wfk1y data-science-research_default overlay swarm
pkxgh1zrbqwm data-science_default overlay swarm
a7d5a27b1c4e docker_gwbridge bridge local
08b614c05eed host host local
fgsk9gwbuoxa ingress overlay swarm
6upju39rz8u3 logzio overlay swarm
qcchirkkt2v6 nibam_default overlay swarm
a066293749a2 none null local
oqds535acu47 production_backend_default overlay swarm
w5ondcslnzau production_bb_blocks overlay swarm
fv19qc0hc3uo production_bst_boost overlay swarm
qrqk3lo455f6 production_bst_default overlay swarm
fk2rbi1acydj production_cirrus overlay swarm
2nfpyslcjaza production_cms overlay swarm
ro6iwgwk83a8 production_data-validation overlay swarm
vfhc72n96vjo production_default overlay swarm
w96csr2l8xfs production_et_default overlay swarm
kzyinrocv44l production_fm_default overlay swarm
ojzhjud10p4c production_frontend_adapter overlay swarm
twqs12on7ljd production_frontend_default overlay swarm
f0xsofyz3ijr production_frontend_stitcher overlay swarm
hiejht78hb24 production_nm_default overlay swarm
tm1a4lil44fj production_publish overlay swarm
w5nkybfp0sdx production_redirect-rts overlay swarm
z7m5vj07v4oh production_rule_default overlay swarm
wkoazirudybz production_sf-gateway overlay swarm
8474afrhz67q production_st_default overlay swarm
7topfxx5tyqq renderer-ssr-front_default overlay swarm
25wq5tlqw2lh renderer-ssr-front_renderer overlay swarm
r1auaih8djsd renderer-workers_renderer-worker overlay swarm
2sqrk8xikvp8 staging_backend_default overlay swarm
xlf3xkj77ozc swarmpit_net overlay swarm
r0dbnjrjs2se viz_viz overlay swarm
After a couple of days of running a service that gets a lot of updates via stack deploy, some of the host servers stop forwarding HTTP requests from the service's published external port into the overlay network port.
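To make the symptom concrete, it can be checked like this on an affected host: the published port hangs, while the task should still answer from inside its own namespace (the container ID is a placeholder; the ports and healthcheck path are from the service definition below):
# published port on the host: hangs
curl -m 5 http://localhost:8104/healthcheck
# the task itself still answers when reached directly
docker ps --filter name=backend-application --format '{{.ID}} {{.Names}}'
docker exec <task-container-id> curl -sf http://localhost:6021/healthcheck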
This is the service's YAML definition:
version: '3.4'
services:
  backend-application:
    image: backend-application:latest
    ports:
      - "8104:6021"
    networks:
      - backend-application
    deploy:
      placement:
        constraints: [ engine.labels.prod == backend-large ]
      replicas: 2
      restart_policy:
        condition: any
        delay: 5s
      update_config:
        parallelism: 1
        failure_action: rollback
        delay: 30s
      resources:
        limits:
          cpus: '1'
          memory: 700M
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6021/healthcheck"]
      interval: 30s
      timeout: 15s
      retries: 3
      start_period: 90s
networks:
  backend-application:
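For context, each update is just a redeploy of this file, something along these lines (the stack name and file path are placeholders for the real ones):
# redeploy the stack; swarm then rolls the tasks per update_config
docker stack deploy -c backend-application.yml production_backend
docker service ps production_backend_backend-application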
This is the worker info:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 2
Server Version: 17.10.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: jsn6i06wij4dqnp9d66fhx2ws
Is Manager: false
Node Address: 172.19.34.55
Manager Addresses:
172.19.18.32:2377
172.19.27.231:2377
172.19.38.28:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 0351df1c5a66838d0c392b4ac4cf9450de844e2d
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-1022-aws
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.795GiB
Name: ip-172-19-34-55
ID: PI42:ZVCS:F56G:DRXW:6XQF:E6RZ:TIXR:AL4P:ODGK:EL3I:CFWQ:4IGA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
boost-prod=back-large
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
This is the docker version output:
Client:
Version: 17.10.0-ce
API version: 1.33
Go version: go1.8.3
Git commit: f4ffd25
Built: Tue Oct 17 19:04:16 2017
OS/Arch: linux/amd64
Server:
Version: 17.10.0-ce
API version: 1.33 (minimum version 1.12)
Go version: go1.8.3
Git commit: f4ffd25
Built: Tue Oct 17 19:02:56 2017
OS/Arch: linux/amd64
Experimental: false
In the daemon logs I don't see anything weird going on, only info-level node join events.
The host is listening on the published external port:
~# netstat -ntulp | grep 8104
tcp6 0 0 :::8104 :::* LISTEN 9125/dockerd
These are the relevant iptables rules:
ACCEPT tcp -- anywhere anywhere tcp dpt:8104
ACCEPT tcp -- anywhere anywhere state RELATED,ESTABLISHED tcp spt:8104
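For completeness, the routing-mesh path for the published port can also be checked in the nat table and in the hidden ingress namespace, where traffic for the port gets marked and IPVS balances it to the tasks. A sketch of those checks, assuming the DOCKER-INGRESS chain and ingress_sbox netns names swarm normally creates, and that ipvsadm is installed:
# DNAT from the published port into the ingress network
iptables -t nat -nvL DOCKER-INGRESS | grep 8104
# inside the hidden ingress load-balancer namespace: the mangle mark and the IPVS service
ls /var/run/docker/netns
nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -nL PREROUTING
nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -Ln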
And yet, if I run curl localhost:8104 on this host, it hangs, even after restarting the Docker daemon.
If I restart networking and then the Docker daemon, the daemon starts forwarding requests from the external port to the overlay network port again.
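Concretely, the workaround is along these lines (assuming the stock systemd units; "networking" here is the ifupdown networking.service):
# restore forwarding on an affected host: networking first, then the daemon
systemctl restart networking
systemctl restart docker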
This is running on Ubuntu 16.04 in AWS.
I saved the iptables rules and IP configuration before and after; if they can help, I can attach them here.
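The snapshots amount to roughly the following; the exact commands matter less than having a before/after diff:
# dump firewall and addressing state for later comparison
iptables-save > /tmp/iptables-before.txt
ip addr show > /tmp/ip-addr-before.txt
ip route show > /tmp/ip-route-before.txt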
dmesg shows:
[84558.441965] docker_gwbridge: port 2(vethe690172) entered disabled state
[84558.492357] veth486fd0c: renamed from veth0
[84558.566632] vetha0a1e33: renamed from eth0
[84558.616616] veth1cf86d4: renamed from eth0
[84558.654359] docker_gwbridge: port 4(vethd8418f4) entered disabled state
[84558.654536] veth0567f9c: renamed from eth1
[84558.684552] docker_gwbridge: port 4(vethd8418f4) entered disabled state
[84558.689344] device vethd8418f4 left promiscuous mode
[84558.689348] docker_gwbridge: port 4(vethd8418f4) entered disabled state
[84558.736830] docker_gwbridge: port 1(vethb263372) entered disabled state
[84558.736871] vethf3b536e: renamed from eth1
[84558.805134] docker_gwbridge: port 1(vethb263372) entered disabled state
[84558.808242] device vethb263372 left promiscuous mode
[84558.808245] docker_gwbridge: port 1(vethb263372) entered disabled state
[84567.699638] IPVS: Creating netns size=2192 id=11
[84567.761658] IPVS: Creating netns size=2192 id=12
[84567.798329] br0: renamed from ov-001000-fgsk9
[84567.820357] vxlan0: renamed from vx-001000-fgsk9
[84567.836310] device vxlan0 entered promiscuous mode
[84567.836462] br0: port 1(vxlan0) entered forwarding state
[84567.836468] br0: port 1(vxlan0) entered forwarding state
[84567.885062] veth0: renamed from vethd47eb8b
[84567.900320] device veth0 entered promiscuous mode
[84567.900441] br0: port 2(veth0) entered forwarding state
[84567.900445] br0: port 2(veth0) entered forwarding state
[84567.980442] eth0: renamed from veth7b82c66
[84568.018716] device veth14fc3a1 entered promiscuous mode
[84568.018770] IPv6: ADDRCONF(NETDEV_UP): veth14fc3a1: link is not ready
[84568.018775] docker_gwbridge: port 1(veth14fc3a1) entered forwarding state
[84568.018781] docker_gwbridge: port 1(veth14fc3a1) entered forwarding state
[84568.018820] docker_gwbridge: port 1(veth14fc3a1) entered disabled state
[84568.032556] eth1: renamed from vethfcbc975
[84568.044463] IPv6: ADDRCONF(NETDEV_CHANGE): veth14fc3a1: link becomes ready
[84568.044491] docker_gwbridge: port 1(veth14fc3a1) entered forwarding state
[84568.044497] docker_gwbridge: port 1(veth14fc3a1) entered forwarding state
[84582.876068] br0: port 1(vxlan0) entered forwarding state
[84582.940065] br0: port 2(veth0) entered forwarding state
[84583.068076] docker_gwbridge: port 1(veth14fc3a1) entered forwarding state
[84594.298784] IPVS: Creating netns size=2192 id=13
[84594.340829] IPVS: Creating netns size=2192 id=14
[84594.382132] br0: renamed from ov-001001-6upju
[84594.412418] vxlan0: renamed from vx-001001-6upju
[84594.428327] device vxlan0 entered promiscuous mode
[84594.428496] br0: port 1(vxlan0) entered forwarding state
[84594.428501] br0: port 1(vxlan0) entered forwarding state
[84594.480585] veth0: renamed from vethc3a2eff
[84594.500273] device veth0 entered promiscuous mode
[84594.500403] br0: port 2(veth0) entered forwarding state
[84594.500408] br0: port 2(veth0) entered forwarding state
[84594.576362] eth0: renamed from veth8ee96c8
[84594.632612] device veth07bd5bb entered promiscuous mode
[84594.632677] IPv6: ADDRCONF(NETDEV_UP): veth07bd5bb: link is not ready
[84594.632682] docker_gwbridge: port 2(veth07bd5bb) entered forwarding state
[84594.632689] docker_gwbridge: port 2(veth07bd5bb) entered forwarding state
I found an old issue, https://github.com/moby/moby/issues/20716, that describes similar behaviour. https://github.com/moby/moby/issues/35807 is also similar.
Also, why did the Docker veth leave promiscuous mode and enter a disabled state? I don't see anything about it in the daemon logs.
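If it helps narrow this down, the per-port state of docker_gwbridge can be inspected directly the next time a host gets into this state (bridge and ip are from iproute2; brctl, if installed, gives the same view):
# which veths are attached to the gateway bridge and whether they are forwarding or disabled
bridge link show | grep docker_gwbridge
ip -d link show docker_gwbridge
brctl show docker_gwbridge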
About this issue
- Original URL
- State: open
- Created 7 years ago
- Reactions: 11
- Comments: 24 (2 by maintainers)
We are using 18.02 and see this same error on all nodes after some time.
Still happens with 20.10.2 almost daily. We are using Swarm. Maybe that gives a hint.
Any update or workaround?
Any updates on this? I am having a similar issue with Ubuntu 20.04 and Docker 20.10.17 with Swarm 😔
Same problem on Ubuntu 20.04 and Docker 20.10.15 with Swarm. When restarting/redeploying services, most of the time everything works fine but randomly (maybe 1 out of 10 times) our web app stops getting requests. We checked firewall settings, DNS caches, log files, port mappings and everything is normal.
This is happening on CentOS 9 with the latest Docker version. It has been happening every day at the same time. Need a fix…
The issue is reproducible on Docker Engine 19.03.6 with Ubuntu 16.04.6 LTS. It definitely seems like a Docker issue: the port is open, there is no firewall, and
sudo netstat -ntlp
shows docker-proxy bound to all addresses for that port for both tcp4 and tcp6, etc. We noticed that on the first deploy it works fine; it's only on subsequent docker-compose up -d runs with many services (as noted above) that we run into the issue. Worst of all, it's intermittent: some containers are fine and some hang on curl.
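A check that might narrow this down is to bypass the docker-proxy / NAT path and hit the container directly on its bridge IP (container name and ports are placeholders):
# the published port, which intermittently hangs
curl -m 5 http://127.0.0.1:<published-port>/
# find the container's network IPs and hit the internal port directly
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <container-name>
curl -m 5 http://<container-ip>:<internal-port>/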
Same problem with 19.03.02 after starting a larger number of new services in a short time frame (around 40 services within 1 minute). Unfortunately, there is no more information available at the moment for reproducing the issue.