moby: Docker host stops forwarding requests from a service's published external port into its overlay network
I have a swarm cluster running 17.10.0-ce with a lot of overlay networks configured on it (docker network ls):
NETWORK ID NAME DRIVER SCOPE
vfy20700nst4 ashburn_default overlay swarm
ofbwxri81ksu automationhub_default overlay swarm
89ua249d1vyk bambrain_bam overlay swarm
xecb7jjdpxgk bambrain_default overlay swarm
1sjorpddaull bamsite_bam overlay swarm
vhugnem8ji74 bamsite_default overlay swarm
i6lqqsiand8b bamweb_bam overlay swarm
ttkudolbfqay bamweb_default overlay swarm
119408fb04ca bridge bridge local
mtc1mw0wfk1y data-science-research_default overlay swarm
pkxgh1zrbqwm data-science_default overlay swarm
a7d5a27b1c4e docker_gwbridge bridge local
08b614c05eed host host local
fgsk9gwbuoxa ingress overlay swarm
6upju39rz8u3 logzio overlay swarm
qcchirkkt2v6 nibam_default overlay swarm
a066293749a2 none null local
oqds535acu47 production_backend_default overlay swarm
w5ondcslnzau production_bb_blocks overlay swarm
fv19qc0hc3uo production_bst_boost overlay swarm
qrqk3lo455f6 production_bst_default overlay swarm
fk2rbi1acydj production_cirrus overlay swarm
2nfpyslcjaza production_cms overlay swarm
ro6iwgwk83a8 production_data-validation overlay swarm
vfhc72n96vjo production_default overlay swarm
w96csr2l8xfs production_et_default overlay swarm
kzyinrocv44l production_fm_default overlay swarm
ojzhjud10p4c production_frontend_adapter overlay swarm
twqs12on7ljd production_frontend_default overlay swarm
f0xsofyz3ijr production_frontend_stitcher overlay swarm
hiejht78hb24 production_nm_default overlay swarm
tm1a4lil44fj production_publish overlay swarm
w5nkybfp0sdx production_redirect-rts overlay swarm
z7m5vj07v4oh production_rule_default overlay swarm
wkoazirudybz production_sf-gateway overlay swarm
8474afrhz67q production_st_default overlay swarm
7topfxx5tyqq renderer-ssr-front_default overlay swarm
25wq5tlqw2lh renderer-ssr-front_renderer overlay swarm
r1auaih8djsd renderer-workers_renderer-worker overlay swarm
2sqrk8xikvp8 staging_backend_default overlay swarm
xlf3xkj77ozc swarmpit_net overlay swarm
r0dbnjrjs2se viz_viz overlay swarm
After a couple of days of running a service that gets a lot of updates via stack deploy, some of the host servers stop forwarding HTTP requests from the service's published external port into the overlay network port.
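To make the symptom concrete, it can be checked like this on an affected host: the published port hangs, while the task should still answer from inside its own namespace (the container ID is a placeholder; the ports and healthcheck path are from the service definition below):
# published port on the host: hangs
curl -m 5 http://localhost:8104/healthcheck
# the task itself still answers when reached directly
docker ps --filter name=backend-application --format '{{.ID}} {{.Names}}'
docker exec <task-container-id> curl -sf http://localhost:6021/healthcheck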
This is the service's YAML definition:
version: '3.4'
services:
  backend-application:
    image: backend-application:latest
    ports:
      - "8104:6021"
    networks:
      - backend-application
    deploy:
      placement:
        constraints: [ engine.labels.prod == backend-large ]
      replicas: 2
      restart_policy:
        condition: any
        delay: 5s
      update_config:
        parallelism: 1
        failure_action: rollback
        delay: 30s
      resources:
        limits:
          cpus: '1'
          memory: 700M
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6021/healthcheck"]
      interval: 30s
      timeout: 15s
      retries: 3
      start_period: 90s
networks:
  backend-application:
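For context, each update is just a redeploy of this file, something along these lines (the stack name and file path are placeholders for the real ones):
# redeploy the stack; swarm then rolls the tasks per update_config
docker stack deploy -c backend-application.yml production_backend
docker service ps production_backend_backend-application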
This is the worker info:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 2
Server Version: 17.10.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: jsn6i06wij4dqnp9d66fhx2ws
Is Manager: false
Node Address: 172.19.34.55
Manager Addresses:
172.19.18.32:2377
172.19.27.231:2377
172.19.38.28:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 0351df1c5a66838d0c392b4ac4cf9450de844e2d
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-1022-aws
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.795GiB
Name: ip-172-19-34-55
ID: PI42:ZVCS:F56G:DRXW:6XQF:E6RZ:TIXR:AL4P:ODGK:EL3I:CFWQ:4IGA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
boost-prod=back-large
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
This is the docker version output:
Client:
Version: 17.10.0-ce
API version: 1.33
Go version: go1.8.3
Git commit: f4ffd25
Built: Tue Oct 17 19:04:16 2017
OS/Arch: linux/amd64
Server:
Version: 17.10.0-ce
API version: 1.33 (minimum version 1.12)
Go version: go1.8.3
Git commit: f4ffd25
Built: Tue Oct 17 19:02:56 2017
OS/Arch: linux/amd64
Experimental: false
In the daemon logs I don't see anything weird going on, only info-level node join events.
The host is listening on the published external port:
~# netstat -ntulp | grep 8104
tcp6 0 0 :::8104 :::* LISTEN 9125/dockerd
These are the relevant iptables rules:
ACCEPT tcp -- anywhere anywhere tcp dpt:8104
ACCEPT tcp -- anywhere anywhere state RELATED,ESTABLISHED tcp spt:8104
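For completeness, the routing-mesh path for the published port can also be checked in the nat table and in the hidden ingress namespace, where traffic for the port gets marked and IPVS balances it to the tasks. A sketch of those checks, assuming the DOCKER-INGRESS chain and ingress_sbox netns names swarm normally creates, and that ipvsadm is installed:
# DNAT from the published port into the ingress network
iptables -t nat -nvL DOCKER-INGRESS | grep 8104
# inside the hidden ingress load-balancer namespace: the mangle mark and the IPVS service
ls /var/run/docker/netns
nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -nL PREROUTING
nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -Ln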
And yet, if I run curl localhost:8104 on this host, it hangs, even after restarting the Docker daemon.
If I restart networking and then the Docker daemon, the daemon starts forwarding requests from the external port to the overlay network port again.
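Concretely, the workaround is along these lines (assuming the stock systemd units; "networking" here is the ifupdown networking.service):
# restore forwarding on an affected host: networking first, then the daemon
systemctl restart networking
systemctl restart docker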
This is running on Ubuntu 16.04 in AWS.
I saved the iptables rules and IP configuration before and after; if they can help, I can attach them here.
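The snapshots amount to roughly the following; the exact commands matter less than having a before/after diff:
# dump firewall and addressing state for later comparison
iptables-save > /tmp/iptables-before.txt
ip addr show > /tmp/ip-addr-before.txt
ip route show > /tmp/ip-route-before.txt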
dmesg shows:
[84558.441965] docker_gwbridge: port 2(vethe690172) entered disabled state
[84558.492357] veth486fd0c: renamed from veth0
[84558.566632] vetha0a1e33: renamed from eth0
[84558.616616] veth1cf86d4: renamed from eth0
[84558.654359] docker_gwbridge: port 4(vethd8418f4) entered disabled state
[84558.654536] veth0567f9c: renamed from eth1
[84558.684552] docker_gwbridge: port 4(vethd8418f4) entered disabled state
[84558.689344] device vethd8418f4 left promiscuous mode
[84558.689348] docker_gwbridge: port 4(vethd8418f4) entered disabled state
[84558.736830] docker_gwbridge: port 1(vethb263372) entered disabled state
[84558.736871] vethf3b536e: renamed from eth1
[84558.805134] docker_gwbridge: port 1(vethb263372) entered disabled state
[84558.808242] device vethb263372 left promiscuous mode
[84558.808245] docker_gwbridge: port 1(vethb263372) entered disabled state
[84567.699638] IPVS: Creating netns size=2192 id=11
[84567.761658] IPVS: Creating netns size=2192 id=12
[84567.798329] br0: renamed from ov-001000-fgsk9
[84567.820357] vxlan0: renamed from vx-001000-fgsk9
[84567.836310] device vxlan0 entered promiscuous mode
[84567.836462] br0: port 1(vxlan0) entered forwarding state
[84567.836468] br0: port 1(vxlan0) entered forwarding state
[84567.885062] veth0: renamed from vethd47eb8b
[84567.900320] device veth0 entered promiscuous mode
[84567.900441] br0: port 2(veth0) entered forwarding state
[84567.900445] br0: port 2(veth0) entered forwarding state
[84567.980442] eth0: renamed from veth7b82c66
[84568.018716] device veth14fc3a1 entered promiscuous mode
[84568.018770] IPv6: ADDRCONF(NETDEV_UP): veth14fc3a1: link is not ready
[84568.018775] docker_gwbridge: port 1(veth14fc3a1) entered forwarding state
[84568.018781] docker_gwbridge: port 1(veth14fc3a1) entered forwarding state
[84568.018820] docker_gwbridge: port 1(veth14fc3a1) entered disabled state
[84568.032556] eth1: renamed from vethfcbc975
[84568.044463] IPv6: ADDRCONF(NETDEV_CHANGE): veth14fc3a1: link becomes ready
[84568.044491] docker_gwbridge: port 1(veth14fc3a1) entered forwarding state
[84568.044497] docker_gwbridge: port 1(veth14fc3a1) entered forwarding state
[84582.876068] br0: port 1(vxlan0) entered forwarding state
[84582.940065] br0: port 2(veth0) entered forwarding state
[84583.068076] docker_gwbridge: port 1(veth14fc3a1) entered forwarding state
[84594.298784] IPVS: Creating netns size=2192 id=13
[84594.340829] IPVS: Creating netns size=2192 id=14
[84594.382132] br0: renamed from ov-001001-6upju
[84594.412418] vxlan0: renamed from vx-001001-6upju
[84594.428327] device vxlan0 entered promiscuous mode
[84594.428496] br0: port 1(vxlan0) entered forwarding state
[84594.428501] br0: port 1(vxlan0) entered forwarding state
[84594.480585] veth0: renamed from vethc3a2eff
[84594.500273] device veth0 entered promiscuous mode
[84594.500403] br0: port 2(veth0) entered forwarding state
[84594.500408] br0: port 2(veth0) entered forwarding state
[84594.576362] eth0: renamed from veth8ee96c8
[84594.632612] device veth07bd5bb entered promiscuous mode
[84594.632677] IPv6: ADDRCONF(NETDEV_UP): veth07bd5bb: link is not ready
[84594.632682] docker_gwbridge: port 2(veth07bd5bb) entered forwarding state
[84594.632689] docker_gwbridge: port 2(veth07bd5bb) entered forwarding state
I found an old issue, https://github.com/moby/moby/issues/20716, that describes similar behaviour. https://github.com/moby/moby/issues/35807 is also similar.
Also, why did the Docker veth leave promiscuous mode and enter a disabled state? I don't see anything about it in the daemon logs.
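If it helps narrow this down, the per-port state of docker_gwbridge can be inspected directly the next time a host gets into this state (bridge and ip are from iproute2; brctl, if installed, gives the same view):
# which veths are attached to the gateway bridge and whether they are forwarding or disabled
bridge link show | grep docker_gwbridge
ip -d link show docker_gwbridge
brctl show docker_gwbridge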
About this issue
- Original URL
- State: open
- Created 7 years ago
- Reactions: 11
- Comments: 24 (2 by maintainers)
We are using 18.02 and see this same error on all nodes after some time.
Still happens with 20.10.2 almost daily. We are using Swarm. Maybe that gives a hint.
Any update or workaround?
Any updates on this? I am having a similar issue with Ubuntu 20.04 and Docker 20.10.17 with Swarm 😔
Same problem on Ubuntu 20.04 and Docker 20.10.15 with Swarm. When restarting/redeploying services, most of the time everything works fine but randomly (maybe 1 out of 10 times) our web app stops getting requests. We checked firewall settings, DNS caches, log files, port mappings and everything is normal.
This is happening on CentOS 9 with the latest Docker version. It has been happening every day at the same time. Need a fix…
The issue is reproducible on Docker Engine 19.03.6 with Ubuntu 16.04.6 LTS. It definitely seems like a Docker issue: the port is open, there is no firewall, and
sudo netstat -ntlp
shows docker-proxy bound to all addresses for that port for both tcp4 and tcp6, etc. We noticed that on the first deploy it works fine; it's only on subsequent docker-compose up -d runs with many services (as noted above) that we run into the issue. Worst of all, it's intermittent: some containers are fine and some hang on curl.
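A check that might narrow this down is to bypass the docker-proxy / NAT path and hit the container directly on its bridge IP (container name and ports are placeholders):
# the published port, which intermittently hangs
curl -m 5 http://127.0.0.1:<published-port>/
# find the container's network IPs and hit the internal port directly
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <container-name>
curl -m 5 http://<container-ip>:<internal-port>/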
Same problem with 19.03.02 after starting a larger number of new services in a short time frame (around 40 services within 1 minute). Unfortunately, there is no more information available at the moment for reproducing the issue.