moby: Windows Server 2019 host in hybrid Swarm cluster breaks routing mesh traffic to Linux services
Description
Swarm routing mesh stops working (timeouts / not responding) if traffic to a Linux service goes through a Windows Server 2019 machine in a hybrid cluster, works as expected with Windows Server 2016 1803.
Steps to reproduce the issue:
- Create a Linux and a Windows Server 2019 host on the same subnet:
  - Ubuntu 18.04 host: update, disable firewall, install Docker.
  - Windows Server 2019 host: update, disable firewall, install Docker.
- On the `linux` host, initialize a new swarm: `sudo docker swarm init`
- On the `windows` host, join the swarm.
- On the `linux` host, create a service: `sudo docker service create --name test0 -p 9000:80 nginx`
- From another host, continuously send requests to the service via the `linux` host, for example: `while($true) { curl -I http://linux:9000; Get-Date; sleep -Seconds 1 }`. Verify that you get HTTP 200.
- From another host, then send a single request to the service via the `windows` host: `curl -Isv http://windows:9000`
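The continuous probe above is a PowerShell loop; a rough Python equivalent can be useful on observer hosts without PowerShell. This is a sketch under the repro's assumptions: `linux` is the placeholder hostname from the steps above (substitute the node's real address), and the `probe()` return values are my own convention, not anything Docker emits.

```python
# Rough Python equivalent of the PowerShell probe loop from the repro steps.
# "linux" is the placeholder hostname from the steps; substitute the Linux
# node's actual address before running.
import socket
import time
import urllib.error
import urllib.request

def probe(url, timeout=5.0):
    """Send one request to the published port; return the HTTP status code,
    'TIMEOUT' if the request hangs past the deadline, or 'ERROR' on any
    other failure (e.g. connection refused)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except socket.timeout:
        return "TIMEOUT"
    except (urllib.error.URLError, OSError):
        return "ERROR"

if __name__ == "__main__":
    # Continuous probe, one request per second (Ctrl-C to stop).
    # Healthy output is a steady stream of 200s; the reported bug shows up
    # as a run of TIMEOUT lines after the single request via the Windows node.
    while True:
        print(time.strftime("%H:%M:%S"), probe("http://linux:9000/"))
        time.sleep(1)
```

The 5-second timeout is arbitrary; anything shorter than the 20-30 second recovery window described below will make the hang visible.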
Log files from both hosts (with debugging enabled) can be found here:
- Linux: https://gist.github.com/pbering/bae7d2cb2d5503f62e8e230058e44917
- Windows: https://gist.github.com/pbering/fac03797c4205691fd8ce34075cf7136
Describe the results you received:
You will see that the continuous requests start to hang and time out, and a while later the single request also times out. The continuous requests recover after 20-30 seconds and once again return HTTP 200.
Describe the results you expected:
Requests through any host in the cluster should complete without hangs or timeouts, as they do when the `windows` host is running Windows Server 2016 1803.
Additional information you deem important (e.g. issue happens only occasionally):
This used to work with Windows Server 2016 1803. I have multiple hybrid Docker Swarm clusters running in production with no issues at all.
Other observations:
- Does not matter if the `linux` host is Ubuntu 16, 18 or Debian 9.
- Does not matter if the `linux` host is a worker and the `windows` Server 2019 host is a manager.
- Doing the same on either two Linux hosts or two Windows Server 2019 hosts works fine.
- Same issue when downgrading the Docker engine on the `windows` host to 18.03.1-ee-2.
- Windows Server 2019 SKU does not matter; reproduced on Datacenter (full and core) and Standard (full) editions.
Linux: Output of `docker version`:
Client:
Version: 18.09.0
API version: 1.39
Go version: go1.10.4
Git commit: 4d60db4
Built: Wed Nov 7 00:49:01 2018
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.0
API version: 1.39 (minimum version 1.12)
Go version: go1.10.4
Git commit: 4d60db4
Built: Wed Nov 7 00:16:44 2018
OS/Arch: linux/amd64
Experimental: false
Linux: Output of `docker info`:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 1
Server Version: 18.09.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: 6m8os6plt3f23t24ngxegaqgx
Is Manager: true
ClusterID: jkw54ivjwqbmm6tmkk4h0p04u
Managers: 1
Nodes: 2
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 172.17.1.10
Manager Addresses:
172.17.1.10:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: c4446665cb9c30056f4998ed953e6d4ff22c7c39
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.15.0-1036-azure
Operating System: Ubuntu 18.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.853GiB
Name: dc2node-nvm0
ID: BYOD:7LOK:KOXR:W4QQ:UPDG:YCNZ:H5EN:RS74:AKRI:F5H4:2NTX:FKWC
Docker Root Dir: /datadisk
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 49
Goroutines: 174
System Time: 2019-01-03T12:24:38.998562887Z
EventsListeners: 1
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
registry.valtech.dk
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
WARNING: No swap limit support
Windows: Output of `docker version`:
Client:
Version: 18.09.0
API version: 1.39
Go version: go1.10.3
Git commit: 33a45cd0a2
Built: unknown-buildtime
OS/Arch: windows/amd64
Experimental: false
Server:
Engine:
Version: 18.09.0
API version: 1.39 (minimum version 1.24)
Go version: go1.10.3
Git commit: 33a45cd0a2
Built: 11/07/2018 00:24:12
OS/Arch: windows/amd64
Experimental: false
Windows: Output of `docker info`:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 18.09.0
Storage Driver: windowsfilter
Windows:
Logging Driver: json-file
Plugins:
Volume: local
Network: ics l2bridge l2tunnel nat null overlay transparent
Log: awslogs etwlogs fluentd gelf json-file local logentries splunk syslog
Swarm: active
NodeID: ych43rb08jsey2s7n4p738t12
Is Manager: false
Node Address: 172.17.1.20
Manager Addresses:
172.17.1.10:2377
Default Isolation: process
Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
Operating System: Windows Server 2019 Datacenter Version 1809 (OS Build 17763.195)
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 8GiB
Name: dc2node-wvm0
ID: QKH5:ZHTV:NP2P:TTNM:LV22:35MZ:NXPZ:CPQL:FN2L:ZJ2V:S22F:AMRS
Docker Root Dir: F:\docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: -1
Goroutines: 80
System Time: 2019-01-03T12:26:18.2309755Z
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: this node is not a swarm manager - check license status on a manager node
Additional environment details (AWS, VirtualBox, physical, etc.):
Reproduced on VMs running in Azure, and also on VMs running in Hyper-V on my workstation.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 16 (9 by maintainers)
@pbering Thank you for the concise repro steps. We were able to repro and fix the issue. It should be in the next patch. Will also add more tests to catch this.
It has not yet been released. My understanding is next week.
Sorry for the late response, I was trying to get the KB number. It will be released as a patch. I haven’t found the KB number yet.
Investigating