moby: Windows Server 2019 host in hybrid Swarm cluster breaks routing mesh traffic to Linux services

Description

Swarm routing mesh stops working (requests time out or get no response) when traffic to a Linux service goes through a Windows Server 2019 machine in a hybrid cluster. The same setup works as expected with Windows Server 2016 1803.

Steps to reproduce the issue:

  1. Create a Linux and a Windows Server 2019 host on the same subnet:
    1. Ubuntu 18.04 host, update, disable firewall, install Docker.
    2. Windows Server 2019 host, update, disable firewall, install Docker.
  2. On the Linux host, initialize a new swarm with: sudo docker swarm init
  3. On the Windows host, join the swarm.
  4. On the Linux host, create a service: sudo docker service create --name test0 -p 9000:80 nginx
  5. From another host, send continuous requests to the service through the Linux host, for example: while($true) { curl -I http://linux:9000; Get-Date; sleep -Seconds 1 }. Verify that you get HTTP 200.
  6. From another host, then send a single request to the service through the Windows host: curl -Isv http://windows:9000
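
On a client without PowerShell, the polling in step 5 can be approximated with a small POSIX shell helper. This is only a sketch under stated assumptions: probe_once and probe_loop are hypothetical names, the 5-second timeout is an arbitrary choice, and linux is the placeholder host name used in the steps above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report).
# Sends one HEAD request and prints "<UTC timestamp> <HTTP status>".
# curl reports status 000 when no response arrives within the time limit.
probe_once() {
    url=$1
    limit=${2:-5}   # assumed timeout in seconds; adjust as needed
    code=$(curl -s -o /dev/null -I --max-time "$limit" -w '%{http_code}' "$url")
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${code:-000}"
}

# Step 5: poll the published port roughly once per second until interrupted.
probe_loop() {
    while :; do
        probe_once "$1"
        sleep 1
    done
}
```

Running probe_loop http://linux:9000 in one terminal and the single request from step 6 in another should make the stall visible as a run of 000 lines, assuming the behaviour described in this report.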

Log files from both hosts (with debugging enabled) can be found here:

Describe the results you received:

The continuous requests start to hang and time out, and keep doing so until a while after the single request has also timed out. They recover after 20-30 seconds and return HTTP 200 again.
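
To put a number on the stall, a saved probe log can be summarized with a short awk wrapper. This is a sketch, not something from the original report: outage_summary is a hypothetical name, and it assumes each log line has the form "<timestamp> <HTTP status>", with any status other than 200 counted as a failure.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report).
# Reads lines of the form "<timestamp> <status>" from files or stdin and
# prints the total number of failed probes plus the longest run of
# consecutive failures with its first and last timestamps.
outage_summary() {
    awk '
        $2 != "200" {
            fails++
            run++
            if (run == 1) start = $1
            if (run > longest) { longest = run; from = start; to = $1 }
            next
        }
        { run = 0 }   # a successful probe ends the current failure run
        END {
            printf "failed=%d longest_run=%d from=%s to=%s\n", fails, longest, from, to
        }
    ' "$@"
}
```

For example, outage_summary probe.log (probe.log being an assumed file name) prints a single summary line. The from and to timestamps only bound the stall approximately, since each failed probe also waits for its own timeout.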

Describe the results you expected:

Requests through any host in the cluster should finish without hangs or timeouts, as they do when the Windows host runs Windows Server 2016 1803.

Additional information you deem important (e.g. issue happens only occasionally):

This used to work using Windows Server 2016 1803. I have multiple hybrid Docker Swarm clusters running in production with no issues at all.

Other observations:

  1. It does not matter whether the Linux host runs Ubuntu 16, Ubuntu 18 or Debian 9.
  2. It does not matter if the Linux host is the worker and the Windows Server 2019 host is the manager.
  3. The same test with either two Linux hosts or two Windows Server 2019 hosts works fine.
  4. The issue persists after downgrading the Docker engine on the Windows host to 18.03.1-ee-2.
  5. The Windows Server 2019 SKU does not matter: reproduced on Datacenter (full and core) and Standard (full) editions.

Linux: Output of docker version:

Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        4d60db4
 Built:             Wed Nov  7 00:49:01 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:16:44 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Linux: Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 18.09.0
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: 6m8os6plt3f23t24ngxegaqgx
 Is Manager: true
 ClusterID: jkw54ivjwqbmm6tmkk4h0p04u
 Managers: 1
 Nodes: 2
 Default Address Pool: 10.0.0.0/8
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 172.17.1.10
 Manager Addresses:
  172.17.1.10:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: c4446665cb9c30056f4998ed953e6d4ff22c7c39
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-1036-azure
Operating System: Ubuntu 18.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.853GiB
Name: dc2node-nvm0
ID: BYOD:7LOK:KOXR:W4QQ:UPDG:YCNZ:H5EN:RS74:AKRI:F5H4:2NTX:FKWC
Docker Root Dir: /datadisk
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 49
 Goroutines: 174
 System Time: 2019-01-03T12:24:38.998562887Z
 EventsListeners: 1
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 registry.valtech.dk
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Windows: Output of docker version:

Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.3
 Git commit:        33a45cd0a2
 Built:             unknown-buildtime
 OS/Arch:           windows/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.24)
  Go version:       go1.10.3
  Git commit:       33a45cd0a2
  Built:            11/07/2018 00:24:12
  OS/Arch:          windows/amd64
  Experimental:     false

Windows: Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 18.09.0
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: ics l2bridge l2tunnel nat null overlay transparent
 Log: awslogs etwlogs fluentd gelf json-file local logentries splunk syslog
Swarm: active
 NodeID: ych43rb08jsey2s7n4p738t12
 Is Manager: false
 Node Address: 172.17.1.20
 Manager Addresses:
  172.17.1.10:2377
Default Isolation: process
Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
Operating System: Windows Server 2019 Datacenter Version 1809 (OS Build 17763.195)
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 8GiB
Name: dc2node-wvm0
ID: QKH5:ZHTV:NP2P:TTNM:LV22:35MZ:NXPZ:CPQL:FN2L:ZJ2V:S22F:AMRS
Docker Root Dir: F:\docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: -1
 Goroutines: 80
 System Time: 2019-01-03T12:26:18.2309755Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: this node is not a swarm manager - check license status on a manager node

Additional environment details (AWS, VirtualBox, physical, etc.):

Reproduced on VMs running on Azure and also on VMs running in Hyper-V on my workstation.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (9 by maintainers)

Most upvoted comments

@pbering Thank you for the concise repro steps. We were able to repro and fix the issue. It should be in the next patch. Will also add more tests to catch this.

It has not been released yet. My understanding is that it will be next week.

Sorry for the late response, I was trying to get the KB number. It will be released as a patch. I haven’t found the KB number yet.

Investigating