rancher: Rancher 2.4.2 - Windows Cluster using flannel host-gw not working

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (least amount of steps as possible):

  1. Create a Windows Cluster using flannel / host-gw

Result: The pod cattle-node-agent-windows continuously crashes:

WARN: Default docker named pipe is not found
WARN: Please bind mount in the docker named pipe to //./pipe/docker_engine if docker errors occur
WARN: example: docker run -v //./pipe/custom_docker_named_pipe://./pipe/docker_engine ...
FATA: https://rancher.xxx.com is not accessible

Other details that may be helpful: exec’ing into the pod before it crashes and running curl results in timeouts for all addresses. curl from the kubelet and/or other pods works fine.
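
The curl timeouts described above can be approximated with a plain TCP probe, which makes it easier to compare reachability from different places (the failing pod, another pod, the host). This is a minimal sketch, not part of the original report: the function name `can_connect`, the default port 443, and the 5-second timeout are illustrative assumptions, and it presumes a Python interpreter is available wherever it is run.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; port and timeout are assumptions.
    """
    try:
        # create_connection resolves the name and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

Calling `can_connect("rancher.xxx.com")` from inside the failing pod and again from a working pod would show whether the difference is at the TCP level rather than specific to curl.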

Setting up a cluster using flannel / vxlan works perfectly fine.

Environment information

  • Rancher version (rancher/rancher/rancher/server image tag or shown bottom left in the UI): 2.4.2
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): Bare metal, 2 CPU cores, 8 GB RAM
  • Kubernetes version (use kubectl version):
(Switched to 1.15.11 to see if it'll work on that version. Have also tried 1.16.9)

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.11", GitCommit:"d94a81c724ea8e1ccc9002d89b7fe81d58f89ede", GitTreeState:"clean", BuildDate:"2020-03-12T21:00:06Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

  • Docker version (use docker version):
(This is on the windows node)

Client: Docker Engine - Enterprise
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        2ee0c57608
 Built:             11/13/2019 08:00:16
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Enterprise
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.24)
  Go version:       go1.12.12
  Git commit:       2ee0c57608
  Built:            11/13/2019 07:58:51
  OS/Arch:          windows/amd64
  Experimental:     false

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 32

Most upvoted comments

This is also stated by a Rancher employee:

https://forums.rancher.com/t/mixed-linux-windows-workers/15699/9

The Linux machines in a Windows cluster are required to run etcd and various other system services (for which no Windows build exists). Otherwise from the user perspective we treat it as a “Windows-only” cluster.

So yes, while you can use tolerations to bypass it, you are not meant to do that.

This sounds ridiculous. There has been no response as to why this is, and my company is not planning on following it.

@maxisam what I’ve discovered is that it will work if your master nodes are on a different VLAN. Can’t tell you why though.

@maxisam We are close. We have 3 clusters stood up and are planning to do a load test over the next week or two to prove out that Kubernetes and Windows containers can handle our load.

We are running 1809 in all our clusters.

17763.1457 is the cluster that is broken due to the Windows update. 17763.1282 is the cluster that is working.

VMware Tools v11.1.0
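
To check where a given node sits relative to the two builds reported above, the build strings can be compared numerically. This is only a sketch built on the two data points from this thread (17763.1282 working, 17763.1457 broken); the helper names `parse_build` and `classify` are hypothetical, and nothing here says which update between those revisions introduced the problem.

```python
from typing import Tuple

# The two builds reported in this thread (Windows Server 1809, build 17763).
KNOWN_WORKING = "17763.1282"
KNOWN_BROKEN = "17763.1457"


def parse_build(version: str) -> Tuple[int, int]:
    """Split a 'build.revision' string such as '17763.1457' into integers."""
    build, revision = version.strip().split(".")
    return int(build), int(revision)


def classify(version: str) -> str:
    """Describe where a build sits relative to the two reported builds."""
    current = parse_build(version)
    if current <= parse_build(KNOWN_WORKING):
        return "at or below the reported working build"
    if current >= parse_build(KNOWN_BROKEN):
        return "at or above the reported broken build"
    # Assumption: revisions in between were not reported either way.
    return "between the two reported builds (untested)"
```

For example, `classify("17763.1457")` returns `"at or above the reported broken build"`. The result describes position only; it does not confirm that a node will or will not show the connectivity failure.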

The issue is most likely a Windows update. That is what I found. Everything was working until the latest Windows updates were applied. Then the Windows agent started crashing. Uninstalling the Windows update fixed the issue.

Here are two updates that apply the same “fix” but end up breaking connectivity:

https://support.microsoft.com/en-us/help/4571748/windows-10-update-kb4571748
https://support.microsoft.com/en-us/help/4570333/windows-10-update-kb4570333

Hi @AMoghrabi

Have you been able to troubleshoot this problem? I’m having a similar problem using Flannel/VXLAN; my setup uses a Windows host on VMware.

Best,

Giang

Get-HnsNetwork | Select-Object -Property Name

Name
----
cbr0
nat