rancher: Monitoring does not work for Windows Server Core worker nodes

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (fewest steps possible):

  • Create a new Windows Server Core VM
  • Add the Windows Server Core VM to Rancher as a worker node
  • Enable cluster monitoring (or have it already enabled in the cluster prior to joining)

Result: the exporter-node-binary-copy container fails to deploy with "encountered an error during Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)". This prevents the exporter-node-windows-cluster-monitoring workload from finishing deployment on the node.

Other details that may be helpful:

The same containers deploy and work without issue on Windows Server with Desktop Experience (i.e., standard Windows Server); the problem appears isolated to Windows Server Core.

W0713 15:16:42.243971 7340 docker_container.go:238] Cannot create symbolic link because container log file doesn't exist!

E0713 15:16:42.243971 7340 remote_runtime.go:222] StartContainer "118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693" from runtime service failed: rpc error: code = Unknown desc = failed to start container "118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693": Error response from daemon: container 118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693 encountered an error during Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)

E0713 15:16:42.245009 7340 kuberuntime_manager.go:801] init container start failed: RunContainerError: failed to start container "118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693": Error response from daemon: container 118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693 encountered an error during Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)

E0713 15:16:42.245009 7340 pod_workers.go:191] Error syncing pod 52aacfee-2aab-4321-aad6-96ca0feaffc9 ("exporter-node-windows-cluster-monitoring-7lc6l_cattle-prometheus(52aacfee-2aab-4321-aad6-96ca0feaffc9)"), skipping: failed to "StartContainer" for "exporter-node-binary-copy" with RunContainerError: "failed to start container \"118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693\": Error response from daemon: container 118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693 encountered an error during Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)"

StartTime:2020-07-10 15:23:27 +0000 GMT,ContainerStatuses:[]ContainerStatus{ContainerStatus{Name:exporter-node,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:PodInitializing,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:rancher/wmi_exporter-package:v0.0.3,ImageID:,ContainerID:,Started:*false,},},QOSClass:BestEffort,InitContainerStatuses:[]ContainerStatus{ContainerStatus{Name:exporter-node-binary-copy,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:CrashLoopBackOff,Message:back-off 5m0s restarting failed container=exporter-node-binary-copy pod=exporter-node-windows-cluster-monitoring-7lc6l_cattle-prometheus(52aacfee-2aab-4321-aad6-96ca0feaffc9),},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:&ContainerStateTerminated{ExitCode:128,Signal:0,Reason:ContainerCannotRun,Message:container 784f0fb958deae0ac7cff7e905d356c8663298c83c6843145b8d5e50c0194751 encountered an error during Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106),StartedAt:2020-07-13 15:11:32 +0000 GMT,FinishedAt:2020-07-13 15:11:32 +0000 GMT,ContainerID:docker://784f0fb958deae0ac7cff7e905d356c8663298c83c6843145b8d5e50c0194751,},},Ready:false,RestartCount:846,Image:rancher/wmi_exporter-package:v0.0.3,ImageID:docker-pullable://rancher/wmi_exporter-package@sha256:556b1cf82783af8c0ec298315d48bf82cf6fb1864ffdf8736477a72d67af2be4,ContainerID:docker://784f0fb958deae0ac7cff7e905d356c8663298c83c6843145b8d5e50c0194751,Started:nil,},},NominatedNodeName:,PodIPs:[]PodIP{PodIP{IP:10.42.4.3,},},EphemeralContainerStatuses:[]ContainerStatus{},},} container id: 118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693

One other item to note: the service-sidekick container hits a goroutine panic while running on a Windows Server Core VM, though it reports that it recovers. This behavior is not seen on a standard Windows Server node.

I0710 15:23:40.415037    5272 device_windows.go:124] Created HostComputeNetwork vxlan0
panic: reflect: Field index out of range [recovered]
        panic: reflect: Field index out of range

goroutine 1 [running]:
encoding/json.(*encodeState).marshal.func1(0xc0001db540)
        /usr/local/go/src/encoding/json/encode.go:301 +0xa1
panic(0x121d7c0, 0x157a2b0)
        /usr/local/go/src/runtime/panic.go:513 +0x1c7
reflect.Value.Field(0x130ff60, 0xc00032b3e0, 0x99, 0x2d383632342d3032, 0xc0002e9e30, 0xc0000c82c0, 0x12ee720)
        /usr/local/go/src/reflect/value.go:816 +0x148
encoding/json.fieldByIndex(0x130ff60, 0xc00032b3e0, 0x99, 0xc00034a6d0, 0x1, 0x1, 0x12ee720, 0xc00032b3e0, 0x99)
        /usr/local/go/src/encoding/json/encode.go:842 +0x65
encoding/json.(*structEncoder).encode(0xc0002e9e00, 0xc0000c82c0, 0x130ff60, 0xc00032b3e0, 0x99, 0x1300100)
        /usr/local/go/src/encoding/json/encode.go:635 +0x149
encoding/json.(*structEncoder).encode-fm(0xc0000c82c0, 0x130ff60, 0xc00032b3e0, 0x99, 0xc0001d0100)
        /usr/local/go/src/encoding/json/encode.go:661 +0x6b
encoding/json.(*encodeState).reflectValue(0xc0000c82c0, 0x130ff60, 0xc00032b3e0, 0x99, 0x1fd0100)
        /usr/local/go/src/encoding/json/encode.go:333 +0x89
encoding/json.(*encodeState).marshal(0xc0000c82c0, 0x130ff60, 0xc00032b3e0, 0x100, 0x0, 0x0)
        /usr/local/go/src/encoding/json/encode.go:305 +0xfb
encoding/json.Marshal(0x130ff60, 0xc00032b3e0, 0x130ff60, 0xc00032b3e0, 0x0, 0x0, 0x0)
        /usr/local/go/src/encoding/json/encode.go:160 +0x59
github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn.modifyNetwork(0xc0004c6de0, 0x24, 0xc000514460, 0x6a, 0xc000514460, 0x6a, 0x0)
        /go/src/github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn/hcnnetwork.go:224 +0x1d1
github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn.(*HostComputeNetwork).ModifyNetworkSettings(0xc0003de1e0, 0xc000069d00, 0xc00002e540, 0x31)
        /go/src/github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn/hcnnetwork.go:356 +0x13c
github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn.(*HostComputeNetwork).AddPolicy(0xc0003de1e0, 0xc00032b1d0, 0x1, 0x1, 0x2, 0x8)
        /go/src/github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn/hcnnetwork.go:377 +0x1b9
github.com/coreos/flannel/backend/vxlan.ensureLink(0xc000125480, 0xc000125480, 0x130fc60, 0x1)
        /go/src/github.com/coreos/flannel/backend/vxlan/device_windows.go:147 +0x348
github.com/coreos/flannel/backend/vxlan.newVXLANDevice(0xc0002e9b30, 0xc0002e9b30, 0xc00023a4c0, 0xc0001baf60)
        /go/src/github.com/coreos/flannel/backend/vxlan/device_windows.go:47 +0x7a
github.com/coreos/flannel/backend/vxlan.(*VXLANBackend).RegisterNetwork(0xc0006310a0, 0x2330138, 0xc00023a4c0, 0x100000000, 0xc000000000, 0xc0000ae640, 0x0, 0xc00009c082, 0xf, 0xc000663698)
        /go/src/github.com/coreos/flannel/backend/vxlan/vxlan_windows.go:149 +0x514
main.main()
        /go/src/github.com/coreos/flannel/main.go:289 +0x735

More logs available on request.

Cluster information

  • Cluster type: Hosted Rancher
  • Machine type: AWS t3a.xlarge
  • AMI: Windows_Server-1909-English-Core-Base-2020.06.10 (ami-0e9662f7dc78f5a18)
  • OS: Windows Server Datacenter 10.0.18363.900
  • Rancher: v2.4.5
  • K8s: v1.18.3
  • Docker: version 19.03.5, build 2ee0c57608

gz#11061 gz#13205

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (12 by maintainers)

Most upvoted comments

@SheilaghM We have validated monitoring on Windows as described here, using Windows_Server-1909-English-Core-ContainersLatest-2021.01.13. Monitoring v1 works on 1903/1809. It also works on 1909 using GCE nodes. This issue seems specific to AWS AMIs.

On 2.5.5, using the GCP image windows-server-1909-dc-core-for-containers-v20210112:

  • Deployed a Windows 1909 cluster: 1 etcd/control plane/worker Linux node, 2 Linux worker nodes, and 3 Windows worker nodes.
  • Enabled monitoring v1 (0.2.0); monitoring came up successfully.
  • The exporter-node-windows-cluster-monitoring workload deployed successfully on the Windows nodes.
  • Metrics from the Windows nodes are available.

On 2.5.4-rc7:

  • Deployed the 1909 Windows Server Core version.
  • Monitoring v1 fails to deploy.
  • Error on the exporter-node-windows-cluster-monitoring pod: Error: failed to start container "exporter-node-binary-copy": Error response from daemon: container exporter-node-binary-copy encountered an error during hcsshim::System::Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)

@aiyengar2

Did the cluster that you were working with have Windows support already enabled via Rancher (on cluster create)?

Yes, it did. Seeing as the cluster is long gone at this point, I’m spinning up a new 2.4.11 mixed cluster with k8s v1.18.12 to validate the level of functionality of Windows 1909 Server Core. I’ll update here with my results.