rancher: Monitoring does not work for Windows Server Core worker nodes
What kind of request is this (question/bug/enhancement/feature request): bug
Steps to reproduce (least amount of steps as possible):
- Create new Windows Server Core VM
- Add the windows server core VM to Rancher as a worker node
- enable cluster monitoring (or have it already enabled in the cluster prior to joining)
Result:
The exporter-node-binary-copy container fails to deploy with "encountered an error during Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)". This prevents the exporter-node-windows-cluster-monitoring workload from finishing deployment on the node.
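If it helps with reproducing, here is a minimal client-go sketch for printing the init container status where the 0xc0370106 failure shows up. This is not Rancher tooling: the kubeconfig path and output format are assumptions; only the cattle-prometheus namespace and the pod name prefix come from this report.

```go
// inspect_exporter.go - hypothetical diagnostic helper, not part of Rancher.
// Lists the exporter-node-windows-cluster-monitoring pods in cattle-prometheus
// and prints their init container states. Assumes a kubeconfig for the
// downstream cluster at ./kubeconfig.yaml (adjust as needed).
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "./kubeconfig.yaml")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := clientset.CoreV1().Pods("cattle-prometheus").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if !strings.HasPrefix(pod.Name, "exporter-node-windows-cluster-monitoring") {
			continue
		}
		fmt.Printf("%s on node %s\n", pod.Name, pod.Spec.NodeName)
		for _, cs := range pod.Status.InitContainerStatuses {
			fmt.Printf("  init container %s: restarts=%d\n", cs.Name, cs.RestartCount)
			if cs.State.Waiting != nil {
				fmt.Printf("    waiting: %s %s\n", cs.State.Waiting.Reason, cs.State.Waiting.Message)
			}
			if cs.LastTerminationState.Terminated != nil {
				t := cs.LastTerminationState.Terminated
				fmt.Printf("    last exit: code=%d reason=%s message=%s\n", t.ExitCode, t.Reason, t.Message)
			}
		}
	}
}
```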
Other details that may be helpful:
The same containers deploy and work without issue on Windows Server with Desktop Experience (i.e., standard Windows Server); the problem appears isolated to Windows Server Core.
W0713 15:16:42.243971 7340 docker_container.go:238] Cannot create symbolic link because container log file doesn't exist!
E0713 15:16:42.243971 7340 remote_runtime.go:222] StartContainer "118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693" from runtime service failed: rpc error: code = Unknown desc = failed to start container "118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693": Error response from daemon: container 118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693 encountered an error during Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)
E0713 15:16:42.245009 7340 kuberuntime_manager.go:801] init container start failed: RunContainerError: failed to start container "118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693": Error response from daemon: container 118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693 encountered an error during Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)
E0713 15:16:42.245009 7340 pod_workers.go:191] Error syncing pod 52aacfee-2aab-4321-aad6-96ca0feaffc9 ("exporter-node-windows-cluster-monitoring-7lc6l_cattle-prometheus(52aacfee-2aab-4321-aad6-96ca0feaffc9)"), skipping: failed to "StartContainer" for "exporter-node-binary-copy" with RunContainerError: "failed to start container "118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693": Error response from daemon: container 118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693 encountered an error during Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)"
StartTime:2020-07-10 15:23:27 +0000 GMT,ContainerStatuses:[]ContainerStatus{ContainerStatus{Name:exporter-node,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:PodInitializing,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:rancher/wmi_exporter-package:v0.0.3,ImageID:,ContainerID:,Started:*false,},},QOSClass:BestEffort,InitContainerStatuses:[]ContainerStatus{ContainerStatus{Name:exporter-node-binary-copy,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:CrashLoopBackOff,Message:back-off 5m0s restarting failed container=exporter-node-binary-copy pod=exporter-node-windows-cluster-monitoring-7lc6l_cattle-prometheus(52aacfee-2aab-4321-aad6-96ca0feaffc9),},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:&ContainerStateTerminated{ExitCode:128,Signal:0,Reason:ContainerCannotRun,Message:container 784f0fb958deae0ac7cff7e905d356c8663298c83c6843145b8d5e50c0194751 encountered an error during Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106),StartedAt:2020-07-13 15:11:32 +0000 GMT,FinishedAt:2020-07-13 15:11:32 +0000 GMT,ContainerID:docker://784f0fb958deae0ac7cff7e905d356c8663298c83c6843145b8d5e50c0194751,},},Ready:false,RestartCount:846,Image:rancher/wmi_exporter-package:v0.0.3,ImageID:docker-pullable://rancher/wmi_exporter-package@sha256:556b1cf82783af8c0ec298315d48bf82cf6fb1864ffdf8736477a72d67af2be4,ContainerID:docker://784f0fb958deae0ac7cff7e905d356c8663298c83c6843145b8d5e50c0194751,Started:nil,},},NominatedNodeName:,PodIPs:[]PodIP{PodIP{IP:10.42.4.3,},},EphemeralContainerStatuses:[]ContainerStatus{},},} container id: 118f9659f56343cba90557e25f19a5ee47c15ab524b5156f143505d6107b8693
One other item to note: service-sidekick hits a goroutine panic on a Windows Server Core VM while running, though it reports that it recovers. This behavior is not seen on a standard Windows Server node.
I0710 15:23:40.415037 5272 device_windows.go:124] Created HostComputeNetwork vxlan0
panic: reflect: Field index out of range [recovered]
panic: reflect: Field index out of range
goroutine 1 [running]:
encoding/json.(*encodeState).marshal.func1(0xc0001db540)
/usr/local/go/src/encoding/json/encode.go:301 +0xa1
panic(0x121d7c0, 0x157a2b0)
/usr/local/go/src/runtime/panic.go:513 +0x1c7
reflect.Value.Field(0x130ff60, 0xc00032b3e0, 0x99, 0x2d383632342d3032, 0xc0002e9e30, 0xc0000c82c0, 0x12ee720)
/usr/local/go/src/reflect/value.go:816 +0x148
encoding/json.fieldByIndex(0x130ff60, 0xc00032b3e0, 0x99, 0xc00034a6d0, 0x1, 0x1, 0x12ee720, 0xc00032b3e0, 0x99)
/usr/local/go/src/encoding/json/encode.go:842 +0x65
encoding/json.(*structEncoder).encode(0xc0002e9e00, 0xc0000c82c0, 0x130ff60, 0xc00032b3e0, 0x99, 0x1300100)
/usr/local/go/src/encoding/json/encode.go:635 +0x149
encoding/json.(*structEncoder).encode-fm(0xc0000c82c0, 0x130ff60, 0xc00032b3e0, 0x99, 0xc0001d0100)
/usr/local/go/src/encoding/json/encode.go:661 +0x6b
encoding/json.(*encodeState).reflectValue(0xc0000c82c0, 0x130ff60, 0xc00032b3e0, 0x99, 0x1fd0100)
/usr/local/go/src/encoding/json/encode.go:333 +0x89
encoding/json.(*encodeState).marshal(0xc0000c82c0, 0x130ff60, 0xc00032b3e0, 0x100, 0x0, 0x0)
/usr/local/go/src/encoding/json/encode.go:305 +0xfb
encoding/json.Marshal(0x130ff60, 0xc00032b3e0, 0x130ff60, 0xc00032b3e0, 0x0, 0x0, 0x0)
/usr/local/go/src/encoding/json/encode.go:160 +0x59
github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn.modifyNetwork(0xc0004c6de0, 0x24, 0xc000514460, 0x6a, 0xc000514460, 0x6a, 0x0)
/go/src/github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn/hcnnetwork.go:224 +0x1d1
github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn.(*HostComputeNetwork).ModifyNetworkSettings(0xc0003de1e0, 0xc000069d00, 0xc00002e540, 0x31)
/go/src/github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn/hcnnetwork.go:356 +0x13c
github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn.(*HostComputeNetwork).AddPolicy(0xc0003de1e0, 0xc00032b1d0, 0x1, 0x1, 0x2, 0x8)
/go/src/github.com/coreos/flannel/vendor/github.com/Microsoft/hcsshim/hcn/hcnnetwork.go:377 +0x1b9
github.com/coreos/flannel/backend/vxlan.ensureLink(0xc000125480, 0xc000125480, 0x130fc60, 0x1)
/go/src/github.com/coreos/flannel/backend/vxlan/device_windows.go:147 +0x348
github.com/coreos/flannel/backend/vxlan.newVXLANDevice(0xc0002e9b30, 0xc0002e9b30, 0xc00023a4c0, 0xc0001baf60)
/go/src/github.com/coreos/flannel/backend/vxlan/device_windows.go:47 +0x7a
github.com/coreos/flannel/backend/vxlan.(*VXLANBackend).RegisterNetwork(0xc0006310a0, 0x2330138, 0xc00023a4c0, 0x100000000, 0xc000000000, 0xc0000ae640, 0x0, 0xc00009c082, 0xf, 0xc000663698)
/go/src/github.com/coreos/flannel/backend/vxlan/vxlan_windows.go:149 +0x514
main.main()
/go/src/github.com/coreos/flannel/main.go:289 +0x735
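The trace shows the panic firing inside encoding/json while flannel's Windows VXLAN backend (ensureLink in device_windows.go) calls hcsshim's hcn.(*HostComputeNetwork).AddPolicy, which goes through ModifyNetworkSettings and modifyNetwork and JSON-encodes an hcn struct on the way to HNS. As a rough, read-only diagnostic sketch (Windows-only, using the hcn package from a current github.com/Microsoft/hcsshim rather than flannel's vendored copy, and assuming flannel has already created the vxlan0 network), one could check whether simply JSON-encoding the HCN network struct reproduces the reflect panic on a Server Core node:

```go
// hcn_marshal_check.go - hypothetical, Windows-only diagnostic sketch; not part
// of flannel or hcsshim. It looks up the vxlan0 HostComputeNetwork that flannel
// created and JSON-encodes it, since the panic above originates in json.Marshal
// on an hcn struct. Assumes github.com/Microsoft/hcsshim is on the module path.
package main

import (
	"encoding/json"
	"fmt"

	"github.com/Microsoft/hcsshim/hcn"
)

func main() {
	// vxlan0 is the network name flannel logs when it creates the HCN network.
	network, err := hcn.GetNetworkByName("vxlan0")
	if err != nil {
		panic(err)
	}

	// The flannel panic happens while encoding/json walks an hcn struct via
	// reflection; marshaling the network here exercises the same machinery
	// without modifying any HNS state.
	raw, err := json.Marshal(network)
	if err != nil {
		panic(err)
	}
	fmt.Printf("network %s (%s) marshals cleanly: %d bytes\n", network.Name, network.Id, len(raw))
}
```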
More logs available on request.
Cluster information
- Cluster type: Hosted Rancher
- Machine type: AWS t3a.xlarge
- AMI: Windows_Server-1909-English-Core-Base-2020.06.10 (ami-0e9662f7dc78f5a18)
- OS: Windows Server Datacenter 10.0.18363.900
- Rancher: v2.4.5
- K8s: v1.18.3
- Docker: version 19.03.5, build 2ee0c57608
gz#11061 gz#13205
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (12 by maintainers)
Commits related to this issue
- Update wmi_exporter to v0.0.4 Related Issue: https://github.com/rancher/rancher/issues/27911 Why is it related? In Feb 11, 2020 Windows rolled out a security update that changed the interface betwe... — committed to aiyengar2/system-charts by aiyengar2 3 years ago
- Update wmi_exporter to v0.0.5 Related Issue: https://github.com/rancher/rancher/issues/27911 Why is it related? In Feb 11, 2020 Windows rolled out a security update that changed the interface betwe... — committed to aiyengar2/system-charts by aiyengar2 3 years ago
@SheilaghM We have validated monitoring on Windows as described here using Windows_Server-1909-English-Core-ContainersLatest-2021.01.13. Monitoring v1 works on 1903/1809. It also works on 1909 using GCE nodes. This issue seems specific to AWS AMIs.
On 2.5.5, using the GCP image windows-server-1909-dc-core-for-containers-v20210112, the exporter-node-windows-cluster-monitoring workload deployed successfully on Windows nodes.
On 2.5.4-rc7, the exporter-node-windows-cluster-monitoring pod reports:
Error: failed to start container "exporter-node-binary-copy": Error response from daemon: container exporter-node-binary-copy encountered an error during hcsshim::System::Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)
@aiyengar2 Yes, it did. Seeing as the cluster is long gone at this point, I’m spinning up a new 2.4.11 mixed cluster with k8s v1.18.12 to validate the level of functionality of Windows 1909 Server Core. I’ll update here with my results.