kubernetes: SIG Windows main branch tests are failing since merge of device manager for Windows
Which jobs are failing:
Flakes:
https://testgrid.k8s.io/sig-windows-containerd#aks-engine-azure-windows-master-containerd
https://testgrid.k8s.io/sig-windows-releases#aks-engine-azure-windows-master-staging
Full failure:
https://testgrid.k8s.io/sig-windows-releases#aks-engine-azure-windows-master-staging-serial-slow
Which test(s) are failing:
Since when has it been failing: Since merging https://github.com/kubernetes/kubernetes/pull/93285
Testgrid link:
Reason for failure:
Serial slow tests are failing with:
manager.go:270] failed to listen to socket while starting device plugin registry, with error listen unix C:\var\lib\kubelet\device-plugins\kubelet.sock: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
and the parallel tests are flaking with:
cri_stats_provider.go:583] Unable to fetch container log stats: failed to get fsstats for "\\var\\log\\pods\\kube-system_directx-device-plugin-zhmh8_788b6352-0117-43d6-95dd-1ec82da6b196\\hostdev\\0.log": failed to get FsInfo due to error The directory name is invalid.
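For context, that bind error is what you get when a stale kubelet.sock from a previous kubelet process is still on disk. A minimal sketch (not the kubelet code; the path and helper name are illustrative) of the failure and the usual work-around of unlinking the stale socket before listening:

```go
package main

import (
	"log"
	"net"
	"os"
	"path/filepath"
)

// listenDevicePluginSocket is an illustrative helper, not kubelet code.
// If a previous process crashed or the node was reset, the old socket file
// survives and a fresh net.Listen fails with "address already in use"
// (reported on Windows as "Only one usage of each socket address ... is
// normally permitted"). Removing the stale file first avoids that.
func listenDevicePluginSocket(dir, name string) (net.Listener, error) {
	socketPath := filepath.Join(dir, name)
	if err := os.Remove(socketPath); err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	return net.Listen("unix", socketPath)
}

func main() {
	l, err := listenDevicePluginSocket(`C:\var\lib\kubelet\device-plugins`, "kubelet.sock")
	if err != nil {
		log.Fatalf("failed to listen to socket while starting device plugin registry: %v", err)
	}
	defer l.Close()
}
```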
Anything else we need to know: I would not expect kubelet to crash on the node if the plugin is not present.
/sig windows
About this issue
- State: closed
- Created 4 years ago
- Comments: 19 (11 by maintainers)
Thanks for your quick response and analysis 👍
You can find the kubelet logs (and many others) for the Windows nodes in the “artifacts” tab on https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-staging-windows-serial-slow/1343693191026577408
This set of tests runs a configuration for the GMSA tests which, as part of the domain controller setup, causes node resets. This explains why one of the nodes seems OK (it isn’t randomly selected as one of the domain machines) and why the parallel tests don’t cause a full failure (they don’t cause node resets).
https://github.com/kubernetes-sigs/windows-testing/blob/abb08b4bbcf3f2605370908d580d773b0903a345/extensions/gmsa-dc/v1/Setup-gMSA.ps1#L276
https://github.com/kubernetes-sigs/windows-testing/blob/abb08b4bbcf3f2605370908d580d773b0903a345/extensions/gmsa-member/v1/Join-Domain.ps1#L99
Not all clusters use those scripts. This set of tests doesn’t use them either (it uses aks-engine, which has a different start-up script). IMHO this is a breaking change, since any cluster that has a node restart would fail to come back online, and kubelet should be able to restart (because of a node restart, a crash for another reason, etc.).
It looks like there is code to clean up old sockets but it is not working on Windows: https://github.com/kubernetes/kubernetes/blob/b860d08e4ba0c89724046f4ead26b9a535ebdad0/pkg/kubelet/cm/devicemanager/manager.go#L262-L266
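For reference, a cleanup routine of roughly this shape (a sketch only, assuming the real code keys off os.ModeSocket; the name is illustrative) would be consistent with the behaviour: if Windows does not report the socket mode bit for AF_UNIX socket files, the stale kubelet.sock never matches the check, is never removed, and the next bind fails as above.

```go
package devicemanager

import (
	"os"
	"path/filepath"
)

// removeStaleSockets is an illustrative sketch, not the actual manager.go
// cleanup. It deletes only directory entries it recognises as unix sockets.
func removeStaleSockets(dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, entry := range entries {
		info, err := entry.Info()
		if err != nil {
			continue
		}
		// Assumption: on Windows os.ModeSocket may not be reported for
		// AF_UNIX socket files, so a check like this silently skips
		// kubelet.sock and the stale socket survives the restart.
		if info.Mode()&os.ModeSocket != 0 {
			_ = os.Remove(filepath.Join(dir, entry.Name()))
		}
	}
	return nil
}
```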
Given there is cleanup code there and it’s a breaking change, I think this should be fixed in kubelet.
Are you sure it’s called twice? Because on Windows, container_manager_linux.go is not present but is replaced by container_manager_windows.go; you can see the first-line build comments // +build windows and // +build linux. A simple test is:
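For reference, a minimal sketch of the build-constraint mechanism being described (a hypothetical file, not the kubelet sources): a file whose first lines carry the windows constraint is only compiled when GOOS=windows, and its linux-tagged counterpart is excluded from that build.

```go
//go:build windows
// +build windows

// container_manager_windows_sketch.go: a hypothetical example file. It is
// only included in a GOOS=windows build; a sibling file tagged
// "// +build linux" would be compiled on Linux instead, which is how
// container_manager_windows.go replaces container_manager_linux.go.
package cm

// newPlatformContainerManager is illustrative only.
func newPlatformContainerManager() string {
	return "windows container manager"
}
```

Building with GOOS=windows pulls in only the windows-tagged file, so only that platform’s implementation is present in the binary.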
Maybe the fix can be added in
https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/start-kubelet.ps1