kubernetes: SIG Windows main branch tests are failing since merge of Device Manager for Windows

Which jobs are failing:

Flakes:
https://testgrid.k8s.io/sig-windows-containerd#aks-engine-azure-windows-master-containerd
https://testgrid.k8s.io/sig-windows-releases#aks-engine-azure-windows-master-staging

Full failure: https://testgrid.k8s.io/sig-windows-releases#aks-engine-azure-windows-master-staging-serial-slow

Which test(s) are failing:

Since when has it been failing: Since merging https://github.com/kubernetes/kubernetes/pull/93285

Testgrid link:

Reason for failure:

Serial slow tests are failing with:

manager.go:270] failed to listen to socket while starting device plugin registry, with error listen unix C:\var\lib\kubelet\device-plugins\kubelet.sock: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.

and the parallel tests are flaking with:

 cri_stats_provider.go:583] Unable to fetch container log stats: failed to get fsstats for "\\var\\log\\pods\\kube-system_directx-device-plugin-zhmh8_788b6352-0117-43d6-95dd-1ec82da6b196\\hostdev\\0.log": failed to get FsInfo due to error The directory name is invalid.

Anything else we need to know: I would not expect kubelet to crash on the node if the plugin is not present.

/sig windows

cc: @ddebroy @thomacos @aarnaud

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 19 (11 by maintainers)


Most upvoted comments

Thanks for your quick response and analysis 👍

Which log exactly do you get this error from? It would be interesting to see whether the line logged at /cm/devicemanager/manager.go#L241 shows up twice.

You can find the kubelet logs (and many others) for the Windows nodes in the “artifacts” tab on https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-staging-windows-serial-slow/1343693191026577408

What I mean is that it is not called twice within a single kubelet initialization, but that kubelet is initialized again for some reason on the same node.

(It still means that it is called twice, but I didn’t think about the possibility that kubelet is initialized twice.)

This set of tests runs the configuration for the GMSA tests, which, as part of setting up the domain controller, causes node resets. This explains why one of the nodes seems OK (it isn’t randomly selected as part of the domain machines) and why the parallel tests don’t cause a full failure (they don’t cause node resets).

https://github.com/kubernetes-sigs/windows-testing/blob/abb08b4bbcf3f2605370908d580d773b0903a345/extensions/gmsa-dc/v1/Setup-gMSA.ps1#L276

https://github.com/kubernetes-sigs/windows-testing/blob/abb08b4bbcf3f2605370908d580d773b0903a345/extensions/gmsa-member/v1/Join-Domain.ps1#L99

Maybe the fix can be added in https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/start-kubelet.ps1

Not all clusters use that script. This set of tests doesn’t use it either (it uses aks-engine, which has a different startup script). IMHO this is a breaking change, since any cluster that has a node restart would fail to come back online, and kubelet should be able to restart (because of a node restart, a crash for another reason, etc.).

But I would be interested in learning how this is handled under Linux, so we can choose the same path for Windows, too.

It looks like there is code to clean up old sockets, but it is not working on Windows: https://github.com/kubernetes/kubernetes/blob/b860d08e4ba0c89724046f4ead26b9a535ebdad0/pkg/kubelet/cm/devicemanager/manager.go#L262-L266

Given that there is cleanup code there and that this is a breaking change, I think this should be fixed in kubelet.
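For illustration only, here is a minimal sketch of the kind of cleanup that would have to work on Windows as well: remove any stale device-plugin socket before binding. The package, paths, and helper names (`cleanupStaleSocket`, `listenDevicePluginSocket`) are assumptions for this sketch and not the actual kubelet code, which cleans the whole socket directory.

```go
package devicemanagerfix // illustrative package, not the real kubelet package

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// cleanupStaleSocket removes a kubelet.sock left behind by a previous kubelet
// process (e.g. after a node reset) so that the device plugin registry can
// bind again on the next start.
func cleanupStaleSocket(socketDir, socketName string) error {
	socketPath := filepath.Join(socketDir, socketName)
	if err := os.Remove(socketPath); err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("failed to remove stale socket %q: %v", socketPath, err)
	}
	return nil
}

// listenDevicePluginSocket shows the order of operations: clean up first, then
// bind. Without the cleanup, the second kubelet start on Windows fails with
// "bind: Only one usage of each socket address ... is normally permitted".
func listenDevicePluginSocket(socketDir, socketName string) (net.Listener, error) {
	if err := cleanupStaleSocket(socketDir, socketName); err != nil {
		return nil, err
	}
	return net.Listen("unix", filepath.Join(socketDir, socketName))
}
```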

Are you sure it’s called twice?

Because on Windows container_manager_linux.go is not present; it is replaced by container_manager_windows.go. You can see the build constraints in the first-line comments: // +build windows and // +build linux.
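For reference, this is how the per-OS selection works: each file carries a build constraint near the top, so only one of the two container manager implementations is compiled for a given GOOS. The snippet below is an illustrative file header, not the full file.

```go
// container_manager_windows.go (illustrative header only).
// The constraint below means this file is compiled only when GOOS=windows;
// container_manager_linux.go carries "// +build linux" and is used on Linux.

// +build windows

package cm
```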

A simple test is (a programmatic sketch of the same sequence follows the list):

  1. Stop kubelet and check that the socket file is still present.
  2. Start kubelet; you get the error. Stop it.
  3. Remove the socket file.
  4. Start kubelet again; normally it works.
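A rough programmatic equivalent of those steps, as a sketch rather than a test from the repository: it assumes Go's AF_UNIX support on recent Windows builds and uses a temp-dir path instead of the real kubelet socket path. SetUnlinkOnClose(false) leaves the socket file behind the way an abruptly stopped kubelet or reset node would.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
)

func main() {
	// Illustrative path; on a real Windows node this would be
	// C:\var\lib\kubelet\device-plugins\kubelet.sock.
	sock := filepath.Join(os.TempDir(), "kubelet.sock")

	// Steps 1-2: bind once, then stop without unlinking the socket file,
	// simulating a kubelet that did not shut down cleanly.
	l, err := net.Listen("unix", sock)
	if err != nil {
		fmt.Println("first listen failed:", err)
		return
	}
	l.(*net.UnixListener).SetUnlinkOnClose(false) // leave the file behind
	l.Close()

	// Binding again while the stale file exists fails (on Windows this is the
	// "Only one usage of each socket address ..." message from the issue).
	if _, err := net.Listen("unix", sock); err != nil {
		fmt.Println("second listen failed as expected:", err)
	}

	// Steps 3-4: remove the stale file and the bind succeeds again.
	if err := os.Remove(sock); err != nil {
		fmt.Println("remove failed:", err)
		return
	}
	if l2, err := net.Listen("unix", sock); err == nil {
		fmt.Println("third listen succeeded after removing the stale socket")
		l2.Close()
	}
}
```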

Maybe the fix can be added in https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/start-kubelet.ps1