kubernetes: kubelet: Race condition in nodeshutdown unit test

What happened?

There appears to be a race condition in the following unit test:

https://github.com/kubernetes/kubernetes/blob/40c2d049465417f510e4182b05953a49fc5693d4/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux_test.go#L617

The race happens between the read at https://github.com/kubernetes/kubernetes/blob/40c2d049465417f510e4182b05953a49fc5693d4/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux_test.go#L707

and the write at https://github.com/kubernetes/kubernetes/blob/40c2d049465417f510e4182b05953a49fc5693d4/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L328
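
To illustrate the shape of the problem, here is a minimal sketch (with hypothetical names, not the actual kubelet code) of an unsynchronized read racing with writes from another goroutine, which is what the race detector reports here: the test inspects captured output while the manager's shutdown goroutine may still be writing to it.

package main

import (
	"bytes"
	"fmt"
	"time"
)

func main() {
	var logOutput bytes.Buffer // shared buffer with no synchronization

	// Writer side: stands in for the manager's shutdown goroutine logging
	// while it processes the shutdown event.
	go func() {
		for i := 0; i < 100; i++ {
			fmt.Fprintf(&logOutput, "shutting down pod %d\n", i) // concurrent write
		}
	}()

	// Reader side: stands in for the test asserting on the captured output.
	// time.Sleep establishes no happens-before edge with the goroutine above,
	// so `go run -race` reports a data race between this read and the writes.
	time.Sleep(time.Millisecond)
	_ = logOutput.String()
}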

What did you expect to happen?

No race condition

How can we reproduce it (as minimally and precisely as possible)?

cd $KUBE_ROOT/pkg/kubelet/nodeshutdown
go test -c -race
stress ./nodeshutdown.test -test.run ^Test_managerImpl_processShutdownEvent$
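
The stress binary above is presumably golang.org/x/tools/cmd/stress, which re-runs the test binary in a loop until a failure is caught; if it is not already installed:

go install golang.org/x/tools/cmd/stress@latest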

Anything else we need to know?

CI logs where race was detected: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/108039/pull-kubernetes-unit/1491676749694504960

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

seen in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109048/pull-kubernetes-unit/1507725298571939840

this spiked in the last couple days - https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=TestLocalStorage

is this related to the klog bump (#108725)?

cc @pohly

I don’t see how the klog bump could have made it worse. Perhaps the change around the flush daemon changed some timing conditions, but that’s a rather wild guess. These tests have been faulty all along and need to be fixed.
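
For reference, a minimal sketch of the usual test-side fix (hypothetical names, not the actual kubelet code), assuming the test owns the buffer it hands to the logger: serialize access so the logging goroutine and the asserting goroutine never touch the buffer concurrently.

package shutdowntest

import (
	"bytes"
	"sync"
)

// threadSafeBuffer wraps bytes.Buffer with a mutex so that writes from the
// manager's goroutine and reads from the test goroutine are serialized.
type threadSafeBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

// Write implements io.Writer and can be installed as the logger's output.
func (b *threadSafeBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

// String returns a snapshot of the captured output for test assertions.
func (b *threadSafeBuffer) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}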

What klog can do is support unit tests like this better. I’ve opened two issues:

Increasing priority since it is impacting test runs, and marking as a release blocker until it is root-caused and we understand whether it is prod-impacting.

@liggitt I think it’s safe to remove the release-blocker here. The root cause is largely as explained here: https://github.com/kubernetes/kubernetes/issues/108040#issuecomment-1040122707 (appears to be test-only)