kubernetes: [Flaky unit test] Test_Run_OneVolumeAttachAndDetachMultipleNodesWithReadWriteMany flakes with "Wrong total GetAttachCallCount" error
Which test(s) are flaking:
Test_Run_OneVolumeAttachAndDetachMultipleNodesWithReadWriteMany
Reason for failure:
--- FAIL: Test_Run_OneVolumeAttachAndDetachMultipleNodesWithReadWriteMany (0.63s)
reconciler_test.go:811: Warning: Wrong NewAttacherCallCount. Expected: <2> Actual: <0>. Will retry.
reconciler_test.go:811: Warning: Wrong NewAttacherCallCount. Expected: <2> Actual: <1>. Will retry.
reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
reconciler_test.go:919: Total AttachCallCount does not match expected value. Expected: <2>
FAIL
This is flaking rarely enough that it is not caught by our CI jobs, which currently tolerate up to 2 unit test failures per run (!).
With that toleration removed in https://github.com/kubernetes/kubernetes/pull/93605, this flake has been seen (https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/93605/pull-kubernetes-bazel-test/1293257915502694408).
Reproducible with the following steps:
- Build a test binary for the affected package:
go test -race -c ./pkg/controller/volume/attachdetach/reconciler
- Stress the affected test using the stress tool:
stress ./reconciler.test -test.run Test_Run_OneVolumeAttachAndDetachMultipleNodesWithReadWriteMany
- Once failures are seen, look at the logs to see the details about the error:
ls -lat $TMPDIR/go-stress*
cat $TMPDIR/go-stress...<filename>
/sig storage
cc @kubernetes/sig-storage-test-failures
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (15 by maintainers)
Generally, I think idempotency is a requirement. The controller cannot guarantee that it avoids concurrent calls, e.g., when the controller restarts, it loses the information about in-flight operations. So I think it makes sense to relax the tests a little bit.
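To illustrate what relaxing the tests could look like, here is a minimal sketch. It assumes a hypothetical fakeAttacher and a hypothetical checkAttachCalls helper (neither is taken from reconciler_test.go): Attach is idempotent, and the assertion requires at least the expected number of calls rather than exactly that many.

```go
package main

import "fmt"

// fakeAttacher is a hypothetical stand-in for a volume plugin's attacher.
// It is not the fake used by the real reconciler tests.
type fakeAttacher struct {
	attached map[string]bool // keyed by "volume/node"
	calls    int             // total number of Attach calls seen
}

func newFakeAttacher() *fakeAttacher {
	return &fakeAttacher{attached: make(map[string]bool)}
}

// Attach is idempotent: repeating it for an already attached volume/node
// pair succeeds and leaves the recorded state unchanged.
func (f *fakeAttacher) Attach(volume, node string) error {
	f.calls++
	f.attached[volume+"/"+node] = true
	return nil
}

// checkAttachCalls is a relaxed assertion (an assumption about how the test
// might be loosened): it fails only if fewer than atLeast calls were made.
func checkAttachCalls(f *fakeAttacher, atLeast int) error {
	if f.calls < atLeast {
		return fmt.Errorf("wrong Attach call count: expected at least <%d>, actual <%d>", atLeast, f.calls)
	}
	return nil
}

func main() {
	f := newFakeAttacher()
	f.Attach("vol1", "node1")
	f.Attach("vol1", "node2")
	f.Attach("vol1", "node1") // duplicate call, e.g. after a controller restart

	// Two distinct attachments exist, and three calls satisfy "at least 2".
	fmt.Println(len(f.attached), checkAttachCalls(f, 2))
}
```

The trade-off is that an at-least check no longer catches a controller that issues far too many Attach calls, so an upper bound may still be worth keeping.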
The issue can be simplified to this: there are two types of threads (the main thread and the operation threads it spawns), and two boolean registers A and B initialized to false (A being whether the volume is marked as attached in ASW and B being whether the operation is pending).
This can be fixed by instead calling read(B) before read(A) in the main thread. B == false implies either there were no operations before it, or the previous operation has already set A. Because the main thread spawns the operation thread, there should be no operation thread running that could set A. Collectively, A is guaranteed to be up-to-date when it’s read.
This translates to checking operation pending before checking ASW.