kubernetes: [Flaky unit test] Test_Run_OneVolumeAttachAndDetachMultipleNodesWithReadWriteMany flakes with "Wrong total GetAttachCallCount" error

Which test(s) are flaking:

Test_Run_OneVolumeAttachAndDetachMultipleNodesWithReadWriteMany

Reason for failure:

--- FAIL: Test_Run_OneVolumeAttachAndDetachMultipleNodesWithReadWriteMany (0.63s)
    reconciler_test.go:811: Warning: Wrong NewAttacherCallCount. Expected: <2> Actual: <0>. Will retry.
    reconciler_test.go:811: Warning: Wrong NewAttacherCallCount. Expected: <2> Actual: <1>. Will retry.
    reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
    reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
    reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
    reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
    reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
    reconciler_test.go:909: Warning: Wrong total GetAttachCallCount(). Expected: <2> Actual: <3>. Will retry.
    reconciler_test.go:919: Total AttachCallCount does not match expected value. Expected: <2>
FAIL

This is flaking rarely enough that it is not caught by our CI jobs, which currently tolerate up to 2 unit test failures per run (!).

With that toleration removed in https://github.com/kubernetes/kubernetes/pull/93605, this flake has been seen (https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/93605/pull-kubernetes-bazel-test/1293257915502694408).

Reproducible with the following steps:

  1. Build a test binary for the affected package:

     go test -race -c ./pkg/controller/volume/attachdetach/reconciler

  2. Stress the affected test using the stress tool:

     stress ./reconciler.test -test.run Test_Run_OneVolumeAttachAndDetachMultipleNodesWithReadWriteMany

  3. Once failures are seen, look at the logs for details about the error:

     ls -lat $TMPDIR/go-stress*
     cat $TMPDIR/go-stress...<filename>

/sig storage cc @kubernetes/sig-storage-test-failures

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Generally, I think idempotency is a requirement. The controller cannot guarantee that it avoids calling operations concurrently; e.g., when the controller restarts, it loses its information about in-flight operations. So I think it makes sense to relax the tests a little.

The issue can be simplified to this: there are two types of threads, and two boolean registers A and B, both initialized to false (A being whether the volume is marked as attached in ASW, and B being whether the operation is pending).

main thread:
  while true:
    a = read(A)
    b = read(B)
    if not a and not b:
      set(B, true)
      spawnOperationThread()
operation thread:
  set(A, true)
  set(B, false)

This can be fixed by calling read(B) before read(A) in the main thread. If B == false, then either no operation was ever started, or the previous operation has already set A and exited. Since only the main thread spawns operation threads, no operation thread can still be running that could set A between the two reads. Collectively, A is guaranteed to be up to date when it is read.

This translates to checking operation pending before checking ASW.