kubernetes: Flaking Test: subpath failures in new-master-upgrade-cluster-new-parallel, other jobs

Which jobs are failing: gce-new-master-upgrade-cluster-new-parallel

Which test(s) are failing:

Varies, but all subpath failures, including:

  • [sig-storage] CSI Volumes [Driver: csi-hostpath-v0] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath with backstepping is outside the volume [Slow]
  • [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath directory is outside the volume [Slow]
  • [sig-storage] CSI Volumes [Driver: csi-hostpath-v0] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath file is outside the volume [Slow]
  • [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath file is outside the volume [Slow]

… and pretty much every other subpath test, but never all of them at once.

There are also a few other storage tests failing, such as:

  • [sig-storage] Volume expand [Slow] Verify if editing PVC allows resize
  • [sig-storage] Detaching volumes should not work when mount is in progress

Since when has it been failing: 11/22

Testgrid link: https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster-new-parallel&include-filter-by-regex=.*CSI.*&width=20

Reason for failure:

These flakes started around the time that #71314 merged, but they don’t match up with the exact merge timestamp, so it’s probably coincidental.

The subpath test failures seem to be mostly timeouts:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:254
while waiting for failed event to occur
Expected error:
    <*errors.errorString | 0xc0000d1860>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:601

… so possibly this is just a GCE failure.

Anything else we need to know:

This test job has always been flaky, with around a 40% failure rate.

/kind flake /sig storage /priority important-soon

cc @saad-ali @AishSundar @liggitt

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 31 (27 by maintainers)

Most upvoted comments

#71570 and #71569 have been merged and will address the biggest issues. #71570 has already been backported to 1.13; the #71569 backport is still pending.

I saw the same on https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/422

CSI volume mount succeeded at:

I1128 12:22:26.251674    1451 operation_generator.go:567] MountVolume.SetUp succeeded for volume "pvc-1e53cb5f-f308-11e8-b109-42010a800002" (UniqueName: "kubernetes.io/csi/csi-hostpath-v0-e2e-tests-csi-volumes-ghd78^2079fc5e-f308-11e8-89b9-426ec44f83b8") pod "pod-subpath-test-csi-hostpath-v0-dynamicpv-s8xl" (UID: "20c6ef48-f308-11e8-b109-42010a800002")

The kubelet correctly failed the pod because of the subpath:

I1128 12:22:45.494961    1451 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-csi-volumes-ghd78", Name:"pod-subpath-test-csi-hostpath-v0-dynamicpv-s8xl", UID:"20c6ef48-f308-11e8-b109-42010a800002", APIVersion:"v1", ResourceVersion:"38952", FieldPath:"spec.containers{test-container-subpath-csi-hostpath-v0-dynamicpv-s8xl}"}): type: 'Warning' reason: 'Failed' Error: failed to prepare subPath for volumeMount "test-volume" of container "test-container-subpath-csi-hostpath-v0-dynamicpv-s8xl"

But the test still failed:

Nov 28 12:26:27.070: INFO: Deleting pod "pod-subpath-test-csi-hostpath-v0-dynamicpv-s8xl" in namespace "e2e-tests-csi-volumes-ghd78"

Because the test failed to find the pod event.

This looks like a test issue. Not a 1.13 blocker.

@saad-ali @msau42

All four of these subpath tests call testPodFailSubpathError, which fails while looking for a specific failed event via WaitTimeoutForPodEvent. WaitTimeoutForPodEvent uses eventOccured to check whether the specific error event has happened, and eventOccured only checks the first event, so I guess the tests may flake if another failed event happens before the expected one.
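To illustrate the suspected flake, here is a minimal, self-contained sketch. It is not the e2e framework's code: the Event type and both helper functions are hypothetical stand-ins, and the first event's message is made up. It only contrasts checking the first event with scanning all of them.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a hypothetical stand-in for the pod event fields that matter here.
type Event struct {
	Reason  string
	Message string
}

// firstEventMatches mimics the behavior described above (an assumption about
// eventOccured): only the first returned event is inspected.
func firstEventMatches(events []Event, msg string) bool {
	if len(events) == 0 {
		return false
	}
	return strings.Contains(events[0].Message, msg)
}

// anyEventMatches scans every returned event, so an unrelated failed event
// arriving first cannot hide the expected one.
func anyEventMatches(events []Event, msg string) bool {
	for _, e := range events {
		if strings.Contains(e.Message, msg) {
			return true
		}
	}
	return false
}

func main() {
	events := []Event{
		// Made-up unrelated failure that happens to be listed first.
		{Reason: "Failed", Message: "Error: some unrelated failure"},
		{Reason: "Failed", Message: "Error: failed to prepare subPath for volumeMount \"test-volume\""},
	}
	fmt.Println(firstEventMatches(events, "subPath")) // false
	fmt.Println(anyEventMatches(events, "subPath"))   // true
}
```

If the first-event check is polled repeatedly and an unrelated event stays first in the list, it would never succeed, which would be consistent with the timeout seen in the test output.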

Also, WaitTimeoutForPodEvent already waits for the event, so we might be able to delete WaitForPodRunningInNamespace in testPodFailSubpathError.

I will create a PR to fix the above.