kubernetes: [Flaky Test] [sig-apps] DisruptionController should block an eviction until the PDB is updated to allow it
Which jobs are flaking: all of the e2e jobs that run this test
Which test(s) are flaking:
`[sig-apps] DisruptionController should block an eviction until the PDB is updated to allow it`
Testgrid link:
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&width=5
Reason for failure (mostly):
```
test/e2e/apps/disruption.go:273
Jul 9 23:34:27.116: Expected an error, got nil
test/e2e/apps/disruption.go:327
```
Anything else we need to know:
- links to go.k8s.io/triage appreciated
- links to specific failures in spyglass appreciated
/sig apps
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (15 by maintainers)
Thanks for the update @hasheddan
Thanks everyone and apologies for the false alarm @hakman
I just about have a fix ready here. I was able to narrow the commit range to https://github.com/kubernetes/kubernetes/compare/82baa2690...dd649bb7e by looking at when this flake first occurred in https://testgrid.k8s.io/sig-release-master-blocking#kind-ipv6-master-parallel. It looks like @liggitt was correct that this was introduced in #91342.

I believe what is happening is that in the test we are getting a `Pod` with a `DeletionTimestamp` set (the one that we previously successfully issued an eviction for) before waiting for 3 pods to be ready again in the namespace (https://github.com/kubernetes/kubernetes/blob/c2d15418316e9a02bf5692de80064556fb4f89f0/test/e2e/apps/disruption.go#L317). Before the aforementioned PR caused us to skip checking PDBs when `DeletionTimestamp` is set, we would still report back that eviction was not possible, because our `Pod` with a `DeletionTimestamp` did not meet the other criteria (i.e. `Phase` was not `Succeeded`, `Failed`, or `Pending`). That caused us to still check PDBs, which would show us that we did not have the budget to delete (even though the budget was not really taking into account the `Pod` that we were actually attempting to evict).

I think there are a few things we can do to address this:

1. In `waitForPodsOrDie` we should check not only that the `Pod` is `Ready`, but also that it does not have a `DeletionTimestamp` set.
2. Move `waitForPodsOrDie` above where we call `locateRunningPod`, so that we do not attempt to get a `Pod` before all of the pods are ready.
3. Check in `locateRunningPod` that the `Pod` does not have a `DeletionTimestamp`.

Just doing (3) by itself would address our current problem, but (1) and (2) also seem to make sense in the context of what we are trying to check here.
https://github.com/kubernetes/kubernetes/pull/91342 merged recently, not sure if it's relevant