kubernetes: [Flaky Test] [sig-apps] DisruptionController should block an eviction until the PDB is updated to allow it

Which jobs are flaking: all of the e2e jobs that run this test

Which test(s) are flaking:

[sig-apps] DisruptionController should block an eviction until the PDB is updated to allow it

Testgrid link:

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&test=DisruptionController should block an eviction until the PDB is updated to allow it

https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&width=5

Reason for failure:

mostly:

test/e2e/apps/disruption.go:273
Jul  9 23:34:27.116: Expected an error, got nil
test/e2e/apps/disruption.go:327

Anything else we need to know:

  • links to go.k8s.io/triage appreciated
  • links to specific failures in spyglass appreciated

/sig apps

xref: https://github.com/kubernetes/kubernetes/issues/92937

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Thanks for the update @hasheddan

Thanks everyone and apologies for the false alarm @hakman 😃

I just about have a fix ready here. I was able to narrow the commit range to https://github.com/kubernetes/kubernetes/compare/82baa2690...dd649bb7e by looking at when this flake first occurred in https://testgrid.k8s.io/sig-release-master-blocking#kind-ipv6-master-parallel. It looks like @liggitt was correct that this was introduced in #91342. I believe what is happening is that the test is getting a Pod that already has a DeletionTimestamp set (the one we previously issued a successful eviction for) before waiting for 3 pods to be ready again in the namespace (https://github.com/kubernetes/kubernetes/blob/c2d15418316e9a02bf5692de80064556fb4f89f0/test/e2e/apps/disruption.go#L317). Before that PR made us skip the PDB check when DeletionTimestamp is set, such a Pod did not meet the other criteria for bypassing the check (its Phase was not Succeeded, Failed, or Pending), so we still consulted the PDB, which reported that there was no budget left to delete (even though that budget was not really accounting for the Pod we were actually attempting to evict), and the eviction was rejected as the test expects.
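To make that concrete, here is a minimal sketch of the eviction decision as described above (a simplified illustration only, not the actual apiserver eviction registry code): once a Pod carries a DeletionTimestamp, the eviction succeeds without consulting the PDB, which is why the test now sees `Expected an error, got nil`.

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// canEvict is a simplified stand-in for the eviction decision described above.
// disruptionsAllowed plays the role of the PDB's remaining budget.
func canEvict(pod *v1.Pod, disruptionsAllowed int32) error {
	// Behavior after #91342 (as described above): a Pod that is already
	// terminating is evictable without a PDB check.
	if pod.DeletionTimestamp != nil {
		return nil
	}
	// Pods in a terminal or pending phase also bypass the PDB check.
	switch pod.Status.Phase {
	case v1.PodSucceeded, v1.PodFailed, v1.PodPending:
		return nil
	}
	// Everything else is governed by the remaining budget.
	if disruptionsAllowed <= 0 {
		return fmt.Errorf("cannot evict %s: no disruption budget remaining", pod.Name)
	}
	return nil
}

func main() {
	now := metav1.NewTime(time.Now())
	terminating := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "pod-0", DeletionTimestamp: &now},
		Status:     v1.PodStatus{Phase: v1.PodRunning},
	}
	// Previously the PDB check would fire here and return an error (budget
	// exhausted); now the call returns nil, which is what the test observes.
	fmt.Println(canEvict(terminating, 0)) // <nil>
}
```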

I think there are a few things we can do to address this:

  1. In waitForPodsOrDie we should check that not only is the Pod Ready, but also that it does not have DeletionTimestamp set.
  2. Move waitForPodsOrDie above where we call locateRunningPod so that we do not attempt to get a Pod before all of the pods are ready.
  3. Check in locateRunningPod that the Pod does not have a DeletionTimestamp.

Just doing (3) by itself would address our current problem, but (1) and (2) also seem to make sense in the context of what we are trying to check here.
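For illustration, here is a rough sketch of what fix (3) could look like in the test helper. This is hypothetical code, not the actual patch; fix (1) would apply the same DeletionTimestamp filter inside waitForPodsOrDie.

```go
package disruption

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isPodReady reports whether the Pod's Ready condition is true.
func isPodReady(p *v1.Pod) bool {
	for _, c := range p.Status.Conditions {
		if c.Type == v1.PodReady && c.Status == v1.ConditionTrue {
			return true
		}
	}
	return false
}

// locateRunningPod returns a ready Pod in ns that does not have a
// DeletionTimestamp set, so the test never picks the Pod it just evicted.
func locateRunningPod(ctx context.Context, cs kubernetes.Interface, ns string) (*v1.Pod, error) {
	podList, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for i := range podList.Items {
		p := &podList.Items[i]
		if p.DeletionTimestamp != nil {
			// Already being evicted/deleted; not a candidate (fix 3).
			continue
		}
		if isPodReady(p) {
			return p, nil
		}
	}
	return nil, fmt.Errorf("no ready, non-terminating pod found in namespace %q", ns)
}
```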

https://github.com/kubernetes/kubernetes/pull/91342 merged recently, not sure if it's relevant