kubernetes: Pod Failure Policy Edge Case: Job Retries When Pod Finishes Successfully
What happened?
I’ve noticed an edge case while trying out the PodFailurePolicy example from Story 3 of the KEP. When a pod is marked as a DisruptionTarget but its container then completes successfully (exit code 0), the Job retries it anyway. I wasn’t able to add any kind of Ignore rule to exclude the exit-code-0 case.
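For illustration, the natural way to express such a rule is rejected by API validation, since the value 0 cannot be used with the In operator (hypothetical snippet, not part of the manifest below):

    - action: Ignore
      onExitCodes:
        operator: In
        values: [0]   # rejected by validation: 0 is not allowed with In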
What did you expect to happen?
I’m not sure what the expected behavior should be, but I expected the Job to succeed.
How can we reproduce it (as minimally and precisely as possible)?
Here’s a Job that reproduces the issue 99% of the time:
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  template:
    spec:
      activeDeadlineSeconds: 60
      containers:
      - name: busybox
        image: busybox
        command: ["sleep", "1m"]
      restartPolicy: Never
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Count
      onPodConditions:
      - type: DisruptionTarget
    - action: FailJob
      onExitCodes:
        operator: NotIn
        values: [0]
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1-gke.200", GitCommit:"e7a3bc760915b7368460d2ed0bd2c2568645ab70", GitTreeState:"clean", BuildDate:"2023-01-20T09:28:11Z", GoVersion:"go1.19.5 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
GKE
OS version
No response
Install tools
No response
Container runtime (CRI) and version (if applicable)
No response
Related plugins (CNI, CSI, …) and versions (if applicable)
No response
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 45 (40 by maintainers)
+1 This makes sense.
A couple of places will need adjustments:
EDIT: we do not touch the edge case in the examples in the user-facing documentation, but we could add a note. To be discussed.
This should solve the issue for the In operator. However, as you point out, removing that line changes the logic for the NotIn operator in a non-obvious way, as a side effect. I think it makes sense to keep the current behavior for the NotIn operator to be backward-compatible. In that case, one solution is to exclude containers with a 0 exit code here: https://github.com/kubernetes/kubernetes/blob/c9ff2866682432075da1a961bc5c3f681b34c8ea/pkg/controller/job/pod_failure_policy.go#L120 by injecting if exitCode == 0 { return false }.
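A rough, runnable sketch of that suggestion (the type and function names below are simplified stand-ins, not the actual pod_failure_policy.go code):

package main

import "fmt"

// onExitCodesRequirement is a simplified stand-in for a rule's
// onExitCodes requirement (operator plus the listed exit codes).
type onExitCodesRequirement struct {
	operator string // "In" or "NotIn"
	values   []int32
}

// containerMatches reports whether a terminated container's exit code
// satisfies the requirement. The guard at the top is the proposed fix:
// a container that exited 0 (i.e. succeeded) never matches a failure rule.
func containerMatches(exitCode int32, req onExitCodesRequirement) bool {
	if exitCode == 0 {
		return false // proposed fix: a successful container never matches
	}
	inList := false
	for _, v := range req.values {
		if v == exitCode {
			inList = true
			break
		}
	}
	if req.operator == "In" {
		return inList
	}
	return !inList // "NotIn"
}

func main() {
	rule := onExitCodesRequirement{operator: "NotIn", values: []int32{0}}
	fmt.Println(containerMatches(0, rule)) // false: exit 0 can no longer trigger FailJob
	fmt.Println(containerMatches(1, rule)) // true: a genuine failure still matches
}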
I think in this case the pod is being marked as Failed because it exceeded the activeDeadlineSeconds. The reproduction manifest has a busybox sleep of 1 minute and the activeDeadlineSeconds is also 60 seconds, so there is a race between the container terminating and the active deadline (and the active deadline is winning).

Looks like it fails with a backoff limit error.
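If that diagnosis is right, a hypothetical tweak to the repro (not from the original report) avoids the race by giving the pod more headroom than the sleep, so the container exits 0 before the deadline fires:

spec:
  template:
    spec:
      activeDeadlineSeconds: 120   # comfortably longer than the 1m sleep
      containers:
      - name: busybox
        image: busybox
        command: ["sleep", "1m"]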