kubernetes: Pod Failure Policy Edge Case: Job Retries When Pod Finishes Successfully

What happened?

I’ve noticed an edge case while trying out the PodFailurePolicy example from Story 3 of the KEP. When a pod is marked with a DisruptionTarget condition but its container then completes successfully, the Job retries anyway. I wasn’t able to add some sort of Ignore rule to exclude the exit code 0 case.
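
For reference, the kind of rule I had in mind looked roughly like the following; the action/operator combination here is only a sketch of the intent (treat a 0 exit code as something to ignore), not a configuration I was able to get working:

  podFailurePolicy:
    rules:
      # Sketch of the intended rule; I could not actually express this.
      - action: Ignore
        onExitCodes:
          operator: In
          values: [0]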

What did you expect to happen?

I’m not sure what the intended behavior is, but I expected the Job to succeed.

How can we reproduce it (as minimally and precisely as possible)?

Here’s a Job that reproduces the issue 99% of the time:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  template:
    spec:
      activeDeadlineSeconds: 60
      containers:
        - name: busybox
          image: busybox
          command: ["sleep", "1m"]
      restartPolicy: Never
  backoffLimit: 3
  podFailurePolicy:
    rules:
      - action: Count
        onPodConditions:
          - type: DisruptionTarget
      - action: FailJob
        onExitCodes:
          operator: NotIn
          values: [0]

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1-gke.200", GitCommit:"e7a3bc760915b7368460d2ed0bd2c2568645ab70", GitTreeState:"clean", BuildDate:"2023-01-20T09:28:11Z", GoVersion:"go1.19.5 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

GKE

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • State: open
  • Created a year ago
  • Comments: 45 (40 by maintainers)

Most upvoted comments

I think we should maintain consistency and make it match 0. Since the feature is beta, we are allowed to change the behavior, given proper release notes and documentation updates.

+1 This makes sense.

I didn’t find where to update the documentation.

A couple of places will need adjustments:

  1. The API field description.
  2. The KEP example:
     - action: FailJob
       onExitCodes:
         operator: In
         values: [0]
  3. The user-facing documentation (but it does not seem to be affected by the change).

EDIT: we do not touch the edge case in the examples in the user-facing documentation, but we could add a note. To be discussed.

3. And how does this behavior help with this issue?

This should solve the issue for the In operator. However, as you point out, removing that line also changes the logic for the NotIn operator in a non-obvious way, as a side effect.

I think it makes sense to keep the current behavior for the NotIn operator so that it stays backward-compatible. In that case, one solution is to exclude containers with a 0 exit code here: https://github.com/kubernetes/kubernetes/blob/c9ff2866682432075da1a961bc5c3f681b34c8ea/pkg/controller/job/pod_failure_policy.go#L120, i.e. inject if exitCode == 0 { return false }.

Ah, OK, so this seems to be working as intended from the job controller’s perspective. I’m not sure why a pod would be marked as Failed if it exits with exit code 0. @bobbypage, can you provide some context?

/sig node

I think in this case the pod is being marked as Failed because it exceeded activeDeadlineSeconds. The reproduction manifest has a busybox sleep of 1 minute and activeDeadlineSeconds is also 60 seconds, so there appears to be a race between the container terminating and the active deadline (and the active deadline is winning).
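
If that is the race, a quick sanity check would be to give the container clear headroom before the deadline, e.g. something like the following (timings are only illustrative):

  template:
    spec:
      activeDeadlineSeconds: 60
      containers:
        - name: busybox
          image: busybox
          # exits well before the 60s deadline, so the deadline should not fire
          command: ["sleep", "30"]
      restartPolicy: Never

With that spacing the container should terminate first and the Job should complete normally.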

In your YAML, it seems that this is a single-pod Job. Since you say the pod is succeeding, is the Job marked as succeeded even after creating a replacement pod? This would help us understand whether there is an additional problem.

Looks like it fails with a BackoffLimitExceeded error:

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    batch.kubernetes.io/job-tracking: ""
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"test-job","namespace":"default"},"spec":{"backoffLimit":3,"podFailurePolicy":{"rules":[{"action":"Count","onPodConditions":[{"type":"DisruptionTarget"}]},{"action":"FailJob","onExitCodes":{"operator":"NotIn","values":[0]}}]},"template":{"spec":{"activeDeadlineSeconds":60,"containers":[{"command":["sleep","1m"],"image":"busybox","name":"busybox"}],"restartPolicy":"Never"}}}}
  creationTimestamp: "2023-02-10T16:52:04Z"
  generation: 1
  labels:
    controller-uid: 74b33f35-dd4c-4504-9c11-13ae7f99356e
    job-name: test-job
  name: test-job
  namespace: default
  resourceVersion: "25103390"
  uid: 74b33f35-dd4c-4504-9c11-13ae7f99356e
spec:
  backoffLimit: 3
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  podFailurePolicy:
    rules:
    - action: Count
      onExitCodes: null
      onPodConditions:
      - status: "True"
        type: DisruptionTarget
    - action: FailJob
      onExitCodes:
        containerName: null
        operator: NotIn
        values:
        - 0
      onPodConditions: null
  selector:
    matchLabels:
      controller-uid: 74b33f35-dd4c-4504-9c11-13ae7f99356e
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: 74b33f35-dd4c-4504-9c11-13ae7f99356e
        job-name: test-job
    spec:
      activeDeadlineSeconds: 60
      containers:
      - command:
        - sleep
        - 1m
        image: busybox
        imagePullPolicy: Always
        name: busybox
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastProbeTime: "2023-02-10T16:57:21Z"
    lastTransitionTime: "2023-02-10T16:57:21Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 4
  ready: 0
  startTime: "2023-02-10T16:52:04Z"
  uncountedTerminatedPods: {}