kubernetes: Add max retries when pod's restartPolicy is RestartPolicyOnFailure
If a pod’s restartPolicy is RestartPolicyOnFailure, its containers are restarted constantly when they fail. However, this can be pointless if the failure is caused by a system fault or an internal error in the user’s program: the pod will keep failing no matter how many times it retries. At the same time, we cannot distinguish which cases should be restarted and which should not. I propose adding a max retries setting for the RestartPolicyOnFailure policy to the pod spec, so that the pod fails permanently after the maximum number of retries.
/kind feature
/sig apps
About this issue
- State: closed
- Created 6 years ago
- Reactions: 17
- Comments: 63 (29 by maintainers)
Finally I implemented it. I added `maxRetries` to PodSpec, e.g.:
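A minimal sketch of such a pod spec, assuming the field name `maxRetries` proposed here (it is not part of the upstream PodSpec API, and the pod name, image, and command are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flaky-task
spec:
  restartPolicy: OnFailure
  # maxRetries is the field proposed in this issue; it is NOT part of the
  # upstream PodSpec API, so this manifest is only illustrative.
  maxRetries: 3
  containers:
  - name: task
    image: busybox
    command: ["sh", "-c", "exit 1"]
```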
The above pod will retry 3 times and then fail for good. Without `maxRetries`, the pod will retry forever even though it will never succeed, and the retries waste time and resources.

BTW: I implemented another version using the annotation `kubernetes.io/maxRetriesForOnFailurePolicy`. Maybe we could use that for an alpha feature?

Sorry all, I’ll write a KEP for it.
Hi @filipre, this feature aims to solve exactly that kind of problem: a pod might fail because of all kinds of errors, and it is meaningless to always retry. This feature could reduce costs.
I’m glad to work on a KEP then, thanks. Hope to finish it in release 1.25.
You can use `.spec.backoffLimit`: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy

Can anyone confirm whether this feature has been implemented? I don’t see any mention of it in the official Kubernetes documentation. If it’s not implemented, I request that the owners kindly look into implementing it.
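For reference, a minimal Job that uses `.spec.backoffLimit` as suggested above (the Job name, image, and command are placeholders); the Job controller marks the Job as failed once the pod has failed more times than `backoffLimit` allows:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: retry-limited-job
spec:
  backoffLimit: 3                # give up after 3 failed retries
  template:
    spec:
      restartPolicy: OnFailure   # Job pods must use OnFailure or Never
      containers:
      - name: task
        image: busybox
        command: ["sh", "-c", "exit 1"]
```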
What is the status of this feature/PR? This is something we are really interested in.
Just sent a PR for it, implemented it using pod annotations. cc @filipre
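A sketch of what the annotation-based variant mentioned in this thread might look like; note that `kubernetes.io/maxRetriesForOnFailurePolicy` is the annotation proposed in this issue, not a standard Kubernetes annotation, and the name, image, and value are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flaky-task
  annotations:
    # Proposed in this issue; not recognized by stock Kubernetes.
    kubernetes.io/maxRetriesForOnFailurePolicy: "3"
spec:
  restartPolicy: OnFailure
  containers:
  - name: task
    image: busybox
    command: ["sh", "-c", "exit 1"]
```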
/reopen
I’d be interested in whether this feature would improve performance or reduce costs within the cluster. We have sometimes had services restart a thousand times before we noticed that something was wrong.
This issue shouldn’t go stale before receiving at least some feedback.
/remove-lifecycle stale