kubernetes: Add max retries when pod's restartPolicy is RestartPolicyOnFailure

If a pod’s restartPolicy is RestartPolicyOnFailure, its containers are restarted indefinitely whenever they fail. That is pointless when the failure is caused by a system fault or by an internal error in the user’s program: the container will keep failing no matter how many times it is retried. At the same time, we cannot tell in general which failures are worth retrying. I propose adding a max retries setting for the RestartPolicyOnFailure policy to the pod spec, so that the pod fails once the limit is reached.

/kind feature
/sig apps

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 17
  • Comments: 63 (29 by maintainers)

Most upvoted comments

I finally implemented it by adding a maxRetries field to PodSpec.

For example:

$ cat pod_retry.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  restartPolicy: "OnFailure"
  maxRetries: "3"                # Max retries is 3
  containers:
  - image: nginx:1.7.9
    name: test-pod
    command:
    - /bin/ls
    - hello

The pod above is retried 3 times and then fails for good. Without maxRetries, the pod is retried forever even though it can never succeed, and the retries waste time and resources.
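The retry rule described above can be sketched in Go as follows. This is only an illustration of the proposed semantics, not the code from the actual implementation: `shouldRestart`, its parameters, and the local `RestartPolicy` type are hypothetical names made up for this sketch, and a nil `maxRetries` is assumed to mean "no limit" (today's behaviour).

```go
package main

import "fmt"

// RestartPolicy is a local stand-in for the pod restart policy names
// used in this issue (not the real Kubernetes API type).
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

// shouldRestart is a hypothetical helper that decides whether a container
// that just exited should be started again.
// Assumption: maxRetries == nil means "retry without limit".
func shouldRestart(policy RestartPolicy, exitCode, restartCount int, maxRetries *int) bool {
	switch policy {
	case RestartPolicyAlways:
		return true
	case RestartPolicyOnFailure:
		if exitCode == 0 {
			// A successful exit is never retried under OnFailure.
			return false
		}
		if maxRetries == nil {
			return true
		}
		// Stop once the container has already been restarted maxRetries times.
		return restartCount < *maxRetries
	default:
		// RestartPolicyNever and unknown values: do not restart.
		return false
	}
}

func main() {
	limit := 3
	for restarts := 0; restarts <= limit; restarts++ {
		again := shouldRestart(RestartPolicyOnFailure, 1, restarts, &limit)
		fmt.Printf("restarts so far=%d, restart again=%v\n", restarts, again)
	}
}
```

With a limit of 3, the container runs once, is restarted three times, and the fourth failure is final.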

BTW: I also implemented another version that uses the annotation kubernetes.io/maxRetriesForOnFailurePolicy. Maybe we could use that for the alpha feature?

$ cat pod_retry.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  annotations:
    kubernetes.io/maxRetriesForOnFailurePolicy: "3"           # Max retries is 3
spec:
  restartPolicy: "OnFailure"
  containers:
  - image: nginx:1.7.9
    name: test-pod
    command:
    - /bin/ls
    - hello
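For the annotation variant, the limit has to be read from a string value. A minimal sketch of that parsing step is below; `maxRetriesFromAnnotations` is a hypothetical helper written for illustration, and the choices to treat a missing annotation as "no limit" and to reject negative values are assumptions, not behaviour confirmed by the PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// Annotation key taken from the example above.
const maxRetriesAnnotation = "kubernetes.io/maxRetriesForOnFailurePolicy"

// maxRetriesFromAnnotations is a hypothetical helper that extracts the retry
// limit from a pod's annotations.
// It returns (limit, true, nil) when the annotation holds a valid value,
// (0, false, nil) when the annotation is absent (assumed to mean "no limit"),
// and an error when the value is not a non-negative integer.
func maxRetriesFromAnnotations(annotations map[string]string) (int, bool, error) {
	raw, ok := annotations[maxRetriesAnnotation]
	if !ok {
		return 0, false, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, false, fmt.Errorf("annotation %s: %q is not an integer: %w", maxRetriesAnnotation, raw, err)
	}
	if n < 0 {
		return 0, false, fmt.Errorf("annotation %s: value %d must not be negative", maxRetriesAnnotation, n)
	}
	return n, true, nil
}

func main() {
	annotations := map[string]string{maxRetriesAnnotation: "3"}
	limit, set, err := maxRetriesFromAnnotations(annotations)
	fmt.Println(limit, set, err)
}
```

Annotation values are always strings, which is why the value is quoted in the YAML and converted with `strconv.Atoi` here.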

Sorry all, I’ll write a KEP for it.

Hi @filipre, this feature aims to solve exactly the kind of problem you describe. A pod can fail for all kinds of reasons, and retrying it forever is pointless. Capping the retries could reduce costs.

/reopen

I created a PR https://github.com/kubernetes/kubernetes/pull/79334 for it some time ago, but it needs a KEP.

@kerthcet I would be grateful if you could follow up on it. We could finish it together if needed.

I’m glad to work on a KEP then, thanks. I hope to finish it in release 1.25.

Can anyone confirm whether this feature has been implemented? I don’t see any mention of it in the official Kubernetes documentation. If it has not been implemented, I kindly ask the owners to look into implementing it.

What is the status of this feature/PR? This is something we are really interested in.

Just sent a PR for it, implemented it using pod annotations. cc @filipre

/reopen

I’d be interested to know whether this feature would improve performance or reduce costs within the cluster. We have sometimes had services restart a thousand times before we noticed that something was wrong.

This issue shouldn’t go stale before receiving at least some feedback. /remove-lifecycle stale