kubernetes: Kubernetes controller manager may fail to manage deployment rollout with minReadySeconds and pod restarts
What happened?
kubectl rollout restart deployment alpine ; kubectl rollout status deployment alpine
may exceed the deployment's rollout progress deadline if the deployment sets minReadySeconds
and a new pod restarts during the rollout before its minimum ready seconds have elapsed.
What did you expect to happen?
Deployment rollout is successful.
How can we reproduce it (as minimally and precisely as possible)?
Apply the following deployment on a 3-node cluster. We use 3 nodes here to simulate a multi-zone cluster that spreads pods across zones.
kind: Deployment
apiVersion: apps/v1
metadata:
  labels:
    app: alpine
  name: alpine
spec:
  minReadySeconds: 30
  replicas: 3
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: alpine
  strategy:
    rollingUpdate:
      maxSurge: 3
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: alpine
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["alpine"]
            topologyKey: kubernetes.io/hostname
      containers:
      - image: alpine:latest
        imagePullPolicy: IfNotPresent
        name: alpine
        command: ["sh", "-c", "if [[ ! -e /tmp/okay ]]; then touch /tmp/okay; sleep 15; exit; else sleep 100000; fi"]
        volumeMounts:
        - mountPath: /tmp
          name: tmp
      volumes:
      - name: tmp
        emptyDir: {}
Then run kubectl rollout restart deployment alpine ; kubectl rollout status deployment alpine.
Anything else we need to know?
Removing minReadySeconds from the example deployment will yield a successful rollout. In addition, finding the Kubernetes controller manager leader via kubectl get leases -n kube-system and killing the leader will also allow the rollout to succeed. The rollout will also succeed if the old replica set is scaled down to zero replicas.
Kubernetes version
Failure occurs on Kubernetes versions 1.20, 1.21, 1.22, 1.23 and 1.24.
Cloud provider
N/A
OS version
N/A
Install tools
IBM Cloud Kubernetes Service
Container runtime (CRI) and version (if applicable)
N/A
Related plugins (CNI, CSI, …) and versions (if applicable)
N/A
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 19 (5 by maintainers)
Commits related to this issue
- Remove minReadySeconds from deployments Remove minReadySeconds from deployments until [1] is fixed. [1] https://github.com/kubernetes/kubernetes/issues/108266 — committed to rtheis/ibm-roks-toolkit by rtheis 2 years ago
- Remove minReadySeconds from deployments Remove minReadySeconds from deployments until [1] is fixed. [1] https://github.com/kubernetes/kubernetes/issues/108266 — committed to openshift-cherrypick-robot/ibm-roks-toolkit by rtheis 2 years ago
- Remove minReadySeconds from deployments Remove minReadySeconds from deployments until [1] is fixed. [1] https://github.com/kubernetes/kubernetes/issues/108266 — committed to openshift-cherrypick-robot/ibm-roks-toolkit by rtheis 2 years ago
- Remove minReadySeconds from deployments Remove minReadySeconds from deployments until [1] is fixed. [1] https://github.com/kubernetes/kubernetes/issues/108266 — committed to rtheis/ibm-roks-toolkit by rtheis 2 years ago
- Remove minReadySeconds from deployments Remove minReadySeconds from deployments until [1] is fixed. [1] https://github.com/kubernetes/kubernetes/issues/108266 — committed to rtheis/ibm-roks-toolkit by rtheis 2 years ago
- Remove minReadySeconds from deployments Remove minReadySeconds from deployments until [1] is fixed. [1] https://github.com/kubernetes/kubernetes/issues/108266 — committed to rtheis/ibm-roks-toolkit by rtheis 2 years ago
- [release-4.8] Remove minReadySeconds from deployments Remove minReadySeconds from deployments until [1] is fixed. [1] https://github.com/kubernetes/kubernetes/issues/108266 This is a manual cherry-... — committed to rtheis/ibm-roks-toolkit by rtheis 2 years ago
- [release-4.7] Remove minReadySeconds from deployments Remove minReadySeconds from deployments until [1] is fixed. [1] https://github.com/kubernetes/kubernetes/issues/108266 This is a manual cherry-... — committed to rtheis/ibm-roks-toolkit by rtheis 2 years ago
- [release-4.6] Remove minReadySeconds from deployments Remove minReadySeconds from deployments until [1] is fixed. [1] https://github.com/kubernetes/kubernetes/issues/108266 This is a manual cherry-... — committed to rtheis/ibm-roks-toolkit by rtheis 2 years ago
Replica sets
As mentioned, the most problematic parts of the issue mainly apply to replica sets.
The timeline of a problematic pod (there need to be other unready pods for this to occur):
We are using a delaying queue to schedule availability checks. The problem is that once we schedule an availability check, we cannot postpone it into the future (when pod readiness changes). The delaying queue only allows moving items earlier so they are processed sooner: https://github.com/kubernetes/kubernetes/blob/d62cc3dc6d5c07fea79eafd866ac7e1217000ea8/pkg/controller/replicaset/replica_set.go#L464
https://github.com/kubernetes/kubernetes/blob/f536b3cc4fb8e396086bc6a0108018a783bf3cad/staging/src/k8s.io/client-go/util/workqueue/delaying_queue.go#L272
But we need the opposite: we either want to postpone the readiness check, or delete the past readiness check and schedule a new one.
This also happens with more pods, where some pods steal the availability scheduling window of others (in this case pods do not have to flip readiness multiple times).
Stateful sets and daemon sets
This issue also manifests in stateful sets and daemon sets, but to a lesser extent: the availability check is eventually scheduled, just at an incorrect time. These controllers forcefully run availability checks every minReadySeconds period, so it can take up to minReadySeconds of additional time before pod availability is recognized.
https://github.com/kubernetes/kubernetes/blob/a27a323419a52b0b287ee1bdb4f3339b03ade798/pkg/controller/statefulset/stateful_set.go#L488
This could hurt the progress of the stateful set, especially when working with large minReadySeconds values.
Not a real workaround
It is possible to partially work around this issue by setting --min-resync-period to a lower value in kube-controller-manager, but this will impact the performance of all controllers.
Proposing a fix in https://github.com/kubernetes/kubernetes/pull/113605