kubernetes: Kubernetes controller manager may fail to manage deployment rollout with minReadySeconds and pod restarts

What happened?

kubectl rollout restart deployment alpine ; kubectl rollout status deployment alpine

may exceed the deployment rollout progress deadline if the deployment sets minReadySeconds and a new pod restarts during the rollout before its minimum ready seconds have elapsed.

What did you expect to happen?

Deployment rollout is successful.

How can we reproduce it (as minimally and precisely as possible)?

Apply the following deployment on a 3-node cluster. We are using 3 nodes here to simulate a multi-zone cluster that spreads pods across zones.

kind: Deployment
apiVersion: apps/v1
metadata:
  labels:
    app: alpine
  name: alpine
spec:
  minReadySeconds: 30
  replicas: 3
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: alpine
  strategy:
    rollingUpdate:
      maxSurge: 3
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: alpine
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["alpine"]
            topologyKey: kubernetes.io/hostname
      containers:
      - image: alpine:latest
        imagePullPolicy: IfNotPresent
        name: alpine
        command: ["sh", "-c", "if [[ ! -e /tmp/okay ]]; then touch /tmp/okay; sleep 15; exit; else sleep 100000; fi"]
        volumeMounts:
        - mountPath: /tmp
          name: tmp
      volumes:
      - name: tmp
        emptyDir: {}

Then run kubectl rollout restart deployment alpine ; kubectl rollout status deployment alpine.

Anything else we need to know?

Removing minReadySeconds from the example deployment yields a successful rollout. In addition, finding the Kubernetes controller manager leader via kubectl get leases -n kube-system and killing that leader also allows the rollout to succeed. The rollout will also succeed if the old replica set is scaled down to zero replicas.

Kubernetes version

Failure occurs on Kubernetes versions 1.20, 1.21, 1.22, 1.23 and 1.24.

Cloud provider

N/A

OS version

N/A

Install tools

IBM Cloud Kubernetes Service

Container runtime (CRI) and version (if applicable)

N/A

Related plugins (CNI, CSI, …) and versions (if applicable)

N/A

Most upvoted comments

Replica sets

As mentioned, the most problematic parts of the issue mainly apply to replica sets.

The timeline of a problematic pod (there need to be other unready pods for this to occur):

0s    pod NotReady -> Ready (an availability check is scheduled for 30s from now)
15s   pod Ready -> NotReady
15.5s pod NotReady -> Ready (a fresh availability check 30s from now is wanted, but only the previously scheduled check, due in ~15s, is kept and no new check is scheduled)
30s   the replica set checks the pod for availability, which it is not
--- no further checks will occur
12h   resync period occurs: the pod is finally marked as available, but we will need to wait another 12h for the next replica set sync, which is not that great...

We are using a delaying queue to schedule availability checks. The problem is that once an availability check is scheduled, it cannot be postponed further into the future (when pod readiness changes again). The delaying queue only allows moving queued items earlier: https://github.com/kubernetes/kubernetes/blob/d62cc3dc6d5c07fea79eafd866ac7e1217000ea8/pkg/controller/replicaset/replica_set.go#L464

https://github.com/kubernetes/kubernetes/blob/f536b3cc4fb8e396086bc6a0108018a783bf3cad/staging/src/k8s.io/client-go/util/workqueue/delaying_queue.go#L272

But we need the opposite: we either want to postpone the scheduled availability check, or delete it and schedule a new one.
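
To make the limitation concrete, here is a minimal sketch against client-go's delaying workqueue (the key and the delays are illustrative stand-ins for the pod key and minReadySeconds, scaled down so the program finishes quickly): an AddAfter with a later deadline does not postpone an item that is already waiting with an earlier one.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewDelayingQueue()
	defer q.ShutDown()
	start := time.Now()

	// First NotReady -> Ready transition: the controller schedules the
	// availability check one minReadySeconds out (scaled down to 2s here).
	q.AddAfter("default/alpine-pod", 2*time.Second)

	// The pod flips to Ready again shortly afterwards and a fresh check
	// further out (4s here) is wanted, but the earlier entry wins: the
	// delaying queue only ever moves items earlier, never later.
	q.AddAfter("default/alpine-pod", 4*time.Second)

	key, _ := q.Get()
	fmt.Printf("%v dequeued after ~%v (the later deadline was ignored)\n",
		key, time.Since(start).Round(time.Second))
	q.Done(key)
}

In this sketch the second AddAfter is effectively ignored, which mirrors the 15.5s step in the timeline above.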

This also happens with more pods, where some pods steal the availability scheduling window of others (in that case pods do not have to flip readiness multiple times).

Stateful sets and daemon sets

This issue also manifests in stateful sets and daemon sets, but to a lesser extent, since an availability check will eventually be scheduled, just at an incorrect time. These controllers forcefully run availability checks every minReadySeconds period, so it can take up to an additional minReadySeconds before pod availability is recognized.

https://github.com/kubernetes/kubernetes/blob/a27a323419a52b0b287ee1bdb4f3339b03ade798/pkg/controller/statefulset/stateful_set.go#L488

This could hurt the progress of the stateful set especially when working with large minReadySeconds values.
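
As a rough sketch of that pattern (illustrative names, not the actual controller code): whenever some pods are Ready but not yet Available, the sync handler simply requeues the key a full minReadySeconds later instead of computing the exact remaining time, so availability can be recognized up to minReadySeconds late.

package controller

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// requeueForAvailability is an illustrative stand-in (not the real controller
// code) for how the stateful set and daemon set controllers handle pods that
// are Ready but not yet Available: the key is requeued a full minReadySeconds
// later rather than at the moment the oldest Ready pod actually becomes
// Available, so recognition can lag by up to minReadySeconds.
func requeueForAvailability(q workqueue.DelayingInterface, key string, minReadySeconds int32, hasUnavailablePods bool) {
	if !hasUnavailablePods {
		return // all pods are already Available; nothing to re-check
	}
	// Worst case this fires almost a full minReadySeconds after the pod has
	// already crossed the availability threshold, delaying the rollout.
	q.AddAfter(key, time.Duration(minReadySeconds)*time.Second)
}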

Not a real workaround

It is possible to partially work around this issue by setting --min-resync-period to a lower value on kube-controller-manager, but this will impact the performance of all controllers.

Proposing a fix in https://github.com/kubernetes/kubernetes/pull/113605