kubernetes: daemonset wrongly reports unavailable pods

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: kubectl rollout status on a daemonset sometimes gets stuck forever. The daemonset’s status reports unavailable pods even though all pods are running and ready.

$ kubectl -n monitoring rollout status ds <redacted>
Waiting for rollout to finish: 1 of 2 updated pods are available...

Here’s the status section of the daemonset:

status:
  currentNumberScheduled: 2
  desiredNumberScheduled: 2
  numberAvailable: 1
  numberMisscheduled: 0
  numberReady: 2
  numberUnavailable: 1
  observedGeneration: 1
  updatedNumberScheduled: 2
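
For reference, the completion check kubectl performs for a RollingUpdate daemonset boils down to comparing the counters above. The Go sketch below is only a simplified paraphrase of that logic (not the actual kubectl source), but it reproduces the exact message the rollout is stuck on:

package main

import "fmt"

// DaemonSetStatus mirrors the status counters shown above.
type DaemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberAvailable        int32
	ObservedGeneration     int64
}

// rolloutDone is a simplified paraphrase of the check kubectl rollout status
// performs for a RollingUpdate daemonset: the controller must have observed
// the latest spec generation, every node must run an up-to-date pod, and
// every up-to-date pod must be counted as available.
func rolloutDone(generation int64, st DaemonSetStatus) (bool, string) {
	if st.ObservedGeneration < generation {
		return false, "Waiting for daemon set spec update to be observed..."
	}
	if st.UpdatedNumberScheduled < st.DesiredNumberScheduled {
		return false, fmt.Sprintf("Waiting for rollout to finish: %d out of %d new pods have been updated...",
			st.UpdatedNumberScheduled, st.DesiredNumberScheduled)
	}
	if st.NumberAvailable < st.DesiredNumberScheduled {
		return false, fmt.Sprintf("Waiting for rollout to finish: %d of %d updated pods are available...",
			st.NumberAvailable, st.DesiredNumberScheduled)
	}
	return true, "daemon set successfully rolled out"
}

func main() {
	// The status reported above: 2 pods ready, but only 1 counted as available.
	st := DaemonSetStatus{
		DesiredNumberScheduled: 2,
		UpdatedNumberScheduled: 2,
		NumberAvailable:        1,
		ObservedGeneration:     1,
	}
	done, msg := rolloutDone(1, st)
	fmt.Println(done, msg) // false Waiting for rollout to finish: 1 of 2 updated pods are available...
}

So kubectl itself is behaving correctly given the status it sees: numberAvailable (1) never reaches desiredNumberScheduled (2), even though numberReady is 2. The question is why the controller never bumps numberAvailable.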

Here are the status sections of the two pods:

  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2017-09-25T21:13:05Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2017-09-25T21:16:14Z
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2017-09-25T21:14:24Z
      status: "True"
      type: PodScheduled

  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2017-09-25T21:13:04Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2017-09-25T21:16:02Z
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2017-09-25T21:14:28Z
      status: "True"
      type: PodScheduled
$ date
Mon Sep 25 22:18:39 UTC 2017
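
A pod only counts toward numberAvailable once it has been Ready for at least spec.minReadySeconds (30 seconds for this daemonset). Here is a minimal sketch of that rule, assuming the k8s.io/api and k8s.io/apimachinery modules are available; it illustrates the availability criterion rather than quoting the controller’s actual code:

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// readyCondition returns the pod's Ready condition, if present.
func readyCondition(pod *corev1.Pod) *corev1.PodCondition {
	for i := range pod.Status.Conditions {
		if pod.Status.Conditions[i].Type == corev1.PodReady {
			return &pod.Status.Conditions[i]
		}
	}
	return nil
}

// isAvailable sketches the rule used when tallying numberAvailable: the pod
// must be Ready, and it must have been Ready for at least minReadySeconds.
func isAvailable(pod *corev1.Pod, minReadySeconds int32, now time.Time) bool {
	c := readyCondition(pod)
	if c == nil || c.Status != corev1.ConditionTrue {
		return false
	}
	return now.Sub(c.LastTransitionTime.Time) >= time.Duration(minReadySeconds)*time.Second
}

func main() {
	// The Ready condition of the first pod above, and the time of the $ date call.
	readySince := metav1.Date(2017, 9, 25, 21, 16, 14, 0, time.UTC)
	now := time.Date(2017, 9, 25, 22, 18, 39, 0, time.UTC)

	pod := &corev1.Pod{Status: corev1.PodStatus{Conditions: []corev1.PodCondition{{
		Type:               corev1.PodReady,
		Status:             corev1.ConditionTrue,
		LastTransitionTime: readySince,
	}}}}

	fmt.Println(isAvailable(pod, 30, now)) // true: ready for over an hour, far past minReadySeconds=30
}

With the Ready conditions above last transitioning around 21:16 and the current time at 22:18, both pods have been ready for over an hour, so both should have been counted as available long ago.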

What you expected to happen: kubectl rollout status should exit successfully when the rollout is complete. The daemonset should report all pods as available.

How to reproduce it (as minimally and precisely as possible): Can’t reproduce it reliably, but it happened with this simple daemonset:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  annotations: <redacted>
  name: <redacted>
  namespace: <redacted>
spec:
  minReadySeconds: 30
  template:
    metadata:
      annotations: <redacted>
      labels: <redacted>
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-nodepool
                operator: NotIn
                values:
                - <redacted>
      containers:
      - args: <redacted>
        image: <redacted>
        imagePullPolicy: Always
        name: <redacted>
        ports: <redacted>
        resources: {}
        volumeMounts: <redacted>
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      volumes:
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels
            path: labels
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
        name: podinfo
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 3
    type: RollingUpdate

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T09:14:02Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T08:56:23Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): COS
  • Kernel (e.g. uname -a):
  • Install tools: GKE
  • Others:

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 7
  • Comments: 36 (13 by maintainers)

Most upvoted comments

Observed in one of our clusters as well, where the reported number of available/ready pods seems to be wrong.

This happened to 2 other DaemonSets running in this cluster as well. We “fixed” one of them by changing the DaemonSet pod spec, which triggered a rolling update; after the update the counts were correct. We’ve not been able to find any misbehaving node or other signs that something is wrong with the cluster.

The DaemonSet has been in this state for 3-4 hours now.

Some information:

$ kubectl get ds rook-agent
NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
rook-agent   38        38        37      38           37          <none>          521d
$ kubectl get po -l app=rook-agent
NAME               READY   STATUS    RESTARTS   AGE
rook-agent-2mt2k   1/1     Running   5          28d
rook-agent-4bmp8   1/1     Running   5          28d
rook-agent-4c6sv   1/1     Running   9          68d
rook-agent-4sljf   1/1     Running   5          28d
rook-agent-4sr27   1/1     Running   23         384d
rook-agent-56xxw   1/1     Running   9          68d
rook-agent-589vv   1/1     Running   9          68d
rook-agent-5t9dc   1/1     Running   5          28d
rook-agent-7bl2w   1/1     Running   9          69d
rook-agent-9wwhj   1/1     Running   9          69d
rook-agent-b7qb9   1/1     Running   13         149d
rook-agent-c5w5z   1/1     Running   13         149d
rook-agent-dhfzd   1/1     Running   9          68d
rook-agent-ff28t   1/1     Running   5          28d
rook-agent-fmlcs   1/1     Running   23         417d
rook-agent-gpk6b   1/1     Running   9          68d
rook-agent-jt5vv   1/1     Running   26         384d
rook-agent-jtx9v   1/1     Running   9          69d
rook-agent-kpzmp   1/1     Running   9          68d
rook-agent-m2kqn   1/1     Running   9          69d
rook-agent-mtdmr   1/1     Running   13         149d
rook-agent-mvhgp   1/1     Running   9          69d
rook-agent-mw42l   1/1     Running   9          68d
rook-agent-ndd68   1/1     Running   5          28d
rook-agent-pqkhk   1/1     Running   9          69d
rook-agent-q2xbp   1/1     Running   5          28d
rook-agent-ssnm2   1/1     Running   9          68d
rook-agent-tvwj2   1/1     Running   23         417d
rook-agent-tzxhj   1/1     Running   9          68d
rook-agent-v2hds   1/1     Running   14         149d
rook-agent-v4l58   1/1     Running   9          68d
rook-agent-v8htx   1/1     Running   5          28d
rook-agent-w9qmc   1/1     Running   9          68d
rook-agent-x7t4d   1/1     Running   9          68d
rook-agent-xjgm6   1/1     Running   9          69d
rook-agent-zczhh   1/1     Running   9          69d
rook-agent-zvb28   1/1     Running   9          69d
rook-agent-zvqsh   1/1     Running   9          69d
$ kubectl get po -l app=rook-agent --no-headers | wc -l
38
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.2", GitCommit:"f6278300bebbb750328ac16ee6dd3aa7d3549568", GitTreeState:"clean", BuildDate:"2019-08-05T09:15:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
$ kubectl describe ds rook-agent
Name:           rook-agent
Selector:       app=rook-agent
Node-Selector:  <none>
Labels:         app=rook-agent
Annotations:    deprecated.daemonset.template.generation: 2
Desired Number of Nodes Scheduled: 38
Current Number of Nodes Scheduled: 38
Number of Nodes Scheduled with Up-to-date Pods: 38
Number of Nodes Scheduled with Available Pods: 37
Number of Nodes Misscheduled: 0
Pods Status:  38 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=rook-agent
  Service Account:  rook-agent
  Containers:
   rook-agent:
    Image:      rook/rook:v0.7.1
    Port:       <none>
    Host Port:  <none>
    Args:
      agent
    Environment:
      POD_NAMESPACE:   (v1:metadata.namespace)
      NODE_NAME:       (v1:spec.nodeName)
    Mounts:
      /dev from dev (rw)
      /flexmnt from flexvolume (rw)
      /lib/modules from libmodules (rw)
      /sys from sys (rw)
  Volumes:
   flexvolume:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/volumeplugins
    HostPathType:
   dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:
   sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:
   libmodules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
Events:            <none>

Please tell me if I can provide any more information.

I have the exact same issue here. K8s 1.15.3. I can fix every DaemonSet with that issue by running kubectl rollout restart ds <DaemonSetName>. It then gets re-rolled and everything is fine.
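
For anyone who wants to apply the same workaround programmatically, here is a rough client-go sketch (assuming client-go v0.18+; the namespace and name are placeholders to adjust). It bumps the pod-template annotation that kubectl rollout restart sets, so the controller performs a rolling update and recomputes its status counters:

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig and build a client.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Bump the pod-template annotation that kubectl rollout restart uses,
	// which triggers a rolling update of the DaemonSet.
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339))

	// Placeholder namespace and name: adjust for the affected DaemonSet.
	_, err = client.AppsV1().DaemonSets("monitoring").Patch(context.TODO(),
		"my-daemonset", types.StrategicMergePatchType, []byte(patch), metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("rolling restart triggered")
}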

@bbgobie I know, right!? It would be nice if it told us which pod it thinks is “not ready” (even though they all report as ready). We just ended up restarting all the pods as per my message a few posts up. I have not seen any problems with that on our 1.16.3 cluster.
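
For what it’s worth, the only per-pod input to the availability count is the Ready condition and how long it has been true, so a small one-off client-go program can at least show whether any pod is close to the minReadySeconds boundary. This is just a sketch (assumes client-go v0.18+; namespace and label selector are placeholders):

package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Placeholder namespace and label selector: adjust for the DaemonSet in question.
	pods, err := client.CoreV1().Pods("rook-system").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=rook-agent"})
	if err != nil {
		panic(err)
	}

	// Print each pod's Ready status and how long it has been Ready, which is
	// the only per-pod input to the daemonset's availability count.
	for _, p := range pods.Items {
		ready := false
		readyFor := time.Duration(0)
		for _, c := range p.Status.Conditions {
			if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
				ready = true
				readyFor = time.Since(c.LastTransitionTime.Time)
			}
		}
		fmt.Printf("%s\tready=%v\treadyFor=%s\n", p.Name, ready, readyFor.Round(time.Second))
	}
}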

/reopen

Third time’s the charm?