kubernetes: DaemonSet doesn't run in all nodes

Using v1.2.0-beta.1. Deployed a DaemonSet with no node selector, but it’s not running on all of the nodes.

The only two nodes where it is running are the ones with SchedulingDisabled.

$ kubectl get nodes
NAME            STATUS                     AGE
100.64.32.234   Ready                      8d
100.64.32.71    Ready,SchedulingDisabled   5m
100.64.33.77    Ready,SchedulingDisabled   19m
100.64.33.82    Ready                      2d
$ kubectl describe daemonset kube-proxy
Name:       kube-proxy
Image(s):   calpicow/hyperkube:v1.2.0-beta.1-custom
Selector:   name in (kube-proxy)
Node-Selector:  <none>
Labels:     name=kube-proxy
Desired Number of Nodes Scheduled: 2
Current Number of Nodes Scheduled: 2
Number of Nodes Misscheduled: 0
Pods Status:    2 Running / 0 Waiting / 0 Succeeded / 0 Failed
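
Since a node condition such as OutOfDisk can keep the DaemonSet controller from placing pods on a node even without a node selector, a reasonable first check is whether the two non-running nodes are reporting such a condition. A minimal sketch, assuming nothing beyond the node names shown above:

kubectl describe node 100.64.32.234 | grep -A 8 Conditions
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.status=="True")].type}{"\n"}{end}'

The second command prints, per node, every condition that is currently True; a healthy node should show only Ready.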

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 44 (24 by maintainers)

Most upvoted comments

Just ran into the same issue with K8s 1.6.4. I had a node go out of disk (OOD) and repaired it manually; when it came back healthy, the DaemonSet was not scheduled there, and the DaemonSet controller did not even try.

Fixed it using @ankon’s comment above (https://github.com/kubernetes/kubernetes/issues/23013#issuecomment-296596687)

This issue is really bad when the DaemonSet in question is, for example, Calico, which is needed for pod networking.

Seeing this with OpenShift 3.7 / K8s 1.7.

EDIT: the root cause for me was related to a taint on some nodes.
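
If a taint does turn out to be the culprit, a quick way to confirm and clear it is sketched below; the node name, taint key and DaemonSet name are placeholders, not values from this thread:

kubectl describe node <node-name> | grep -i taints
kubectl taint nodes <node-name> <taint-key>:NoSchedule-
kubectl patch ds <ds-name> --type merge -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"<taint-key>","operator":"Exists","effect":"NoSchedule"}]}}}}'

The taint command (note the trailing "-") removes the taint outright; the patch instead gives the DaemonSet a matching toleration, which is the better choice when the taint is there on purpose. Be aware that a merge patch replaces the whole tolerations list.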

For people ending up here with a 1.5 cluster and dreading having to replace nodes: it might help to just recreate the DaemonSet itself using something like

kubectl get -o yaml ds NAME > ds.yml
kubectl delete --cascade=false ds NAME
kubectl apply -f ds.yml

This worked for me to bring back a missing kube2iam pod on a node. Unfortunately I don’t have the logs any more to see why it got lost in the first place.
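
For what it’s worth, the --cascade=false delete removes only the DaemonSet object and orphans its pods, so they keep running; the subsequent apply recreates the DaemonSet and its controller re-adopts the orphaned pods through the label selector rather than restarting them. You can watch that happen with something like this (the selector is a placeholder):

kubectl get pods -l <ds-selector> -o wide --watch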

I’m seeing a similar issue.

I had a disk-full issue on a bunch of nodes (unrelated). Some nodes had the DaemonSet pods removed; others didn’t.

The issue is that once I’ve fixed this, I can’t get the nodes to reschedule the DaemonSet pods, short of deleting the node and then restarting the kubelet, which isn’t much fun.

I’m seeing this same behavior in my 1.2 cluster. I have 4 nodes in the cluster, all of which have sufficient space available, but the DS is reporting “desired” and “current” counts of 2. What’s worse, things were working properly a few days ago when I rolled this out, but sometime in the last few days two of the nodes lost their DS pods and they haven’t come back.

Just hit this problem with v1.14.1. Deployed some identical servers (apart from hostname/IP, obviously) from the same configuration management, but one of them was not getting DaemonSets scheduled on it.

Comment https://github.com/kubernetes/kubernetes/issues/23013#issuecomment-206503020 resolved the issue for us. It’s still strange that it happened at all, and that it only happened to one of them.

We ran into this problem with v1.6.13.

The instructions in this comment make the DaemonSet pods start on all nodes. But even after a delete and recreate, I think the DaemonSet is still left in a wrong state:

$ kubectl get ds --namespace calico-prod
NAME            DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR             AGE
calico          0         0         0         0            0           box.com/calico-pod=true   4m

Even though all the pods have started, the DaemonSet still thinks there are 0 desired replicas.
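
A DESIRED count of 0 usually means the controller sees no node matching the DaemonSet’s node selector, so it is worth confirming that the label really is present, for example (using the selector shown in the output above):

kubectl get nodes -l box.com/calico-pod=true
kubectl get ds calico --namespace calico-prod -o yaml | grep -A 2 nodeSelector

If the first command lists all the expected nodes and DESIRED still stays at 0, the controller’s view is stale, which would match the wrong-state suspicion above.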

@mikedanese It is still an issue for me:

I have 4 nodes, 1 master and 3 slaves.

kubectl get nodes
NAME          STATUS         AGE
api-master1   Ready,master   10d
api-node1     Ready          10d
api-node2     Ready          10d
api-node3     Ready          10d

I deploy the following DaemonSet:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-volume-config
data:
  nginx.conf: |-
    worker_processes 1;

    events {
      worker_connections 1024;
    }

    stream {
      error_log stderr;

      resolver 127.0.0.1 ipv6=off;

      server {
        listen 80;
        proxy_pass traefik-ingress-service.default.svc.cluster.local:80;
      }

      server {
        listen 443;
        proxy_pass traefik-ingress-service.default.svc.cluster.local:443;
      }

      server {
        listen 2222;
        proxy_pass deis-router.deis.svc.cluster.local:2222;
      }
    }
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nginx-ingress-proxy
spec:
  template:
    metadata:
      labels:
        name: nginx-ingress-proxy
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: dnsmasq
        image: "janeczku/go-dnsmasq:release-1.0.5"
        args:
          - --listen
          - "127.0.0.1:53"
          - --default-resolver
          - --nameservers
          - "10.96.0.10,8.8.8.8"
          - --hostsfile=/etc/hosts
          - --verbose
        ports:
        - name: dns
          containerPort: 53
          hostPort: 53
          protocol: UDP
      - image: nginx
        name: nginx
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: 400m
            memory: 300Mi
          requests:
            cpu: 200m
            memory: 200Mi
        volumeMounts:
        - mountPath: /etc/nginx
          name: config
          readOnly: false
        ports:
        - name: http
          containerPort: 80
          hostPort: 80
          protocol: TCP
        - name: https
          containerPort: 443
          hostPort: 443
          protocol: TCP
        - name: builder
          containerPort: 2222
          hostPort: 2222
          protocol: TCP
      hostNetwork: true
      restartPolicy: Always
      securityContext: {}
      volumes:
      - name: config
        configMap:
          name: nginx-volume-config
          items:
          - key: nginx.conf
            path: nginx.conf

It deploys, but has DESIRED set to 3, and not 4:

NAME                  DESIRED   CURRENT   READY     NODE-SELECTOR   AGE
nginx-ingress-proxy   3         3         3         <none>          5m

What’s weirdest of all, it deploys on the master node and just two of the slaves:

NAME                                          READY     STATUS    RESTARTS   AGE       IP          NODE
nginx-ingress-proxy-0958x                     2/2       Running   0          6m        10.0.1.4    api-node3
nginx-ingress-proxy-r3dcs                     2/2       Running   0          6m        10.0.1.6    api-node2
nginx-ingress-proxy-zk50w                     2/2       Running   0          6m        10.0.1.7    api-master1
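
One thing that might be worth ruling out (just a guess, not confirmed anywhere in this thread): the pod template uses hostNetwork and hostPorts 53, 80, 443 and 2222, and the DaemonSet controller of that era skipped any node where a requested host port was already taken, so anything else bound to one of those ports on api-node1 would explain the missing pod. A rough check of that node, including its taints and conditions:

kubectl describe node api-node1 | grep -iE 'taints|outofdisk|pressure'
kubectl get pods --all-namespaces -o wide | grep api-node1
ssh api-node1 sudo ss -lntup | grep -E ':(53|80|443|2222)\b'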

1.2.2 seems to be stable for me - my DaemonSets are still running pods on each node after a few days.

OK, more troubleshooting with Kelsey on Slack: deleting the problem nodes by hand, then restarting the kubelet on those nodes, seemed to fix the issue. The DS scheduled onto the remaining node once the kubelet had registered itself again. Guessing a bad cache somewhere.

kubectl delete node <node-name>
ssh <node ip> sudo systemctl restart kube-kubelet