kubernetes: kubectl delete daemonset hangs

I have a daemonset resource like this:

NAME             DESIRED   CURRENT   NODE-SELECTOR   AGE
newrelic-agent   3         3         <none>          3d

When I try to run kubectl delete daemonset to delete this daemonset resource, it hangs for a few minutes and then exits with status 1.

root@ubuntu192:~/k8s# kubectl get daemonset 
NAME             DESIRED   CURRENT   NODE-SELECTOR                                                               AGE
newrelic-agent   0         0         13e0252d-8069-11e6-9cdd-005056881537=13e02602-8069-11e6-9cdd-005056881537   3d

This daemonset still exists. I tried to delete it again; this time it hung for 8 minutes.

root@ubuntu192:~/k8s# kubectl delete ds newrelic-agent
error: timed out waiting for the condition
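
As an aside, the time kubectl waits before giving up on the delete can be raised, assuming your kubectl version supports the --timeout flag on delete (a sketch, and not a fix for the underlying problem):

kubectl delete ds newrelic-agent --timeout=10m   # wait up to 10 minutes before reporting the timeout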

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"4+", GitVersion:"v1.4.0-alpha.2.282+c8ea7af912f86e", GitCommit:"c8ea7af912f86e05e22f1e8d0d0b90c8b9fc90d7", GitTreeState:"clean", BuildDate:"2016-08-05T01:09:47Z", GoVersion:"go1.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"4+", GitVersion:"v1.4.0-alpha.2.282+c8ea7af912f86e", GitCommit:"c8ea7af912f86e05e22f1e8d0d0b90c8b9fc90d7", GitTreeState:"clean", BuildDate:"2016-08-05T01:09:47Z", GoVersion:"go1.6", Compiler:"gc", Platform:"linux/amd64"}

Environment: Linux ubuntu192.168.14.100 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

This is part of the output from running strace against the kubectl delete daemonset command. Is there some way to get a stack trace from k8s, like the Docker daemon?

stat("/root/.kube/config", {st_mode=S_IFREG|0644, st_size=332, ...}) = 0
openat(AT_FDCWD, "/root/.kube/config", O_RDONLY|O_CLOEXEC) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=332, ...}) = 0
read(5, "apiVersion: v1\nclusters:\n- clust"..., 844) = 332
read(5, "", 512) = 0
close(5) = 0
stat("/var/run/secrets/kubernetes.io/serviceaccount/token", 0xc820466518) = -1 ENOENT (No such file or directory)
stat("/var/run/secrets/kubernetes.io/serviceaccount/token", 0xc8204665e8) = -1 ENOENT (No such file or directory)
futex(0xc8200e0908, FUTEX_WAKE, 1) = 1
socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 5
setsockopt(5, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("192.168.14.100")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1007984784, u64=140308999610512}}) = 0
getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(33804), sin_addr=inet_addr("192.168.14.100")}, [16]) = 0
getpeername(5, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("192.168.14.100")}, [16]) = 0
setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(5, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPINTVL, [30], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPIDLE, [30], 4) = 0
futex(0xc8200e0908, FUTEX_WAKE, 1) = 1
read(5, 0xc82029c000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /api HTTP/1.1\r\nHost: 192.168"..., 163) = 163
futex(0xc8200e0908, FUTEX_WAKE, 1) = 1
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = 0
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = 0
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = 0
epoll_wait(4, [], 128, 0) = 0
futex(0x2964fc0, FUTEX_WAKE, 1) = 0
futex(0x2964f10, FUTEX_WAKE, 1) = 1
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2964fe8, FUTEX_WAKE, 1) = 0
futex(0x2964f10, FUTEX_WAKE, 1) = 1
futex(0xc8200e0908, FUTEX_WAKE, 1) = 1
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = 0
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = 0
sched_yield() = 0
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = 0
futex(0x2964fe8, FUTEX_WAKE, 1) = 0
futex(0x2964f10, FUTEX_WAKE, 1) = 1
futex(0x2964fc0, FUTEX_WAKE, 1) = 1
futex(0x2964f10, FUTEX_WAKE, 1) = 1
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = 0
sched_yield() = 0
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = 0
select(0, NULL, NULL, NULL, {0, 100}) = 0 (Timeout)
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = 0
sched_yield() = 0
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = 0
futex(0x2964fe8, FUTEX_WAKE, 1) = 1
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
sched_yield() = 0
sched_yield() = 0
futex(0x2965d08, FUTEX_WAIT, 0, NULL) = 0
futex(0x2965d08, FUTEX_WAIT, 0, NULLerror: timed out waiting for the condition <unfinished ...>
+++ exited with 1 +++
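
For what it's worth, here is a sketch of two other ways to see where kubectl itself is stuck, without strace (assuming a reasonably recent kubectl; output details vary by version):

kubectl delete ds newrelic-agent -v=8   # log every API request/response kubectl makes
kill -QUIT <kubectl-pid>                # like any Go program, kubectl dumps all goroutine stacks on SIGQUIT

Here <kubectl-pid> is a placeholder for the PID of the hanging kubectl process.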

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 37 (22 by maintainers)

Most upvoted comments

Seeing this problem with 1.6.1.

I created the calico-node ds, it was misconfigured, so I deleted it. All the pods were terminated, but the ds won’t go away. I’ve tried --now and --force, but kubectl won’t delete the ds. The node selector has been updated to randomUUID=randomUUID.

edit: Fixed it with kubectl delete ds --force --now --cascade=false.

I hit this issue. In my case, when I deleted the daemonset and it hung, some of its pods were stuck in Terminating status. I'm not sure whether everyone reporting this hit the same cause, but here are steps to reproduce and a workaround (in my case).

1. Stop the node service on one of the nodes so it becomes NotReady

# kubectl get node
NAME                     STATUS     AGE
knakayam-ose33-smaster   Ready      37d
knakayam-ose33-snode1    NotReady   2m
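
One way to take a node down for this test is to stop the kubelet on it (a sketch, assuming the kubelet on that node is managed by systemd; the node is marked NotReady once the node-monitor grace period expires, roughly 40 seconds by default):

systemctl stop kubelet                    # run on knakayam-ose33-snode1; the node stops posting status
kubectl get node knakayam-ose33-snode1    # shows NotReady after the grace period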

2. Deploy the daemonset and confirm that one of the pods is not in Running status.

# kubectl create -f log-daemonset.yaml (I tested @pbecotte's log-daemonset from his comment above.)

# kubectl get pod
NAME                              READY     STATUS              RESTARTS   AGE
log-daemonset-4z8xo               0/1       ContainerCreating   0          3m
log-daemonset-l4nxo               1/1       Running             0          5m

3. Delete the daemonset and observe the issue

# kubectl delete ds log-daemonset
 ... (hang) ...

4. Confirm the pod that was running on the NotReady host is still stuck in Terminating

# kubectl get pod -o wide
NAME                              READY     STATUS              RESTARTS   AGE       IP        NODE
log-daemonset-4z8xo               0/1       Terminating         0          4m        <none>    knakayam-ose33-snode1
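
To see what is holding things up, one can inspect the stuck pod and the node it was scheduled to (a sketch; exact output varies by version):

kubectl describe pod log-daemonset-4z8xo                                           # events show the pending termination
kubectl get pod log-daemonset-4z8xo -o jsonpath='{.metadata.deletionTimestamp}'    # set once deletion has been requested
kubectl get node knakayam-ose33-snode1                                             # a NotReady kubelet never confirms the kill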

workaround

1. delete pod with --grace-period=0

# kubectl delete pod log-daemonset-4z8xo --grace-period=0
pod "log-daemonset-4z8xo" deleted

2. delete daemonset

# kubectl delete ds log-daemonset
daemonset "log-daemonset" deleted

@vnandha kubectl delete deletes resources cascadingly by default. A DaemonSet cannot finish a cascading deletion until all of its pods are gone (which is expected). You can resolve this either by forcefully deleting the stuck pod and then deleting the DaemonSet cascadingly, or by deleting the DaemonSet with --cascade=false and then deleting its pods manually.
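
In concrete terms, the two options look roughly like this (a sketch; <stuck-pod-name> and <daemonset-pod-label> are placeholders for whatever your DaemonSet actually uses):

# Option A: force-delete the stuck pod, then delete the DaemonSet cascadingly
kubectl delete pod <stuck-pod-name> --grace-period=0 --force
kubectl delete ds newrelic-agent

# Option B: delete only the DaemonSet object, then clean up its pods manually
kubectl delete ds newrelic-agent --cascade=false
kubectl delete pods -l <daemonset-pod-label> --grace-period=0 --force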

I’m unable to reproduce it, so just brainstorming some possible causes.

If DaemonSet deletion times out, and…

  1. if DaemonSet’s .spec.template.spec.nodeSelector isn’t updated to randomUUID=randomUUID: it’s something to do with kubectl delete
  2. else if at least one of its pods is still Running: it’s something to do with DaemonSet controller
  3. else if at least one of its pods is Terminating: it’s something to do with kubelet
  4. else if its pods are all gone, and DaemonSet’s .status.currentNumberScheduled + .status.numberMisscheduled isn’t 0: it’s something to do with DaemonSet controller
  5. else if its pods are all gone, and DaemonSet’s status is correct: it’s something to do with garbage collector
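
A sketch of how one might check which case applies, using the DaemonSet from the original report (<daemonset-pod-label> is a placeholder for the labels in the DaemonSet's pod template):

kubectl get ds newrelic-agent -o jsonpath='{.spec.template.spec.nodeSelector}'                               # case 1: has the node selector been rewritten?
kubectl get pods -l <daemonset-pod-label> -o wide                                                            # cases 2/3: any pods still Running or Terminating?
kubectl get ds newrelic-agent -o jsonpath='{.status.currentNumberScheduled} {.status.numberMisscheduled}'    # case 4: status counters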