kubernetes: Daemonset controller doesn't retry creating pods after a timeout error
What happened:
Same issue as in #67662, but in the daemonset controller. Two pod creations failed because of a timeout, and the creation was not retried, either immediately or after 5 minutes. The pods were finally created 45 minutes later, when (I assume) the creation of another node nudged the daemonset controller to finally do its job. As a result, two nodes were left without networking, logging, and other important services. Here are the events ~25 minutes after the issue (repeated twice because of the 2 nodes and a restart of controller-manager in the hope that it would fix it):
Normal SuccessfulCreate 25m daemonset-controller Created pod: kube-flannel-ggq5h
Warning FailedCreate 24m (x2 over 24m) daemonset-controller Error creating: the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Warning FailedCreate 15m (x2 over 16m) daemonset-controller Error creating: the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
And here is the same thing 45 minutes later, when the creation of a new node finally prompted the controller to do its job:
Warning FailedCreate 59m (x2 over 59m) daemonset-controller Error creating: the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Warning FailedCreate 50m (x2 over 51m) daemonset-controller Error creating: the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Normal SuccessfulCreate 4m daemonset-controller Created pod: kube-flannel-h6mfb
Normal SuccessfulCreate 4m daemonset-controller Created pod: kube-flannel-7f5ql
Normal SuccessfulCreate 4m daemonset-controller Created pod: kube-flannel-2pcr9
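While the gap persists, the mismatch is also visible on the DaemonSet itself. A quick check (the namespace and DaemonSet name are guesses based on the pod names above):

```sh
# DESIRED and READY stay apart for as long as the controller refuses to retry
kubectl -n kube-system get daemonset kube-flannel \
  -o custom-columns=NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,READY:.status.numberReady
```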
What you expected to happen:
The daemonset controller retries pod creation after a reasonable time, even if it gets a timeout error.
How to reproduce it (as minimally and precisely as possible):
Make pod creations time out, e.g. by providing a bad webhook (for us it was the VPA admission controller); one way to set this up is sketched below. Start a new node and observe that it doesn't run some of the daemonset pods. Wait 5 minutes and observe that the node still doesn't run them.
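A rough way to simulate the slow admission path (a sketch only, not what we ran; our timeouts came from the VPA admission webhook): register a webhook for pod CREATE whose backend is unreachable, so every pod creation stalls in admission. The webhook names and the blackhole address are placeholders, and whether the create surfaces as the exact timeout error shown above depends on the API server's webhook and request timeouts.

```sh
# Sketch: point a pod-CREATE webhook at an unreachable address so admission hangs.
# All names and the address are placeholders; delete the webhook to recover.
kubectl apply -f - <<'EOF'
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: slow-pod-webhook            # placeholder
webhooks:
  - name: slow-pods.example.com     # placeholder
    failurePolicy: Fail             # reject (rather than admit) the pod if the call gives up
    clientConfig:
      # 203.0.113.1 is a documentation-range address, normally unroutable,
      # so the API server's webhook call hangs until it gives up.
      url: https://203.0.113.1/mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
EOF

# Then add a node and check which daemonset pods it is missing:
kubectl -n kube-system get pods -o wide | grep kube-flannel
```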
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T18:02:47Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"darwin/amd64"}
  Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.5", GitCommit:"753b2dbc622f5cc417845f0ff8a77f539a4213ea", GitTreeState:"clean", BuildDate:"2018-11-26T14:31:35Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): CoreOS 1800.7.0
- Kernel (e.g. uname -a): 4.14.63-coreos
- Install tools: kubernetes-on-aws
/kind bug
About this issue
- State: closed
- Created 6 years ago
- Reactions: 7
- Comments: 47 (10 by maintainers)
Is this fixed in https://github.com/kubernetes/kubernetes/pull/86365?
Same here. As a temporary workaround, you can make a dummy edit on the daemonset, which triggers the controller to try again.
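For reference, the dummy edit can be anything that updates the DaemonSet object without touching its pod template, so the controller re-syncs and creates the missing pods without rolling the existing ones. A hedged example; the namespace, DaemonSet name, and annotation key are assumptions:

```sh
# Touch an object-level annotation (not the pod template) to force a re-sync
# without triggering a rolling update. Namespace, name, and key are assumptions.
kubectl -n kube-system patch daemonset kube-flannel --type merge \
  -p '{"metadata":{"annotations":{"force-resync":"'"$(date +%s)"'"}}}'
```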