kubernetes: Daemonset controller doesn't retry creating pods after a timeout error
What happened:
Same issue as in #67662, but in the daemonset controller. Two pod creations failed because of a timeout, and the creation was not retried, either immediately or after 5 minutes. The pods were finally created 45 minutes later, when (I assume) the creation of another node nudged the daemonset controller to finally do its job. As a result, two nodes were left without networking, logging, and other important services. Here are the events ~25 minutes after the issue (repeated twice because of the 2 nodes and a restart of controller-manager in the hope that it would fix it):
Normal SuccessfulCreate 25m daemonset-controller Created pod: kube-flannel-ggq5h
Warning FailedCreate 24m (x2 over 24m) daemonset-controller Error creating: the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Warning FailedCreate 15m (x2 over 16m) daemonset-controller Error creating: the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
And here is the same thing 45 minutes later, when the creation of a new node finally prompted the controller to do its job:
Warning FailedCreate 59m (x2 over 59m) daemonset-controller Error creating: the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Warning FailedCreate 50m (x2 over 51m) daemonset-controller Error creating: the server was unable to return a response in the time allotted, but may still be processing the request (post pods)
Normal SuccessfulCreate 4m daemonset-controller Created pod: kube-flannel-h6mfb
Normal SuccessfulCreate 4m daemonset-controller Created pod: kube-flannel-7f5ql
Normal SuccessfulCreate 4m daemonset-controller Created pod: kube-flannel-2pcr9
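While the gap persists, the mismatch is also visible on the DaemonSet itself. A quick check (the namespace and DaemonSet name are guesses based on the pod names above):

```sh
# DESIRED and READY stay apart for as long as the controller refuses to retry
kubectl -n kube-system get daemonset kube-flannel \
  -o custom-columns=NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,READY:.status.numberReady
```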
What you expected to happen:
The daemonset controller retries pod creation after a reasonable time, even if it gets a timeout error.
How to reproduce it (as minimally and precisely as possible):
Make pod creations time out, e.g. by providing a bad webhook (for us it was the VPA admission controller); one way to set this up is sketched below. Start a new node and observe that it doesn't run some of the daemonset pods. Wait 5 minutes and observe that the node still doesn't run them.
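A rough way to simulate the slow admission path (a sketch only, not what we ran; our timeouts came from the VPA admission webhook): register a webhook for pod CREATE whose backend is unreachable, so every pod creation stalls in admission. The webhook names and the blackhole address are placeholders, and whether the create surfaces as the exact timeout error shown above depends on the API server's webhook and request timeouts.

```sh
# Sketch: point a pod-CREATE webhook at an unreachable address so admission hangs.
# All names and the address are placeholders; delete the webhook to recover.
kubectl apply -f - <<'EOF'
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: slow-pod-webhook            # placeholder
webhooks:
  - name: slow-pods.example.com     # placeholder
    failurePolicy: Fail             # reject (rather than admit) the pod if the call gives up
    clientConfig:
      # 203.0.113.1 is a documentation-range address, normally unroutable,
      # so the API server's webhook call hangs until it gives up.
      url: https://203.0.113.1/mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
EOF

# Then add a node and check which daemonset pods it is missing:
kubectl -n kube-system get pods -o wide | grep kube-flannel
```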
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T18:02:47Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"darwin/amd64"}
  Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.5", GitCommit:"753b2dbc622f5cc417845f0ff8a77f539a4213ea", GitTreeState:"clean", BuildDate:"2018-11-26T14:31:35Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): CoreOS 1800.7.0
- Kernel (e.g. uname -a): 4.14.63-coreos
- Install tools: kubernetes-on-aws
/kind bug
About this issue
- State: closed
- Created 6 years ago
- Reactions: 7
- Comments: 47 (10 by maintainers)
Is this fixed in https://github.com/kubernetes/kubernetes/pull/86365?
Same here. As a temporary workaround, you can make a dummy edit on the daemonset, which triggers the controller to try again.
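For reference, the dummy edit can be anything that updates the DaemonSet object without touching its pod template, so the controller re-syncs and creates the missing pods without rolling the existing ones. A hedged example; the namespace, DaemonSet name, and annotation key are assumptions:

```sh
# Touch an object-level annotation (not the pod template) to force a re-sync
# without triggering a rolling update. Namespace, name, and key are assumptions.
kubectl -n kube-system patch daemonset kube-flannel --type merge \
  -p '{"metadata":{"annotations":{"force-resync":"'"$(date +%s)"'"}}}'
```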