kubernetes: ReplicaSet controller continuously creating pods failing due to SysctlForbidden
What happened:
Creating a Deployment whose pod spec requests an unsafe sysctl that is not whitelisted results in the pods being rejected by the kubelet with SysctlForbidden, and the ReplicaSet controller just keeps creating new pods in a tight loop.
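The rejected pods end up in the Failed phase with a status along these lines (a sketch; the exact message wording is approximate):

```yaml
status:
  phase: Failed
  reason: SysctlForbidden
  message: forbidden sysctl "net.core.somaxconn" not whitelisted   # paraphrased; exact kubelet wording may differ
```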
What you expected to happen:
The ReplicaSet controller should back off (ideally exponentially) when the pods it creates are being rejected by the kubelet.
How to reproduce it (as minimally and precisely as possible):
Add:
securityContext:
  sysctls:
  - name: net.core.somaxconn
    value: '10000' # the value does not matter; any integer works
to the pod spec of a Deployment, and ensure the kubelets in the cluster do not have the --allowed-unsafe-sysctls flag configured.
Apply the Deployment and you should see over 100 pods created in under a minute.
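For convenience, a minimal Deployment sketch that exercises this (the image and names here are illustrative; any image works, since the rejection happens at kubelet admission before the image matters):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sysctl-repro            # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sysctl-repro
  template:
    metadata:
      labels:
        app: sysctl-repro
    spec:
      securityContext:
        sysctls:
        - name: net.core.somaxconn
          value: "10000"        # any integer triggers the same rejection
      containers:
      - name: app
        image: nginx            # illustrative image
```

With no --allowed-unsafe-sysctls whitelist on the kubelets, the ReplicaSet behind this Deployment churns out replacement pods as each one is rejected.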
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:57:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:46:57Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Debian Stretch
- Kernel (e.g. uname -a): Linux 4.9
/kind bug
About this issue
- State: open
- Created 5 years ago
- Reactions: 29
- Comments: 55 (38 by maintainers)
đź‘‹ chiming in from the sidelines over here as someone who was recently impacted by this bug and the subsequent cascading failure of the control plane due to creation of 15k Pod objects in a few minutes. (The result of that may or may not have made it difficult for people to load this very page :octocat: :trollface:).
This problem is somewhat serious in that it creates a Denial of Service vector for anyone who has access to deploy resources to the cluster. I haven’t been able to figure out a way to even mitigate it, as ResourceQuota is also impacted by the race condition that exists due to not considering Failed pods as counting toward the quota.
My understanding of the problem is:
1. ReplicaSetController finds not enough Pods, so creates them
2. ResourceQuota admission control ignores Failed and Succeeded Pods, so admits new Pods
3. kube-scheduler schedules Pending Pods onto a Node
4. kubelet says “nope, I don’t support that sysctl”, marks the Pods as “Failed”
5. GOTO 1
To my 👀, it seems this is primarily a scheduling problem. If the scheduler had the right information to know about the capabilities of the Node, Pods would stay in Pending and the cycle would be broken. Backoff seems warranted as a fallback, but it doesn’t really solve or even mitigate the problem; it just makes it take longer before the control plane melts down.
It’s only peripherally related to this issue, but --allowed-unsafe-sysctls being a kubelet parameter feels a little off to me and seems better suited as a cluster-wide parameter (perhaps at the kube-apiserver level or something?). I’m hard pressed to think of a case where I would want to scope this parameter to individual nodes, particularly since there’s no way to tell the scheduler about that scoping.
❤️ to those looking at this issue.
This issue was resolved for me with the following steps:
We must allow the needed sysctls via “allowedUnsafeSysctls:” in the kubelet configuration of ALL worker nodes.
Then restart the kubelet service to apply the change:
systemctl restart kubelet
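A minimal sketch of that kubelet configuration change, assuming the kubelet reads a config file at /var/lib/kubelet/config.yaml (the path and the whitelisted sysctl are assumptions; adjust to your setup):

```yaml
# /var/lib/kubelet/config.yaml  (path is an assumption; depends on how the kubelet is launched)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
allowedUnsafeSysctls:
- "net.core.somaxconn"   # whitelist only the sysctls your workloads actually need
```

The equivalent command-line flag is --allowed-unsafe-sysctls. Note that this only unblocks the specific sysctl on those nodes; it does not address the controller behaviour this issue is about.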
I don’t want to add special logic to all workload controllers to deal with the scheduler placing Pods on Nodes where they can’t possibly run. Controller authors (imo) should not have to reason about the scheduler’s behavior with respect to placement; that leads to a poor separation of concerns (imo). With the work that we’re doing in extensibility to allow users to write custom workload controllers (e.g. Operators), this propagates the concern even more broadly.
I concur. This still seems a very relevant issue worth keeping on the radar, but I think it is better to file a more actionable issue for SIG Node once we reach consensus.
/remove-sig node
I think a mechanism should be added for the kubelet to convey to the scheduler that a given pod spec cannot be satisfied on certain nodes. SysctlForbidden is one such case; CPU affinity management is another.
Pardon me if I’m asking the obvious: why can’t the ReplicaSet controller just delete the failed pods it created? Or, when creating new pods, take the failed pods into account and NOT create new ones while failed pods exist?
I think the kubelet can provide a consistent message for pods like this, whether we decide to use softAdmitHandlers or admitHandlers, so that the controllers may have a chance to act differently. Right now, the soft admission handlers provide a “Blocked” reason. If we continue using the non-soft admission handler, perhaps we can provide a more generic reason? SysctlForbidden is too specific, and the controllers cannot keep appending to the list of reasons to watch.
Also, this likely goes beyond just the SysctlForbidden case; we need scheduler decisions to match kubelet admission when binding a node, or this can reappear in another form. The StatefulSet controller also has similar code which deletes failed pods and recreates them. I guess the thing to think about here is…
/assign @surajssd
It would probably be helpful to sketch out the approach you plan to take here, to make sure it makes sense from both the node’s and the controller’s perspective.
Thanks for the details @surajssd. I have added this issue to the open discussion agenda for sig-apps tomorrow. Please attend if possible.
/area kubelet
I meant GKE doesn’t show anything above 1.11.x. As per https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/:
“A pod with the unsafe sysctls will fail to launch on any node which has not enabled unsafe sysctls.”
Failure to launch a pod, in my understanding, should not create new ones. E.g. if I create a new pod with an invalid image, the pod remains in the ErrImagePull state.
Quota only prevents absolute numbers of pods; it does not prevent high create/delete/recreate rates.
As an interim safeguard, maybe you could configure a quota?
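A sketch of such a quota, assuming a namespace named team-a (both names are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-cap          # illustrative name
  namespace: team-a      # assumed namespace
spec:
  hard:
    pods: "50"           # hard ceiling on the number of Pods in the namespace
```

As noted above, though, this only caps the absolute pod count, and Failed pods are not counted against it, so it does not stop the create/reject/recreate churn.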
Adding in node and scheduling to comment on the proposed solution. /sig node /sig scheduling
This issue was raised with sig scheduling. Unfortunately, the ReplicaSetController, Scheduler, and kubelet can all be said to be behaving correctly here. Same goes for admission.
SIG Apps can comment on whether it makes sense to put backoff mechanisms into each controller. The scheduler has a backoff mechanism for a pod that cannot be scheduled, but we do not believe the scheduler can/should understand when a new pod is a replacement for a terminated pod.
Having a check in admission probably doesn’t make sense because different nodes can have different sysctl whitelists. In fact, because the sysctl whitelist is a command line flag to kubelet, controllers cannot make decisions based on sysctl requests.
One proposed solution: have the workload controllers react (e.g. back off) when the pods they create are rejected by the kubelet with the SysctlForbidden reason.