kubernetes: ReplicaSet controller continuously creating pods failing due to SysctlForbidden

What happened:

Creating a deployment whose pod spec requests an unsafe sysctl that is not whitelisted results in the pods being rejected by the kubelet with SysctlForbidden, and the ReplicaSet controller just keeps creating new pods in a tight loop.

What you expected to happen:

The ReplicaSet controller should back off (ideally exponentially) if the pods it keeps creating are being rejected by the kubelet.

How to reproduce it (as minimally and precisely as possible):

Add:

securityContext:
  sysctls:
  - name: net.core.somaxconn
    value: '10000'   # the value does not matter; any integer works

to a pod spec in a deployment, and ensure the kubelets in the cluster do not have the --allowed-unsafe-sysctls flag configured.

Apply the deployment and you should see over 100 pods created in under a minute.
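
For reference, a minimal Deployment manifest along these lines reproduces it (the names and image are illustrative; the container never actually runs because the kubelet rejects the pod at admission):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sysctl-repro
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sysctl-repro
  template:
    metadata:
      labels:
        app: sysctl-repro
    spec:
      securityContext:
        sysctls:
        - name: net.core.somaxconn
          value: '10000'
      containers:
      - name: app
        image: nginx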

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
 kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:57:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:46:57Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Debian Stretch
  • Kernel (e.g. uname -a): Linux 4.9

/kind bug

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 29
  • Comments: 55 (38 by maintainers)

Most upvoted comments

👋 chiming in from the sidelines over here as someone who was recently impacted by this bug and the subsequent cascading failure of the control plane due to creation of 15k Pod objects in a few minutes. (The result of that may or may not have made it difficult for people to load this very page :octocat: :trollface:).

This problem is somewhat serious in that it creates a Denial of Service vector for anyone who has access to deploy resources to the cluster. I haven’t been able to figure out a way to even mitigate it, as ResourceQuota is also impacted by the race condition that exists because Failed pods are not counted toward the quota.

My understanding of the problem is:

  1. ReplicaSetController finds not enough Pods, so creates them
  2. ResourceQuota admission control ignores Failed and Succeeded Pods so admits new Pods
  3. New Pods enter Pending
  4. kube-scheduler schedules Pending pods onto a Node
  5. kubelet says “nope, I don’t support that sysctl”, marks Pods as “Failed”
  6. GOTO 1

To my 👀, it seems this is primarily a scheduling problem. If the scheduler had the right information to know about the capabilities of the Node, Pods would stay in Pending and the cycle would be broken. Backoff seems warranted as a fallback, but doesn’t really solve or even mitigate the problem. It will make it take longer before the control-plane melts down, though.

It’s only peripherally related to this issue, but --allowed-unsafe-sysctls being a kubelet parameter feels a little off to me and seems like it would be better suited as a cluster-wide parameter (perhaps at the kube-apiserver level or something?). I’m hard pressed to think of a case where I would want to scope this parameter per node, particularly since there’s no way to tell the scheduler about that scoping.

❤️ to those looking at this issue.

This issue was resolved for me by doing these steps:

We must set allowedUnsafeSysctls in the kubelet configuration of ALL worker nodes:

cat << EOF >> /var/lib/kubelet/config.yaml
allowedUnsafeSysctls:
- "net.ipv6.conf.all.disable_ipv6"
EOF

Then restart the kubelet service to apply the change:

systemctl restart kubelet
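
If the kubelet on your nodes is configured via command-line arguments instead of a config file, the equivalent setting (as far as I know) is the --allowed-unsafe-sysctls flag, added to however the kubelet arguments are managed on your nodes:

--allowed-unsafe-sysctls='net.ipv6.conf.all.disable_ipv6'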

I don’t want to add special logic to all workload controllers to deal with the scheduler placing Pods on Nodes where they can’t possibly run. Controller authors (imo) should not have to reason about the Scheduler’s behavior with regard to placement. This leads to a poor separation of concerns (imo). With the work that we’re doing in extensibility to allow users to write custom workload controllers (e.g. Operators), this propagates the concern even more broadly.

Is there a plan for this?

Several proposals here

1. kubelet fix: fix `kubelet fails fast` first

2. kubelet adds annotations and scheduler consumes them: the kubelet may need to add annotations to the node for `allowed-unsafe-sysctls`, so that the scheduler can know about it.

3. add an admission controller to add node affinities for pods with unsafe sysctls.

4. kubelet --allowed-unsafe-sysctls adds a taint to the node by default, so pods with unsafe sysctls need a toleration for it (a sketch follows below).
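
A rough sketch of proposal 4; the taint key here is purely hypothetical, nothing like it exists today:

# Hypothetical: a node whose kubelet enables unsafe sysctls would be tainted, e.g.
#   kubectl taint nodes <node> example.com/unsafe-sysctls=enabled:NoSchedule
# and a pod requesting unsafe sysctls would carry a matching toleration:
tolerations:
- key: "example.com/unsafe-sysctls"
  operator: "Equal"
  value: "enabled"
  effect: "NoSchedule"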

Two workarounds:

1. To use this feature, use taints and tolerations or labels on nodes to schedule those pods onto the right nodes (a label-based sketch follows after this list).

2. use `ResourceQuota` to limit continuous pod creation.
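
For workaround 1, a label-based sketch; the label key is just an example, and you would have to apply it yourself to the nodes whose kubelets allow the sysctl:

# kubectl label nodes <node> example.com/allows-unsafe-sysctls=true
# then pin the pods that request the sysctl to those nodes:
spec:
  nodeSelector:
    example.com/allows-unsafe-sysctls: "true"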

I’d like to add a KEP or PR for any of the proposals.

I concur. This still seems like a very relevant issue worth keeping on the radar, but I think it is better to file a more actionable issue for node once we reach consensus.
/remove-sig node

I think a mechanism should be added for the kubelet to convey to the scheduler that a given pod spec cannot be satisfied on certain nodes. SysctlForbidden is one such case; CPU affinity management is another.

Pardon me if I’m asking the obvious: why can’t the ReplicaSet controller just delete the failed pods it created? Or, when creating new pods, take the failed pods into account and NOT create new ones while failed pods exist?

I think the kubelet can provide a consistent message for pods like this, whether we decide to use softAdmitHandlers or admitHandlers, so that the controllers have a chance to act differently. Right now, the soft admission handlers provide a “Blocked” reason. If we continue using the non-soft admission handler, perhaps we can provide a more generic reason? SysctlForbidden is too specific, and the controllers cannot keep appending to the list of reasons to watch.

Also, this likely goes beyond just the SysctlForbidden case; we need scheduler decisions to match kubelet admission when binding to a node, or this can reappear in another form.

The StatefulSet controller also has similar code which deletes failed pods and recreates them. I guess the thing to think about here is:

  • why are the pods not getting deleted if they are Failed, or is SysctlForbidden not the same as Failed?

I would like to work on fixing this issue; can someone assign it to me?

/assign @surajssd

It would probably be helpful to sketch out the approach you plan to take here, to make sure it makes sense from the node’s and the controller’s perspectives.

Thanks for the details @surajssd. I have added this issue to the open discussion agenda for sig-apps tomorrow. Please attend if possible.
/area kubelet

I meant GKE doesn’t show anything above 1.11.x. As per https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/, a pod with unsafe sysctls will fail to launch on any node which has not enabled them.

Failure to launch a pod, in my understanding, should not create new ones. E.g. if I create a new pod with an invalid image, the pod remains in the ErrImagePull state.

  • Can you share the deployment config you are using (how many replicas, maxSurge, maxUnavailable, etc.)?
  • Are you seeing this behavior when a new Deployment is created or when an existing Deployment is updated?
  • Can you share the logs of kube-controller-manager?

Quota only caps the absolute number of pods; it does not prevent high create/delete/recreate rates.

As an interim safeguard, maybe you could configure a quota?
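
Something minimal like this (the name, namespace, and limit are arbitrary) would cap the absolute pod count in a namespace, though as the comment above notes it won’t stop the create/fail/recreate churn:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-cap
  namespace: default
spec:
  hard:
    pods: "50"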

Adding in node and scheduling to comment on the proposed solution. /sig node /sig scheduling

This issue was raised with sig scheduling. Unfortunately, the ReplicaSetController, Scheduler, and kubelet can all be said to be behaving correctly here. Same goes for admission.

SIG Apps can comment on whether it makes sense to put backoff mechanisms into each controller. The scheduler has a backoff mechanism for a pod that cannot be scheduled, but we do not believe the scheduler can/should understand when a new pod is a replacement for a terminated pod.

Having a check in admission probably doesn’t make sense because different nodes can have different sysctl whitelists. In fact, because the sysctl whitelist is a command line flag to kubelet, controllers cannot make decisions based on sysctl requests.

One proposed solution:

  • kubelet adds the whitelist to NodeStatus (see the sketch below)
  • scheduler adds a SysctlAllowed predicate
  • a misconfigured pod will get stuck in the Pending state with a SysctlForbidden reason.
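
None of this exists today, but purely as an illustration, the whitelist surfaced in NodeStatus might look something like the following (the field name is invented), which would give the scheduler enough information to implement a SysctlAllowed predicate:

status:
  # hypothetical field; not part of any current Kubernetes API
  allowedUnsafeSysctls:
  - net.core.somaxconn
  - net.ipv6.conf.all.disable_ipv6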