kubernetes: DaemonSet scheduling is broken in combination with admission plugins

Take the PodNodeSelector admission plugin as a specific example, although the same applies in general to any admission plugin that modifies pod properties affecting scheduling.

The issue is that the DaemonSet controller doesn’t see those admission modifications when determining which nodes it should schedule pods to, so it schedules pods even to nodes where it shouldn’t. In theory it could also fail to schedule pods where it should, if the admission plugin loosened the nodeSelector instead of tightening it.
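To make the mismatch concrete, here is a toy model in Python. It uses plain dictionaries rather than any Kubernetes API; the node names, labels, and both helper functions are hypothetical, and `admit` only loosely imitates how PodNodeSelector merges a namespace-level selector into the pod.

```python
def matches(selector, labels):
    """True if every key/value in the selector is present in the node labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def admit(pod_selector, namespace_selector):
    """Loosely imitates PodNodeSelector: merge the namespace selector into the pod's."""
    merged = dict(pod_selector)
    for key, value in namespace_selector.items():
        if key in merged and merged[key] != value:
            raise ValueError(f"conflicting selector for {key!r}")
        merged[key] = value
    return merged


# Hypothetical cluster: three nodes, only one of them labelled region=infra.
nodes = {
    "node-a": {"region": "infra"},
    "node-b": {"region": "apps"},
    "node-c": {"region": "apps"},
}

template_selector = {}                    # what the DaemonSet template says
namespace_selector = {"region": "infra"}  # what admission adds at pod creation

# Nodes the DaemonSet controller picks, looking only at the template.
planned = sorted(
    name for name, labels in nodes.items() if matches(template_selector, labels)
)

# Nodes the created pods can actually run on, after admission.
effective = admit(template_selector, namespace_selector)
runnable = sorted(
    name for name, labels in nodes.items() if matches(effective, labels)
)

# Pods targeted at these nodes can never run there.
misplaced = sorted(set(planned) - set(runnable))
print(misplaced)
```

In this sketch the controller plans a pod for every node, while only node-a satisfies the selector the pod ends up with.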

We need to account for this case because those admission plugins are an important part of the Kubernetes security model.

When the DS sets nodeName directly (which is still how DS scheduling currently works), the kubelet fails the pod because the node doesn’t match its nodeSelector. So a pod with restartPolicy: Always gets failed, the DS removes it and creates a new one, and the resulting loop can cripple the cluster.

If we move to scheduling by affinity, this instead results in excessive pods being created and left in the Pending state. Depending on the cluster size and how restrictive the nodeSelector is, thousands of pods can be stuck in Pending, degrading performance and possibly exhausting quota.
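The difference between the two failure modes can be put into rough numbers. The Python sketch below is only back-of-the-envelope arithmetic under stated assumptions: the cluster size, the number of sync rounds, and the `simulate` helper are all made up, and it assumes one failed pod per mismatched node per sync round with no back-off or rate limiting.

```python
def simulate(node_count, runnable_count, sync_rounds):
    """Return (pods_created_with_node_name, pods_pending_with_affinity)."""
    mismatched = node_count - runnable_count

    # nodeName mode: pods on matching nodes are created once and keep running.
    created = runnable_count
    # Every round the controller recreates a pod for each mismatched node,
    # the kubelet fails it, and the controller deletes it again.
    for _ in range(sync_rounds):
        created += mismatched

    # Affinity mode: one pod per mismatched node is created once and then
    # stays Pending, since no node satisfies both affinity and nodeSelector.
    pending = mismatched
    return created, pending


# Hypothetical numbers: 5000 nodes, of which only 10 satisfy the
# nodeSelector that admission adds.
created, pending = simulate(node_count=5000, runnable_count=10, sync_rounds=100)
print(created, pending)
```

Under these assumptions the nodeName loop keeps creating pods for as long as the controller keeps syncing, while the affinity variant settles at one Pending pod per mismatched node.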

The issue in both cases is that the DaemonSet controller makes scheduling decisions before the pod is created, based purely on its template, with no idea of what the pod will look like once created, for example when the nodeSelector ends up different from what it accounted for.

We need either:

a) The DS to be able to simulate the admission chain for that pod, so it can see the modifications.

b) To create the pods first and assign nodes only afterwards. This would still result in excess pod(s) being created, but possibly only 1 (or batch_size), as you could stop at the point where no more nodes need a DS pod, based on the actual nodeSelector (and other properties) of the created pod.

c)
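As an illustration of option (b), the following Python sketch creates pods in batches, reads back the selector each pod ends up with, and stops as soon as a created pod finds no free node. It is a toy model and not a proposed implementation: `create_pod`, `place_daemon_pods`, and the node labels are hypothetical, and the admission step is reduced to a plain dictionary merge that ignores conflicts.

```python
def matches(selector, labels):
    """True if every key/value in the selector is present in the node labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def create_pod(template_selector, namespace_selector):
    """Stand-in for the API server: return the pod's selector after admission."""
    return {**template_selector, **namespace_selector}


def place_daemon_pods(nodes, template_selector, namespace_selector, batch_size=1):
    """Create pods first, then bind them using their actual selector."""
    assigned = []  # nodes that received a pod
    surplus = 0    # pods that were created but found no node to run on
    while surplus == 0:
        batch = [
            create_pod(template_selector, namespace_selector)
            for _ in range(batch_size)
        ]
        for selector in batch:
            free = [
                name
                for name, labels in sorted(nodes.items())
                if name not in assigned and matches(selector, labels)
            ]
            if free:
                assigned.append(free[0])
            else:
                surplus += 1  # this pod shows that no more nodes need one
    return assigned, surplus


# Hypothetical cluster: two of three nodes satisfy the admitted selector.
nodes = {
    "node-a": {"region": "infra"},
    "node-b": {"region": "apps"},
    "node-c": {"region": "infra"},
}
assigned, surplus = place_daemon_pods(nodes, {}, {"region": "infra"})
print(assigned, surplus)
```

In this model the number of excess pods never exceeds batch_size, because the loop stops after the first batch in which a pod finds no node.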

This might also affect the current move of part of DS scheduling to the default scheduler.

Here are some of the issues showing collisions between admission plugins and DS scheduling:

/kind bug
/priority important-soon
/sig apps
/sig scheduling

@kow3ns @janetkuo @deads2k @liggitt @smarterclayton @mfojtik @kubernetes/sig-apps-bugs @kubernetes/sig-scheduling-bugs

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 36 (34 by maintainers)

Most upvoted comments

Side note: in @lavalamp’s kubectl apply reboot (server-side apply) design doc, supporting dry-run is one of the required changes.