kubernetes: Memory pressure shouldn't hard evict static pods

If a node runs out of memory, we kill something. Then, while the node is under memory pressure, we don’t admit best-effort pods (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/eviction/eviction_manager.go#L110). Maybe we can still admit static pods, even if they’re best effort?

@kubernetes/sig-node if we can’t differentiate static pods from non-static ones, maybe we can use the scheduler.alpha.kubernetes.io/critical-pod annotation?
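
For concreteness, a minimal sketch of what that could look like (pod, namespace, and image names are invented): a pod with no resource requests or limits, so it falls in the BestEffort QoS class, carrying the alpha annotation. Whether the kubelet should admit such a pod under memory pressure is exactly the question above.

```yaml
# Sketch only: a BestEffort pod marked critical via the alpha annotation.
apiVersion: v1
kind: Pod
metadata:
  name: example-agent
  namespace: kube-system
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
spec:
  containers:
  - name: agent
    image: example.com/agent:latest
    # No resources.requests/limits, so the QoS class is BestEffort;
    # under memory pressure the kubelet currently refuses to admit
    # such a pod, annotation or not.
```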

Most upvoted comments

@thockin @davidopp @dchen1107 @erictune @janetkuo and I discussed this issue further today. Following is the summary:

The primary problems we addressed are as follows:

  1. Static system pods (kube-proxy & fluentd on certain deployments) get evicted and are not scheduled back onto their nodes.
  2. Static system pods are not “guaranteed” to run on a node and can fail feasibility checks.
  3. Existing alpha API for specifying pod priority via pod-level annotations causes security issues, since anyone who can create a pod can set the annotation.

In the short term,

  1. Continue using “pod-level annotations” for priority.
  2. Feature-gate the “Critical Pod Annotation” across the system, tie it to a configurable namespace (or just to kube-system), and disable it by default. GKE will enable this feature.
  3. Static pods can be marked as “critical” (see the manifest sketch after this list).
  4. Kubelet will not evict “critical pods” and will instead restart them in place.
  5. Kubelet will guarantee that static pod manifests are processed before API pods are processed. This is to ensure that static pods can fit on a node.
  6. A static pod or a DaemonSet pod that gets updated might not be re-admitted by the node due to resource constraints, even if it is critical.
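
To make items 2–4 concrete, here is a sketch of a static pod manifest marked as critical (the file path, image, and flags are illustrative). The kubelet reads it straight from its static pod manifest directory, and per item 2 the annotation would only take effect when the feature gate is enabled and the pod is in the namespace the gate is tied to:

```yaml
# Sketch: /etc/kubernetes/manifests/kube-proxy.yaml (path illustrative).
# The kubelet reads this file directly; no API server is involved.
apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
  namespace: kube-system
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
spec:
  hostNetwork: true
  containers:
  - name: kube-proxy
    image: gcr.io/google_containers/kube-proxy:v1.5.1  # image/tag illustrative
    command:
    - /usr/local/bin/kube-proxy  # flags omitted
```

Per item 4, if such a pod later comes under eviction pressure, the kubelet would restart it in place rather than evict it.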

We decided to have only two levels of priority (Critical & QoS) instead of three (Static, Critical and QoS) for evictions. This is to align with the long-term plan.

This short term solution will solve problems 1 & 3 mentioned above.

Now for the long term,

  1. A preemption & priority scheme will be designed that will allow us to deprecate the “Critical” pod-level annotation. @davidopp is working on this. This solves problem 3. ETA: Design in v1.6, Alpha (or beta?) in v1.7
  2. As part of this design, Kubelet’s role in preemptions will be finalized based on which static pods can be admitted onto a fully committed node. Solves problem 2. ETA: TBD
  3. Critical static pods will use this new priority scheme to guarantee their availability. Solves problem 1. ETA: TBD (based on availability of the feature)
  4. [Closely related] Existing static pods (kube-proxy & fluentd) will switch to using DaemonSets once high-priority DaemonSet pods are “guaranteed” to run on a node. As of now, DaemonSet pods bypass the scheduler and can get “rejected” by the kubelet. (A sketch of this follows below.)
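
And the shape item 4 is heading toward, sketched with the DaemonSet API group as of this writing (names and image illustrative). The blocker noted above is that nothing yet guarantees the kubelet will admit these pods on a fully committed node:

```yaml
# Sketch: kube-proxy as a DaemonSet instead of a static pod.
# The kubelet can still reject these pods on a fully committed node,
# so this migration waits on the priority/preemption work in items 1-2.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: kube-proxy
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: kube-proxy
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
    spec:
      hostNetwork: true
      containers:
      - name: kube-proxy
        image: gcr.io/google_containers/kube-proxy:v1.5.1  # illustrative
```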

@derekwaynecarr @liggitt @smarterclayton does this satisfy your needs?