kubernetes: Static pods stopped first when priority is not explicitly set

What happened?

etcd / kube-apiserver are stopped before most other containers, despite using shutdownGracePeriodByPodPriority. On a single-node server running multus (and likely any other CNI that talks to the API), this spams the logs and makes the shutdown slower.

What did you expect to happen?

shutdownGracePeriodByPodPriority should stop the pods one priority group at a time.

How can we reproduce it (as minimally and precisely as possible)?

On a single-node cluster, run a pod that is slow to shut down (sleep inf), shut the node down, and look at the logs to see the order in which the pods are killed.
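
For reference, a minimal "slow to shut down" pod could look like this (an illustrative manifest; the name and image are arbitrary, and a long finite sleep is used because busybox's sleep may not accept inf):

apiVersion: v1
kind: Pod
metadata:
  name: slow-shutdown              # illustrative name
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: sleep
      image: busybox
      # sleep runs as PID 1 with no SIGTERM handler, so the container only
      # exits when the kubelet sends SIGKILL at the end of the grace period.
      command: ["sleep", "86400"]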

Anything else we need to know?

shutdownGracePeriodByPodPriority:
  - priority: 2000000001
    shutdownGracePeriodSeconds: 10
  - priority: 2000000000
    shutdownGracePeriodSeconds: 10
  - priority: 0
    shutdownGracePeriodSeconds: 60

Reading the KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2712-pod-priority-based-graceful-node-shutdown, the summary says "Kubelet graceful shutdown should take the pod priority values into account to determine the order in which the pods are stopped", so I would expect all pods with priority between 0 and 2000000000 to be stopped first, then coredns (2000000000), then the rest.

This is definitely not what’s happening:

Nov 15 21:02:02 atsc2 kubelet[9573]: I1115 21:02:02.083004    9573 nodeshutdown_manager_linux.go:134] "Creating node shutdown manager" shutdownGracePeriodRequested="0s" shutdownGracePeriodCriticalPods="0s" shutdownGracePeriodByPodPriority=[{Priority:0 ShutdownGracePeriodSeconds:60} {Priority:2000000000 ShutdownGracePeriodSeconds:10} {Priority:2000000001 ShutdownGracePeriodSeconds:10}]
...
Nov 15 21:06:09 atsc2 kubelet[9573]: I1115 21:06:09.372579    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="kube-system/kube-scheduler-atemeappliance" podUID=81872379106beaec249553e0efae9ec6 containerName="kube-scheduler" containerID="containerd://282605472eed19f2d131d9805b6616682c677af2ab281be0763d55bdb7a7bad8" gracePeriod=30
Nov 15 21:06:09 atsc2 kubelet[9573]: I1115 21:06:09.372949    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="kube-system/kube-controller-manager-atemeappliance" podUID=ec037a7f739a4afa72ccb6799fceb193 containerName="kube-controller-manager" containerID="containerd://77d0c10f17b4a95aa30b11615e838868fbab6a9c53f650bc75d8ed94ff6f8173" gracePeriod=30
Nov 15 21:06:09 atsc2 kubelet[9573]: I1115 21:06:09.373028    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="kube-system/etcd-atemeappliance" podUID=a42b112129999807250e3e1cb281cd6c containerName="etcd" containerID="containerd://9774566e9a53918fb4648cdd7413b7ebc2eee4ac2cbdc09d33c2c5bea9874761" gracePeriod=30
Nov 15 21:06:09 atsc2 kubelet[9573]: I1115 21:06:09.373100    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="kube-system/kube-apiserver-atemeappliance" podUID=56ed88c0ff350d8516fd285311afe657 containerName="kube-apiserver" containerID="containerd://c93447aa665d04e1e66784333b2abe2231c96ef465f8c6fec51d8537cbec75fa" gracePeriod=30
Nov 15 21:06:09 atsc2 kubelet[9573]: I1115 21:06:09.373388    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="kube-system/kube-sriov-device-plugin-dds5b" podUID=310293cf-ee11-4fa4-acad-0b87559e3836 containerName="kube-sriovdp" containerID="containerd://9ec25c43fb5c1c017f1d5576fb6feb42a703d872813d84b700d613b99fa6c419" gracePeriod=30
Nov 15 21:06:09 atsc2 kubelet[9573]: I1115 21:06:09.373509    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="default/REDACTED" podUID=1d1880ee-6b18-42e0-810a-b12100763fd5 containerName="REDACTED" containerID="containerd://141e1363c2e3f7db4e6b5dcbc51ac6d8975cb244435659f20c0d8309107a1b8f" gracePeriod=30
Nov 15 21:06:09 atsc2 kubelet[9573]: I1115 21:06:09.560894    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="default/REDACTED" podUID=3597a4c8-32a8-4089-bbd3-77c024f45dbe containerName="REDACTED" containerID="containerd://1d29a04cd823e549b6d1bbb58dc6afc2d8ceb99f0b8631796378d8576bb44c96" gracePeriod=45
Nov 15 21:06:09 atsc2 kubelet[9573]: I1115 21:06:09.565408    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="default/REDACTED" podUID=0c3d38c6-12f7-4253-9f54-a31b5c2d01d5 containerName="REDACTED" containerID="containerd://40fec73c8db538ed30e366853a1b83c3dfde2e7ced25d8bb681271d3950a1a8b" gracePeriod=30
Nov 15 21:06:09 atsc2 kubelet[9573]: I1115 21:06:09.574084    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="default/REDACTED" podUID=db2e44db-16dc-419a-8799-2b4cadb08063 containerName="REDACTED" containerID="containerd://a178800ba300018b202434017efda11d84e55fc32c148304769b8ef79e291ae7" gracePeriod=30
Nov 15 21:06:12 atsc2 kubelet[9573]: I1115 21:06:12.611940    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="default/REDACTED" podUID=a70a245b-fffc-4b5f-b25e-ecb71ea09c35 containerName="REDACTED" containerID="containerd://a21c392ec48551915b807380d3084e252057e30fef501b21a65778202f4149d4" gracePeriod=30
Nov 15 21:06:20 atsc2 kubelet[9573]: I1115 21:06:20.549014    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="ingress-nginx/ingress-nginx-controller-85cbcdf4dd-kbpr7" podUID=8b408d5a-d723-4869-93e5-2436a4ce891a containerName="controller" containerID="containerd://bd3b70c0db023eddadc65cdf24eb18202414751a4996fdd0fbdc9b6ef9639665" gracePeriod=20
Nov 15 21:07:10 atsc2 kubelet[9573]: I1115 21:07:10.138212    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="kube-system/coredns-5dcd989fd8-drjsv" podUID=1132caa1-9550-407c-8b75-d5c02e750269 containerName="coredns" containerID="containerd://6f9cfc43cfb0162d5258c16c4ee7bc1801135257b1a4a580687d7af26488f5b3" gracePeriod=10
Nov 15 21:07:19 atsc2 kubelet[9573]: I1115 21:07:19.373437    9573 kuberuntime_container.go:722] "Killing container with a grace period" pod="kube-system/kube-proxy-h2q9q" podUID=b17d6ff3-eefa-4012-8479-a0c06c47b1c6 containerName="kube-proxy" containerID="containerd://cc0d5fd4d420bc3292296d4505fbcccbf48388172c3158bfb69053eaacee9ee1" gracePeriod=10

...

Nov 15 21:07:18 atsc2 kubelet[9573]: E1115 21:07:18.191640    9573 kuberuntime_manager.go:999] "Failed to stop sandbox" podSandboxID={Type:containerd ID:791fda883bf2aa8b0ebc6381e8b7d7182e280bc2a83df112fbeb19c32062e61a}
Nov 15 21:07:18 atsc2 kubelet[9573]: E1115 21:07:18.191669    9573 kubelet.go:1784] failed to "KillPodSandbox" for "0c3d38c6-12f7-4253-9f54-a31b5c2d01d5" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"791fda883bf2aa8b0ebc6381e8b7d7182e280bc2a83df112fbeb19c32062e61a\": plugin type=\"multus-cni\" name=\"multus-cni-network\" failed (delete): Multus: [default/REDACTED]: error getting pod with error: Get \"https://198.19.254.254:6443/api/v1/namespaces/default/pods/REDACTED?timeout=1m0s\": dial tcp 198.19.254.254:6443: connect: connection refused"
Nov 15 21:07:18 atsc2 kubelet[9573]: E1115 21:07:18.191692    9573 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"0c3d38c6-12f7-4253-9f54-a31b5c2d01d5\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"791fda883bf2aa8b0ebc6381e8b7d7182e280bc2a83df112fbeb19c32062e61a\\\": plugin type=\\\"multus-cni\\\" name=\\\"multus-cni-network\\\" failed (delete): Multus: [default/REDACTED]: error getting pod with error: Get \\\"https://198.19.254.254:6443/api/v1/namespaces/default/pods/REDACTED?timeout=1m0s\\\": dial tcp 198.19.254.254:6443: connect: connection refused\"" pod="default/REDACTED" podUID=0c3d38c6-12f7-4253-9f54-a31b5c2d01d5

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6", GitCommit:"b39bf148cd654599a52e867485c02c4f9d28b312", GitTreeState:"clean", BuildDate:"2022-09-21T13:19:24Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6", GitCommit:"b39bf148cd654599a52e867485c02c4f9d28b312", GitTreeState:"clean", BuildDate:"2022-09-21T13:12:04Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

N/A

OS version

# On Linux:
$ cat /etc/os-release

Appliance based on Alma Linux 8.6

$ uname -a
... 4.18.0-372.32.1.el8_6.x86_64 ...

Install tools

kubeadm

Container runtime (CRI) and version (if applicable)

containerd

Related plugins (CNI, CSI, …) and versions (if applicable)

multus

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 18 (17 by maintainers)

Most upvoted comments

Each shutdownGracePeriodByPodPriority entry will match all Pods whose priority is less than or equal to that entry's priority.

This may be what you want:

shutdownGracePeriodByPodPriority:
  - priority: 2000000001
    shutdownGracePeriodSeconds: 10
  - priority: 2000000000
    shutdownGracePeriodSeconds: 10
  - priority: 1999999999
    shutdownGracePeriodSeconds: 60

https://github.com/kubernetes/kubernetes/blob/3f823c0daa002158b12bfb2d53bcfe433516659d/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux_test.go#L646-L716
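
To make the matching rule concrete, here is the suggested config again with illustrative comments (my reading of the behaviour described in this issue: a pod whose spec.priority is not set is treated as priority 0, and the groups are shut down lowest priority first):

shutdownGracePeriodByPodPriority:
  - priority: 1999999999            # pods with priority <= 1999999999, including
    shutdownGracePeriodSeconds: 60  # static pods with no explicit priority (0); stopped first
  - priority: 2000000000            # system-cluster-critical pods such as coredns; stopped next
    shutdownGracePeriodSeconds: 10
  - priority: 2000000001            # system-node-critical pods; stopped last
    shutdownGracePeriodSeconds: 10

As far as I can tell, this means that unless the control-plane static pods get an explicit priority in their manifests, they still land in the first group no matter where the lowest boundary is placed.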