kubernetes: New pod incorrectly gets scheduled on a node that has no capacity

What happened?

When the node is running at full capacity and no more pods can be scheduled, the remaining pods stay in the Pending state, as expected. But at this point, if we add a static pod, one of the running pods gets evicted to make room for it.

However, the moment the eviction completes, the scheduler tries to place one of the Pending pods on that host. That pod then fails with an OutOfpods error, because the only slot opened by evicting the running pod was meant for the static pod.
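
For reference, the rejected pod's recorded status can be inspected like this (a sketch; the pod name comes from the repro below, and I'm assuming the kubelet records the admission rejection as phase Failed with reason OutOfpods):

$ kubectl get pod busybox-3 -o jsonpath='{.status.phase}/{.status.reason}: {.status.message}{"\n"}'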

What did you expect to happen?

When the node is running at full capacity (max-pods) and a static pod is added, the scheduler should not schedule an existing Pending pod onto that host.

How can we reproduce it (as minimally and precisely as possible)?

  1. Start a local cluster with --max-pods=3 to make this easy to test:

KUBELET_FLAGS=--max-pods=3 CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote hack/local-up-cluster.sh

I am using CRI-O, but you don't have to; the issue is in the kubelet, so it doesn't matter which runtime you use.
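
Optionally, you can confirm the kubelet registered with the reduced pod capacity (a quick sanity check; it assumes the local node registers as 127.0.0.1, which is what hack/local-up-cluster.sh does):

$ kubectl get node 127.0.0.1 -o jsonpath='{.status.allocatable.pods}{"\n"}'
# should print 3 if the flag took effect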

  2. Start about 5 pods:
kubectl run busybox-1 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-2 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-3 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-4 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-5 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
  3. Wait until some pods are Running and the rest are Pending (how many depends on what is already running in the kube-system namespace; if needed, increase --max-pods and try again). In my case there was only one pod in kube-system, so I had two Running and the rest Pending:
$ kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
busybox-1   1/1     Running   0          33s
busybox-2   1/1     Running   0          20s
busybox-3   0/1     Pending   0          16s
busybox-4   0/1     Pending   0          7s
busybox-5   0/1     Pending   0          3s
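
Before moving on, it can help to confirm the node really is at its pod capacity (a quick check; again assuming the node name is 127.0.0.1):

$ kubectl describe node 127.0.0.1 | grep -E 'Non-terminated Pods|pods:'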
  4. Create a static pod in the kubelet's static pod directory:
[root@localhost static-pods]# cat > test.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    role: myrole
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
          protocol: TCP
[root@localhost static-pods]# pwd
/run/kubernetes/static-pods
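
If you are not sure which directory the kubelet treats as its static pod path, it can be read from the kubelet's flags or config file (a sketch; the config file path below is a placeholder, since local-up-cluster.sh may set this via --pod-manifest-path or via staticPodPath in the kubelet config):

$ ps -ef | grep -o -- '--pod-manifest-path=[^ ]*'
$ grep staticPodPath /path/to/kubelet-config.yaml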
  5. Watch it crash and burn 😃
$ kubectl get pods
NAME                   READY   STATUS              RESTARTS   AGE
busybox-1              1/1     Running             0          88s
busybox-2              0/1     Error               0          75s
busybox-3              0/1     OutOfpods           0          71s
busybox-4              0/1     OutOfpods           0          62s
busybox-5              0/1     Pending             0          58s
static-web-127.0.0.1   0/1     ContainerCreating   0          4s
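
To see exactly how each pod was rejected, the recorded reason and message can be listed in one go (a sketch using kubectl's custom-columns output):

$ kubectl get pods -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,REASON:.status.reason,MESSAGE:.status.message'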

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v0.21.0-beta.1", GitCommit:"d0259f5a5ca1338a68603409a554a554d2c0f6f8", GitTreeState:"clean", BuildDate:"2021-05-21T08:44:40Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.0-alpha.1.41+cc6f12583f2b61", GitCommit:"cc6f12583f2b611e9469a6b2e0247f028aae246b", GitTreeState:"clean", BuildDate:"2021-12-10T10:31:12Z", GoVersion:"go1.17.2", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (0.21) and server (1.24) exceeds the supported minor version skew of +/-1

Cloud provider

N/A

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 20 (15 by maintainers)

Most upvoted comments

When the node is running at full capacity (max-pods) and a static pod is added, the scheduler should not schedule an existing Pending pod onto that host.

In the order described above, you created normal pods that got scheduled on a node. Then you added a static pod, which is not “scheduled” (the kubelet directly receives that pod, so the scheduler has to react). In that case, the static pod should start, and I would generally expect 1 other pod on that node to get OutOfPods (because the static pod “wins”).

However, why is pod 2 in your list in “Error”? It’s possible that the explanation for your “crash and burn” is that pod 2 failed (either because the kubelet incorrectly evicted it or because its own process exited), and then the scheduler saw there was a gap and tried to place 3 or 4, which the kubelet immediately rejected as OutOfPods (because the static pod was starting but the scheduler hadn’t seen it yet).

So if we know why pod 2 is in Error, then we can figure out what happened. In general, though, the “crash and burn” looks like a normal race in which static pod creation on the kubelet and scheduler placement both try to use the gap the kubelet creates for the static pod (when it shuts down pod 2). However, pod 2 should definitely say OutOfPods, not Error, in that scenario.
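
To pin down why pod 2 ended up in Error rather than OutOfPods, something like this would help (a sketch; the pod name is taken from the repro above):

$ kubectl describe pod busybox-2
$ kubectl get pod busybox-2 -o jsonpath='{.status.reason}: {.status.message}{"\n"}'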