kubernetes: Scheduling errors and OutOfCpu during cluster autoscaling

Is this a request for help?: No

What keywords did you search in Kubernetes issues before filing this one? OutOfCpu. Issue #29846 may be related; however, it does not involve cluster autoscaling and is limited to an OutOfCpu condition.


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Kubernetes version (use kubectl version): 1.4.0

Environment: Google GKE

  • Cloud provider or hardware configuration: Google
  • OS (e.g. from /etc/os-release): Google Container-VM Image V55
  • Kernel (e.g. uname -a): Linux gke-omega-cluster-default-pool-d07fc610-clps 4.4.14+ #1 SMP Tue Sep 20 10:32:07 PDT 2016 x86_64 Intel® Xeon® CPU @ 2.30GHz GenuineIntel GNU/Linux
  • Install tools:
  • Others:

What happened: Using a cluster of n1-standard-1 nodes with 1 initial node and cluster autoscaling enabled, I started 34 pods requesting 100m CPU each. Since each node has only 1 CPU and 200m CPU is reserved for Kubernetes services on each node (300m CPU on the first node), each node can run 8 of these pods (7 on the first node). All 34 pods therefore require 5 nodes. The sequence of events:

  • Around 15:06:30: All pods are created. As expected, most pods remain in the Pending state.
  • 15:06:42: Kubernetes correctly triggers a scale-up; however, the scheduler only requests a single node.
  • 15:07:34: Two pods ("out-of-cpurf679" and "out-of-cpuwi5ge") are killed with "Deleted by rescheduler in order to schedule critical pod kube-system_heapster-v1.2.0-3455740371-j973x".
  • 15:09:33: Scale-up is triggered again, scaling from 2 to 3 nodes.
  • 15:11:48: Two other pods ("out-of-cpu3gsro" and "out-of-cpun30p1") fail with OutOfCpu when 10 pods are assigned to the newly added 3rd node.
  • 15:12:31: Next scale-up, going from 3 to 4 nodes. Eventually, 30 pods are running, 2 pods are in the OutOfCpu state, and 2 pods have disappeared.
  • 15:15:16: Another scale-up is triggered from 4 to 5 nodes. The 5th node remains idle as there are no more pods to schedule.
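For reference, the per-node arithmetic above can be checked directly against the cluster; a minimal sketch, assuming kubectl points at the autoscaled cluster (the node name is one from this run and will differ):

# show how much CPU is already requested on a given node
kubectl describe node gke-omega-cluster-default-pool-d07fc610-clps | grep -A 6 "Allocated resources"
# list all pods, including failed ones, with the node each was assigned to
kubectl get pods -a -o wide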

Please refer to this zip file for details: out-of-cpu-bug.zip

“out-of-cpu-bug.yaml” is the YAML that was used to run the experiment through the following commands:

  1. gcloud container clusters create omega-cluster --machine-type=n1-standard-1 --zone us-central1-c --num-nodes 1 --enable-autoscaling --min-nodes=1 --max-nodes=8
  2. bash -c 'for i in {1..34}; do kubectl create -f out-of-cpu-bug.yaml; done'
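The actual out-of-cpu-bug.yaml is inside the zip; the following is only an assumed sketch of its shape, inferred from the 100m request above and from the generated pod names (e.g. out-of-cpurf679), which suggest generateName is used. The image is a placeholder, and the commented-out affinity section mentioned further below is omitted:

apiVersion: v1
kind: Pod
metadata:
  generateName: out-of-cpu   # assumed; would yield names like out-of-cpurf679
spec:
  containers:
  - name: out-of-cpu
    # placeholder image; the real manifest in the zip may use a different one
    image: gcr.io/google_containers/pause-amd64:3.0
    resources:
      requests:
        cpu: 100m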

"out-of-cpu-bug-events.txt" is the capture of events from 'kubectl get events -w' up to the final scale-up. "out-of-cpu-bug-pods1.txt" shows the output of 'kubectl get nodes,pods -a' after the OutOfCpu condition of two pods. Note that only 32 pods show up, because 2 pods were killed earlier and have "disappeared". "out-of-cpu-bug.pods2.txt" shows the nodes and pods after the final scale-up. The new node "gke-omega-cluster-default-pool-d07fc610-w79v" is unused.

What you expected to happen:

  1. Scale-up should have been triggered to bring the cluster from 1 to 5 nodes in a single step.
  2. Pods “out-of-cpurf679” and “out-of-cpuwi5ge” should not have been killed and disappeared.
  3. Pods “out-of-cpu3gsro” and “out-of-cpun30p1” should not have been scheduled onto a full node and should not have received an OutOfCpu condition.
  4. Scale-up should not have produced an idle node.

How to reproduce it (as minimally and precisely as possible):

See above. Create a cluster on Google with gcloud container clusters create omega-cluster --machine-type=n1-standard-1 --zone us-central1-c --num-nodes 1 --enable-autoscaling --min-nodes=1 --max-nodes=8

Schedule pods with the supplied YAML: bash -c 'for i in {1..34}; do kubectl create -f out-of-cpu-bug.yaml; done'

The exact numbers don’t matter.
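To watch the failure develop, the following observation commands (suggested here, not part of the original experiment) can be run in separate terminals while the pods are being created:

# stream scheduling, rescheduler, and scale-up events
kubectl get events -w
# watch pod placement and catch pods that end up in the OutOfCpu state
kubectl get pods -a -o wide -w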

Anything else do we need to know:

The bug(s) appear to involve some race condition, as the error does not always happen (although it happens frequently). The more pods that are created, the more likely the situation is observed. The YAML has a commented-out section to request pod affinity; enabling it seems to increase the likelihood of OutOfCpu failures. Even so, the above commands show the failure roughly one time in two. Increase the number of pods to 60 or more and the problem occurs almost every time. I consider this problem a showstopper for any serious cluster autoscaling application.


Most upvoted comments

In case you come here: the scheduler has been completely rewritten since 1.4, so even if there is a similar-looking issue, it can't be the same one.