kubernetes: Kubernetes scheduler fails to schedule pods on nodes with enough resources.
A follow-up to: https://github.com/kubernetes/kubernetes/issues/34772
I have a node with:
Name: gke-test-cluster-1-default-pool-2fef8206-jk88
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=custom-4-8192
beta.kubernetes.io/os=linux
cloud.google.com/gke-nodepool=default-pool
failure-domain.beta.kubernetes.io/region=us-central1
failure-domain.beta.kubernetes.io/zone=us-central1-b
kubernetes.io/hostname=gke-test-cluster-1-default-pool-2fef8206-jk88
Taints: <none>
CreationTimestamp: Mon, 21 Nov 2016 18:44:36 +0100
Phase:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Mon, 21 Nov 2016 19:48:38 +0100 Mon, 21 Nov 2016 19:48:38 +0100 RouteCreated RouteController created a route
OutOfDisk False Mon, 21 Nov 2016 19:48:37 +0100 Mon, 21 Nov 2016 18:44:36 +0100 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Mon, 21 Nov 2016 19:48:37 +0100 Mon, 21 Nov 2016 18:44:36 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 21 Nov 2016 19:48:37 +0100 Mon, 21 Nov 2016 18:44:36 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Mon, 21 Nov 2016 19:48:37 +0100 Mon, 21 Nov 2016 18:45:06 +0100 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses: 10.240.0.5,146.148.80.118
Capacity:
alpha.kubernetes.io/nvidia-gpu: 0
cpu: 4
memory: 8168548Ki
pods: 110
Allocatable:
alpha.kubernetes.io/nvidia-gpu: 0
cpu: 4
memory: 8168548Ki
pods: 110
System Info:
Machine ID: 0251ec6b2c2821ef342fe6d15833325b
System UUID: 0B3867CF-4808-B8E4-C80C-37EF3ADF7373
Boot ID: f986c6b3-0c82-48ce-82a7-2856d61b0be9
Kernel Version: 4.4.21+
OS Image: Google Container-VM Image
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.11.2
Kubelet Version: v1.4.6
Kube-Proxy Version: v1.4.6
PodCIDR: 10.152.4.0/24
ExternalID: 3039357858658290851
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
default my-app-2781155866-0fxn9 550m (13%) 0 (0%) 0 (0%) 0 (0%)
default my-app-2781155866-4p8lv 550m (13%) 0 (0%) 0 (0%) 0 (0%)
default my-app-2781155866-hph3k 550m (13%) 0 (0%) 0 (0%) 0 (0%)
default my-app-2781155866-ulbjk 550m (13%) 0 (0%) 0 (0%) 0 (0%)
default my-app-2781155866-vyp5w 550m (13%) 0 (0%) 0 (0%) 0 (0%)
kube-system fluentd-cloud-logging-gke-test-cluster-1-default-pool-2fef8206-jk88 80m (2%) 0 (0%) 200Mi (2%) 200Mi (2%)
kube-system kube-proxy-gke-test-cluster-1-default-pool-2fef8206-jk88 100m (2%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
2930m (73%) 0 (0%) 200Mi (2%) 200Mi (2%)
On the other hand, a pod that requests only 550m CPU fails to schedule:
kubectl describe pod my-app-2781155866-8kr0g
Name: my-app-2781155866-8kr0g
Namespace: default
Node: gke-test-cluster-1-default-pool-2fef8206-ae7w/
Start Time: Mon, 21 Nov 2016 19:04:26 +0100
Labels: name=my-app
pod-template-hash=2781155866
Status: Failed
Reason: OutOfcpu
Message: Pod
IP:
Controllers: ReplicaSet/my-app-2781155866
Containers:
my-app:
Image: ubuntu:16.04
Port: 8000/TCP
Command:
/bin/bash
-c
Args:
apt update; apt install stress; stress --cpu 1 --io 1 & sleep 999999
Requests:
cpu: 550m
Volume Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-5ora1 (ro)
Environment Variables: <none>
Volumes:
default-token-5ora1:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-5ora1
QoS Class: Burstable
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1h 57m 7 {default-scheduler } Warning FailedScheduling pod (my-app-2781155866-8kr0g) failed to fit in any node
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-ae7w): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-jk88): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-6j58): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-9mys): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-9gxe): Insufficient cpu
1h 48m 11 {default-scheduler } Warning FailedScheduling pod (my-app-2781155866-8kr0g) failed to fit in any node
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-jk88): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-6j58): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-9mys): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-9gxe): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-ae7w): Insufficient cpu
1h 47m 8 {default-scheduler } Warning FailedScheduling pod (my-app-2781155866-8kr0g) failed to fit in any node
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-9mys): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-9gxe): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-ae7w): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-jk88): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-6j58): Insufficient cpu
1h 47m 12 {default-scheduler } Warning FailedScheduling pod (my-app-2781155866-8kr0g) failed to fit in any node
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-9gxe): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-ae7w): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-jk88): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-6j58): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-9mys): Insufficient cpu
1h 47m 19 {default-scheduler } Warning FailedScheduling pod (my-app-2781155866-8kr0g) failed to fit in any node
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-6j58): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-9mys): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-9gxe): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-ae7w): Insufficient cpu
fit failure on node (gke-test-cluster-1-default-pool-2fef8206-jk88): Insufficient cpu
46m 46m 1 {default-scheduler } Normal Scheduled Successfully assigned my-app-2781155866-8kr0g to gke-test-cluster-1-default-pool-2fef8206-ae7w
46m 46m 1 {kubelet gke-test-cluster-1-default-pool-2fef8206-ae7w} Warning OutOfcpu
The pod should be able to schedule on the node (2930m of 4 CPUs is requested there, and the pod requests 550m), but it fails with "Insufficient cpu" for an unknown reason.
Resetting the scheduler helps: another pod is then scheduled on the node, and the node's CPU requests jump from 2930m to 3480m. So it seems that the bug is somewhere around caching or retries.
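For reference, the arithmetic behind that expectation can be sketched as below. This is only an illustration of a request-based fit check using the numbers from the `kubectl describe` output above; the helper names `parse_cpu_millis` and `fits` are hypothetical and this is not the scheduler's actual code.

```python
from typing import List


def parse_cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as '550m' or '4' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def fits(allocatable: str, requested: List[str], new_request: str) -> bool:
    """Return True if the new request fits next to the existing requests."""
    used = sum(parse_cpu_millis(q) for q in requested)
    return used + parse_cpu_millis(new_request) <= parse_cpu_millis(allocatable)


# CPU requests of the 7 non-terminated pods on node ...-jk88 (from the output above).
node_requests = ["550m"] * 5 + ["80m", "100m"]
used_millis = sum(parse_cpu_millis(q) for q in node_requests)

print(used_millis)                       # total requested millicores on the node
print(fits("4", node_requests, "550m"))  # whether one more 550m pod fits in 4 CPUs
```

With these inputs the node has 2930m requested, and adding one more 550m pod gives 3480m, which is still below the 4000m allocatable. That matches what happens after the scheduler is reset.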
About this issue
- State: closed
- Created 8 years ago
- Comments: 28 (18 by maintainers)
Commits related to this issue
- Merge pull request #37284 from wojtek-t/extend_scheduler_log (automatic merge from submit-queue): Log when pod expires in scheduler. Ref #37232. Committed to kubernetes/kubernetes by deleted user 8 years ago.
- Merge pull request #37379 from wojtek-t/safe_schedulercache (automatic merge from submit-queue): Try self-repair scheduler cache or panic. Fix #37232. Committed to kubernetes/kubernetes by deleted user 8 years ago.
I just ran into this. How can I get this to work properly?