kubernetes: Node status is NotReady

I have a cluster, and over the weekend some of the pods went into Pending status. Digging in, I found that several nodes are in NotReady state; they were all Ready last week. I’m not sure how to debug what happened, or how to remove the broken nodes and fix the issue (what I think I need to run is sketched after the node listing below).

kubectl get nodes
NAME                                       STATUS     AGE
gke-cluster-1-default-pool-777adf16-an5j   Ready      4d
gke-cluster-1-default-pool-777adf16-erra   NotReady   4d
gke-cluster-1-default-pool-777adf16-ge2r   Ready      4d
gke-cluster-1-default-pool-777adf16-t2aj   Ready      4d
gke-cluster-1-default-pool-777adf16-vvhx   Ready      4d
gke-cluster-1-default-pool-777adf16-w20k   NotReady   4d
gke-cluster-1-default-pool-777adf16-wib8   Ready      4d
gke-cluster-1-default-pool-777adf16-wizq   Ready      4d
gke-cluster-1-default-pool-777adf16-wteu   Ready      4d
gke-cluster-1-default-pool-777adf16-x07o   Ready      4d
gke-cluster-1-default-pool-777adf16-xhfh   Ready      4d
gke-cluster-1-default-pool-777adf16-xsix   NotReady   4d
gke-cluster-1-default-pool-777adf16-y98j   NotReady   4d
gke-cluster-1-default-pool-777adf16-yjxa   Ready      4d
gke-cluster-1-default-pool-777adf16-z7cz   Ready      4d
gke-cluster-1-default-pool-777adf16-z8cn   Ready      4d
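
What I think I need to run to inspect and then replace one of the broken nodes, though I haven’t verified any of it (the node name is just one of the NotReady ones above, the zone is a placeholder, and the drain flags may differ slightly on this kubectl version):

# Look at Conditions and Events to see why the node stopped reporting Ready
kubectl describe node gke-cluster-1-default-pool-777adf16-erra

# Evict whatever is still scheduled there, then remove it from the API
kubectl drain gke-cluster-1-default-pool-777adf16-erra --ignore-daemonsets --force
kubectl delete node gke-cluster-1-default-pool-777adf16-erra

# On GKE the node VM belongs to a managed instance group, so deleting the
# instance should make the group create a fresh replacement
gcloud compute instances delete gke-cluster-1-default-pool-777adf16-erra --zone us-central1-a   # zone is a placeholder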

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.3", GitCommit:"c6411395e09da356c608896d3d9725acab821418", GitTreeState:"clean", BuildDate:"2016-07-22T20:29:38Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.5", GitCommit:"b0deb2eb8f4037421077f77cb163dbb4c0a2a9f5", GitTreeState:"clean", BuildDate:"2016-08-11T20:21:58Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}

Environment:

Google Cloud Platform Container Engine (GKE)

What happened:

Nodes went offline

What you expected to happen:

Nodes to stay online

How to reproduce it (as minimally and precisely as possible):

Unsure

Anything else we need to know:

The pods that were on those nodes moved to Pending status:

NAME                           READY     STATUS    RESTARTS   AGE
some-api-2437792557-nm63f      0/1       Pending   0          1d
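
To see why the pod stays Pending (a sketch; the pod name is the one from the listing above):

# The Events section at the bottom of describe usually says why scheduling fails,
# e.g. no nodes are available or insufficient CPU
kubectl describe pod some-api-2437792557-nm63f

# Cluster events also show the node going NotReady and the pod being rescheduled
kubectl get events | grep some-api-2437792557-nm63f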


Most upvoted comments

I am running into this issue with my GKE clusters.

I think I solved it. TL;DR: I was overloading the cluster’s CPU (I was using the smallest node size!).
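
A quick way to check that (a sketch; it only assumes kubectl access, and kubectl top needs the cluster’s metrics add-on, which was Heapster on GKE at the time):

# Requested CPU/memory as a percentage of each node's allocatable
kubectl describe nodes | grep -A 6 "Allocated resources"

# Live per-node usage, if the metrics add-on is running
kubectl top nodes

When actual usage pins a node’s CPU like this, the kubelet itself can get starved and stop posting node status, which is what the conditions in the diagnostics below show; bigger nodes, or resource requests/limits that leave some headroom, are the usual fix.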

Environment

  • GKE 1.7.11-gke.1, zone us-west1-a
  • A service with ~8 containers deployed to the cluster
  • The instance size was the smallest allowed for a GKE cluster
  • Left it running overnight

My issues

Diagnostics

  1. Ran kubectl describe node <NotReady node>:
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason                Message
  ----                 ------    -----------------                 ------------------                ------                -------
  KernelDeadlock       False     Mon, 22 Jan 2018 21:04:55 -0800   Mon, 22 Jan 2018 00:13:12 -0800   KernelHasNoDeadlock   kernel has no deadlock
  NetworkUnavailable   False     Sun, 21 Jan 2018 20:23:01 -0800   Sun, 21 Jan 2018 20:23:01 -0800   RouteCreated          RouteController created a route
  OutOfDisk            Unknown   Mon, 22 Jan 2018 11:53:29 -0800   Mon, 22 Jan 2018 11:54:26 -0800   NodeStatusUnknown     Kubelet stopped posting node status.
  MemoryPressure       Unknown   Mon, 22 Jan 2018 11:53:30 -0800   Mon, 22 Jan 2018 11:54:26 -0800   NodeStatusUnknown     Kubelet stopped posting node status.
  DiskPressure         Unknown   Mon, 22 Jan 2018 11:53:30 -0800   Mon, 22 Jan 2018 11:54:26 -0800   NodeStatusUnknown     Kubelet stopped posting node status.
  Ready                Unknown   Mon, 22 Jan 2018 11:53:30 -0800   Mon, 22 Jan 2018 11:54:26 -0800   NodeStatusUnknown     Kubelet stopped posting node status.
Addresses:
  2. SSH’d into the NotReady node and ran sudo journalctl -u kubelet --all | tail (steps 1 and 2 are sketched as a small script after this list):
Jan 23 05:04:00 gke-of-cluster-default-pool-abba87db-4mw2 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Jan 23 05:04:01 gke-of-cluster-default-pool-abba87db-4mw2 systemd[1]: Stopped Kubernetes kubelet.
Jan 23 05:04:01 gke-of-cluster-default-pool-abba87db-4mw2 systemd[1]: Started Kubernetes kubelet.
Jan 23 05:05:08 gke-of-cluster-default-pool-abba87db-4mw2 kubelet[12648]: I0123 05:04:51.215030   12648 feature_gate.go:144] feature gates: map[ExperimentalCriticalPodAnnotation:true]
Jan 23 05:05:08 gke-of-cluster-default-pool-abba87db-4mw2 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Jan 23 05:05:09 gke-of-cluster-default-pool-abba87db-4mw2 systemd[1]: Stopped Kubernetes kubelet.
Jan 23 05:05:09 gke-of-cluster-default-pool-abba87db-4mw2 systemd[1]: Started Kubernetes kubelet.
Jan 23 05:06:14 gke-of-cluster-default-pool-abba87db-4mw2 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Jan 23 05:06:14 gke-of-cluster-default-pool-abba87db-4mw2 systemd[1]: Stopped Kubernetes kubelet.
Jan 23 05:06:14 gke-of-cluster-default-pool-abba87db-4mw2 systemd[1]: Started Kubernetes kubelet.
  3. Went to Stackdriver Logging for the NotReady node’s VM instance and found this interesting message:
jsonPayload: {
  MESSAGE:  "W0123 05:58:13.394667   16675 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d"   
  4. Oh, it looks like I’m overloading the cluster. This is an image of the CPU utilization over the last day: https://i.boring.host/LG1ADVgz.png
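
For anyone else debugging this, steps 1 and 2 can be scripted roughly like so (a sketch only; ZONE is a placeholder for the node pool’s zone, and it assumes gcloud is set up to SSH into the node VMs):

ZONE=us-west1-a   # placeholder: use your node pool's zone

# Walk every node whose STATUS is not plain Ready and dump its conditions,
# plus the last 50 kubelet log lines from the VM itself
for node in $(kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1}'); do
  echo "=== $node ==="
  kubectl describe node "$node" | sed -n '/^Conditions:/,/^Addresses:/p'
  gcloud compute ssh "$node" --zone "$ZONE" --command 'sudo journalctl -u kubelet --no-pager | tail -n 50'
done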