kubernetes: Kubelet stops reporting node status

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):

healthy node notready controller manager kubelet node status


Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.4", GitCommit:"7243c69eb523aa4377bce883e7c0dd76b84709a1", GitTreeState:"clean", BuildDate:"2017-03-08T02:50:34Z", GoVersion:"go1.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.4+coreos.0", GitCommit:"97c11b097b1a2b194f1eddca8ce5468fcc83331c", GitTreeState:"clean", BuildDate:"2017-03-08T23:54:21Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: Azure Standard_A1
  • OS (e.g. from /etc/os-release): NAME="Container Linux by CoreOS" ID=coreos VERSION=1298.6.0 VERSION_ID=1298.6.0 BUILD_ID=2017-03-14-2119 PRETTY_NAME="Container Linux by CoreOS 1298.6.0 (Ladybug)" ANSI_COLOR="38;5;75" HOME_URL="https://coreos.com/" BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
  • Kernel (e.g. uname -a): Linux master-0-vm 4.9.9-coreos-r1 #1 SMP Tue Mar 14 21:09:42 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz GenuineIntel GNU/Linux
  • Install tools: https://github.com/edevil/kubernetes-deployment
  • Others:

What happened:

Nodes are incorrectly marked as NotReady.

What you expected to happen:

Healthy nodes should remain Ready and not be marked NotReady.

How to reproduce it (as minimally and precisely as possible):

I just set up a cluster using the aforementioned install tools and wait; nodes are eventually marked NotReady.

Anything else we need to know:

Controller manager log with -v=4:

I0322 02:35:39.439687       1 nodecontroller.go:713] Node node-1-vm ReadyCondition updated. Updating timestamp.
I0322 02:35:39.439935       1 nodecontroller.go:713] Node node-2-vm ReadyCondition updated. Updating timestamp.
I0322 02:35:39.440085       1 nodecontroller.go:713] Node node-4-vm ReadyCondition updated. Updating timestamp.
I0322 02:35:39.440167       1 nodecontroller.go:713] Node master-1-vm ReadyCondition updated. Updating timestamp.
I0322 02:35:39.540323       1 attach_detach_controller.go:540] processVolumesInUse for node "node-0-vm"
I0322 02:35:40.869608       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:35:41.014154       1 attach_detach_controller.go:540] processVolumesInUse for node "node-3-vm"
I0322 02:35:41.058731       1 reflector.go:392] pkg/controller/informers/factory.go:89: Watch close - *extensions.DaemonSet total 0 items received
I0322 02:35:42.935111       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:35:43.414353       1 reflector.go:392] pkg/controller/garbagecollector/garbagecollector.go:768: Watch close - <nil> total 0 items received
I0322 02:35:43.901573       1 attach_detach_controller.go:540] processVolumesInUse for node "master-0-vm"
I0322 02:35:44.478340       1 nodecontroller.go:713] Node master-0-vm ReadyCondition updated. Updating timestamp.
I0322 02:35:44.478613       1 nodecontroller.go:713] Node node-0-vm ReadyCondition updated. Updating timestamp.
I0322 02:35:44.478780       1 nodecontroller.go:713] Node node-3-vm ReadyCondition updated. Updating timestamp.
I0322 02:35:44.961529       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:35:46.988408       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:35:49.064564       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:35:51.103435       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:35:52.769445       1 reflector.go:273] pkg/controller/resourcequota/resource_quota_controller.go:232: forcing resync
I0322 02:35:53.178768       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:35:54.043888       1 reflector.go:273] pkg/controller/volume/persistentvolume/pv_controller_base.go:455: forcing resync
I0322 02:35:54.043958       1 reflector.go:273] pkg/controller/volume/persistentvolume/pv_controller_base.go:454: forcing resync
I0322 02:35:54.044629       1 reflector.go:273] pkg/controller/volume/persistentvolume/pv_controller_base.go:159: forcing resync
I0322 02:35:54.339298       1 reflector.go:392] pkg/controller/garbagecollector/garbagecollector.go:768: Watch close - <nil> total 0 items received
I0322 02:35:55.236004       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:35:56.013491       1 reflector.go:273] pkg/controller/replication/replication_controller.go:220: forcing resync
I0322 02:35:57.093825       1 reflector.go:273] pkg/controller/endpoint/endpoints_controller.go:164: forcing resync
I0322 02:35:57.098093       1 endpoints_controller.go:338] Finished syncing service "trendex/trendex" endpoints. (3.858837ms)
I0322 02:35:57.098333       1 endpoints_controller.go:338] Finished syncing service "worten-ac-simulator/worten-ac-simulator" endpoints. (4.338892ms)
I0322 02:35:57.104371       1 endpoints_controller.go:338] Finished syncing service "bitshopping/bitshopping" endpoints. (9.914215ms)
I0322 02:35:57.105145       1 endpoints_controller.go:338] Finished syncing service "tv-directory/tv-directory" endpoints. (10.493105ms)
I0322 02:35:57.106587       1 endpoints_controller.go:338] Finished syncing service "roamersapp/roamersapp" endpoints. (2.172777ms)
I0322 02:35:57.107370       1 endpoints_controller.go:338] Finished syncing service "kube-lego/kube-lego-nginx" endpoints. (12.568538ms)
I0322 02:35:57.114755       1 endpoints_controller.go:338] Finished syncing service "tvifttt/tviftttapp" endpoints. (8.124249ms)
I0322 02:35:57.115146       1 endpoints_controller.go:338] Finished syncing service "pixelscamp/pixelscamp" endpoints. (16.788809ms)
I0322 02:35:57.115471       1 endpoints_controller.go:338] Finished syncing service "kube-system/kubernetes-dashboard" endpoints. (10.285392ms)
I0322 02:35:57.115757       1 endpoints_controller.go:338] Finished syncing service "gobrpxio/gobrpxio" endpoints. (17.644516ms)
I0322 02:35:57.123492       1 endpoints_controller.go:338] Finished syncing service "tvifttt/tvifttt" endpoints. (15.700531ms)
I0322 02:35:57.123792       1 endpoints_controller.go:338] Finished syncing service "nginx-ingress/nginx" endpoints. (8.58929ms)
I0322 02:35:57.124067       1 endpoints_controller.go:338] Finished syncing service "raster/raster" endpoints. (8.272004ms)
I0322 02:35:57.125199       1 endpoints_controller.go:338] Finished syncing service "kube-system/kube-dns" endpoints. (10.372793ms)
I0322 02:35:57.125961       1 endpoints_controller.go:338] Finished syncing service "probelyapp-staging/probelyapp" endpoints. (10.440307ms)
I0322 02:35:57.126616       1 endpoints_controller.go:338] Finished syncing service "nginx-ingress/default-http-backend" endpoints. (3.044294ms)
I0322 02:35:57.126671       1 endpoints_controller.go:338] Finished syncing service "default/kubernetes" endpoints. (1.293µs)
I0322 02:35:57.127139       1 endpoints_controller.go:338] Finished syncing service "nosslack/nosslack" endpoints. (3.027092ms)
I0322 02:35:57.133203       1 endpoints_controller.go:338] Finished syncing service "raster/redis-master" endpoints. (9.374898ms)
I0322 02:35:57.133581       1 endpoints_controller.go:338] Finished syncing service "probelyapp/probelyapp" endpoints. (8.277871ms)
I0322 02:35:57.133920       1 endpoints_controller.go:338] Finished syncing service "cathode/cathode" endpoints. (7.910074ms)
I0322 02:35:57.265600       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:35:57.797416       1 gc_controller.go:175] GC'ing orphaned
I0322 02:35:57.797532       1 gc_controller.go:195] GC'ing unscheduled pods which are terminating.
I0322 02:35:59.345887       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:36:01.379773       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:36:03.463614       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:36:05.595912       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:36:06.176893       1 reflector.go:273] pkg/controller/resourcequota/resource_quota_controller.go:229: forcing resync
I0322 02:36:06.184813       1 resource_quota_controller.go:153] Resource quota controller queued all resource quota for full calculation of usage
I0322 02:36:06.485604       1 reflector.go:273] pkg/controller/namespace/namespace_controller.go:212: forcing resync
I0322 02:36:07.013565       1 reflector.go:273] pkg/controller/service/servicecontroller.go:174: forcing resync
I0322 02:36:07.246974       1 reflector.go:273] pkg/controller/disruption/disruption.go:326: forcing resync
I0322 02:36:07.247006       1 reflector.go:273] pkg/controller/podautoscaler/horizontal.go:133: forcing resync
I0322 02:36:07.247013       1 reflector.go:273] pkg/controller/disruption/disruption.go:324: forcing resync
I0322 02:36:07.565950       1 reflector.go:273] pkg/controller/petset/pet_set.go:148: forcing resync
I0322 02:36:07.734745       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:36:07.913692       1 reflector.go:273] pkg/controller/informers/factory.go:89: forcing resync
I0322 02:36:07.914049       1 deployment_controller.go:154] Updating deployment tvifttt-interface
I0322 02:36:07.914095       1 deployment_controller.go:154] Updating deployment nosslack
I0322 02:36:07.916338       1 deployment_controller.go:313] Finished syncing deployment "tvifttt/tvifttt-interface" (2.208096ms)
I0322 02:36:07.918029       1 deployment_controller.go:313] Finished syncing deployment "nosslack/nosslack" (1.633283ms)
I0322 02:36:07.918077       1 deployment_controller.go:154] Updating deployment tv-directory
I0322 02:36:07.918099       1 deployment_controller.go:154] Updating deployment kube-dns-v20
I0322 02:36:07.918904       1 deployment_controller.go:313] Finished syncing deployment "tv-directory/tv-directory" (785.707µs)
I0322 02:36:07.920108       1 deployment_controller.go:313] Finished syncing deployment "kube-system/kube-dns-v20" (1.15902ms)
I0322 02:36:07.920160       1 deployment_controller.go:154] Updating deployment pixelscamp
I0322 02:36:07.920183       1 deployment_controller.go:154] Updating deployment nosslackbot
I0322 02:36:07.920998       1 deployment_controller.go:313] Finished syncing deployment "pixelscamp/pixelscamp" (785.307µs)
I0322 02:36:07.927939       1 deployment_controller.go:313] Finished syncing deployment "nosslack/nosslackbot" (6.89615ms)
I0322 02:36:07.928370       1 deployment_controller.go:154] Updating deployment roamersapp
I0322 02:36:07.928412       1 deployment_controller.go:154] Updating deployment probelyapp
I0322 02:36:07.930166       1 deployment_controller.go:313] Finished syncing deployment "roamersapp/roamersapp" (1.732833ms)
I0322 02:36:07.931101       1 deployment_controller.go:313] Finished syncing deployment "probelyapp/probelyapp" (869.765µs)
I0322 02:36:07.931145       1 deployment_controller.go:154] Updating deployment tvifttt
I0322 02:36:07.931165       1 deployment_controller.go:154] Updating deployment cathode
I0322 02:36:07.958935       1 deployment_controller.go:313] Finished syncing deployment "cathode/cathode" (11.43338ms)
I0322 02:36:07.959052       1 deployment_controller.go:154] Updating deployment gustave
I0322 02:36:07.959138       1 deployment_controller.go:154] Updating deployment default-http-backend
I0322 02:36:07.967064       1 deployment_controller.go:313] Finished syncing deployment "gustave/gustave" (7.875261ms)
I0322 02:36:07.967862       1 deployment_controller.go:313] Finished syncing deployment "nginx-ingress/default-http-backend" (564.318µs)
I0322 02:36:07.967928       1 deployment_controller.go:154] Updating deployment redis-master
I0322 02:36:07.967951       1 deployment_controller.go:154] Updating deployment trendex
I0322 02:36:07.968577       1 deployment_controller.go:313] Finished syncing deployment "raster/redis-master" (593.303µs)
I0322 02:36:07.969519       1 deployment_controller.go:313] Finished syncing deployment "trendex/trendex" (897.351µs)
I0322 02:36:07.969568       1 deployment_controller.go:154] Updating deployment kubernetes-dashboard
I0322 02:36:07.969590       1 deployment_controller.go:154] Updating deployment tviftttapp
I0322 02:36:07.970190       1 deployment_controller.go:313] Finished syncing deployment "kube-system/kubernetes-dashboard" (580.909µs)
I0322 02:36:07.987186       1 deployment_controller.go:313] Finished syncing deployment "tvifttt/tviftttapp" (16.95312ms)
I0322 02:36:07.987283       1 deployment_controller.go:154] Updating deployment elpixel
I0322 02:36:07.987335       1 deployment_controller.go:154] Updating deployment nginx
I0322 02:36:07.988266       1 deployment_controller.go:313] Finished syncing deployment "elpixel/elpixel" (900.75µs)
I0322 02:36:07.988941       1 deployment_controller.go:313] Finished syncing deployment "nginx-ingress/nginx" (629.385µs)
I0322 02:36:07.988995       1 deployment_controller.go:154] Updating deployment gobrpxio
I0322 02:36:07.989014       1 deployment_controller.go:154] Updating deployment probelyapp
I0322 02:36:07.990094       1 deployment_controller.go:313] Finished syncing deployment "gobrpxio/gobrpxio" (1.063468ms)
I0322 02:36:08.005891       1 deployment_controller.go:313] Finished syncing deployment "probelyapp-staging/probelyapp" (15.75485ms)
I0322 02:36:08.005946       1 deployment_controller.go:154] Updating deployment tvifttt-delay
I0322 02:36:08.005969       1 deployment_controller.go:154] Updating deployment bitshopping
I0322 02:36:08.007987       1 deployment_controller.go:313] Finished syncing deployment "tvifttt/tvifttt-delay" (1.986384ms)
I0322 02:36:08.008999       1 deployment_controller.go:313] Finished syncing deployment "bitshopping/bitshopping" (964.6µs)
I0322 02:36:08.009048       1 deployment_controller.go:154] Updating deployment worten-ac-simulator
I0322 02:36:08.009070       1 deployment_controller.go:154] Updating deployment raster
I0322 02:36:08.016803       1 deployment_controller.go:313] Finished syncing deployment "worten-ac-simulator/worten-ac-simulator" (7.713193ms)
I0322 02:36:08.018476       1 deployment_controller.go:313] Finished syncing deployment "raster/raster" (1.619805ms)
I0322 02:36:08.018542       1 deployment_controller.go:154] Updating deployment kube-lego
I0322 02:36:08.019279       1 deployment_controller.go:313] Finished syncing deployment "kube-lego/kube-lego" (699.001µs)
I0322 02:36:08.047248       1 deployment_controller.go:313] Finished syncing deployment "tvifttt/tvifttt" (116.059963ms)
I0322 02:36:08.309727       1 reflector.go:273] pkg/controller/disruption/disruption.go:328: forcing resync
I0322 02:36:08.310259       1 reflector.go:273] pkg/controller/disruption/disruption.go:329: forcing resync
I0322 02:36:09.093723       1 reflector.go:273] pkg/controller/disruption/disruption.go:327: forcing resync
I0322 02:36:09.149389       1 reflector.go:273] pkg/controller/volume/persistentvolume/pv_controller_base.go:454: forcing resync
I0322 02:36:09.149436       1 reflector.go:273] pkg/controller/volume/persistentvolume/pv_controller_base.go:455: forcing resync
I0322 02:36:09.150118       1 reflector.go:273] pkg/controller/volume/persistentvolume/pv_controller_base.go:159: forcing resync
I0322 02:36:09.758741       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:36:10.898701       1 reflector.go:392] pkg/controller/volume/persistentvolume/pv_controller_base.go:455: Watch close - *api.PersistentVolumeClaim total 0 items received
I0322 02:36:11.521060       1 namespace_controller.go:206] Finished syncing namespace "default" (596ns)
I0322 02:36:11.521174       1 namespace_controller.go:206] Finished syncing namespace "roamersapp" (298ns)
I0322 02:36:11.521213       1 namespace_controller.go:206] Finished syncing namespace "worten-ac-simulator" (198ns)
I0322 02:36:11.521233       1 namespace_controller.go:206] Finished syncing namespace "gustave" (199ns)
I0322 02:36:11.521269       1 namespace_controller.go:206] Finished syncing namespace "kube-lego" (199ns)
I0322 02:36:11.521289       1 namespace_controller.go:206] Finished syncing namespace "tv-directory" (198ns)
I0322 02:36:11.521306       1 namespace_controller.go:206] Finished syncing namespace "pixelscamp" (198ns)
I0322 02:36:11.521322       1 namespace_controller.go:206] Finished syncing namespace "bitshopping" (198ns)
I0322 02:36:11.521356       1 namespace_controller.go:206] Finished syncing namespace "trendex" (100ns)
I0322 02:36:11.521375       1 namespace_controller.go:206] Finished syncing namespace "tvifttt" (199ns)
I0322 02:36:11.521391       1 namespace_controller.go:206] Finished syncing namespace "kube-system" (99ns)
I0322 02:36:11.521421       1 namespace_controller.go:206] Finished syncing namespace "elpixel" (199ns)
I0322 02:36:11.521439       1 namespace_controller.go:206] Finished syncing namespace "probelyapp-staging" (199ns)
I0322 02:36:11.521456       1 namespace_controller.go:206] Finished syncing namespace "nginx-ingress" (198ns)
I0322 02:36:11.521472       1 namespace_controller.go:206] Finished syncing namespace "raster" (199ns)
I0322 02:36:11.521501       1 namespace_controller.go:206] Finished syncing namespace "cathode" (199ns)
I0322 02:36:11.521518       1 namespace_controller.go:206] Finished syncing namespace "gobrpxio" (199ns)
I0322 02:36:11.521534       1 namespace_controller.go:206] Finished syncing namespace "probelyapp" (199ns)
I0322 02:36:11.521550       1 namespace_controller.go:206] Finished syncing namespace "nosslack" (199ns)
I0322 02:36:11.809629       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:36:13.899391       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:36:15.927454       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:36:17.371577       1 attach_detach_controller.go:540] processVolumesInUse for node "node-3-vm"
I0322 02:36:17.938641       1 gc_controller.go:175] GC'ing orphaned
I0322 02:36:17.938682       1 gc_controller.go:195] GC'ing unscheduled pods which are terminating.
I0322 02:36:17.985692       1 leaderelection.go:203] succesfully renewed lease kube-system/kube-controller-manager
I0322 02:36:18.106790       1 attach_detach_controller.go:540] processVolumesInUse for node "node-0-vm"
I0322 02:36:19.006722       1 reflector.go:392] pkg/controller/volume/persistentvolume/pv_controller_base.go:159: Watch close - *storage.StorageClass total 0 items received
I0322 02:36:19.525498       1 attach_detach_controller.go:540] processVolumesInUse for node "master-0-vm"
I0322 02:36:19.738858       1 nodecontroller.go:738] node node-4-vm hasn't been updated for 40.298744291s. Last ready condition is: {Type:Ready Status:True LastHeartbeatTime:2017-03-22 02:35:34 +0000 UTC LastTransitionTime:2017-03-20 09:52:57 +0000 UTC Reason:KubeletReady Message:kubelet is posting ready status}
I0322 02:36:19.738976       1 nodecontroller.go:765] node node-4-vm hasn't been updated for 40.29886588s. Last out of disk condition is: &{Type:OutOfDisk Status:False LastHeartbeatTime:2017-03-22 02:35:34 +0000 UTC LastTransitionTime:2017-03-20 09:52:57 +0000 UTC Reason:KubeletHasSufficientDisk Message:kubelet has sufficient disk space available}

Relevant example lines:

I0322 02:35:39.440085       1 nodecontroller.go:713] Node node-4-vm ReadyCondition updated. Updating timestamp.
I0322 02:36:19.738858       1 nodecontroller.go:738] node node-4-vm hasn't been updated for 40.298744291s. Last ready condition is: {Type:Ready Status:True LastHeartbeatTime:2017-03-22 02:35:34 +0000 UTC LastTransitionTime:2017-03-20 09:52:57 +0000 UTC Reason:KubeletReady Message:kubelet is posting ready status}

Node-4-vm kubelet:

Mar 22 02:35:40 node-4-vm kubelet-wrapper[996]: I0322 02:35:40.349624     996 operation_executor.go:917] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/99a2bd45-0da7-11e7-9a8a-000d3a2709aa-default-token-k195j" (spec.Name: "default-token-k195j") pod "99a2bd45-0da7-11e7-9a8a-000d3a2709aa" (UID: "99a2bd45-0da7-11e7-9a8a-000d3a2709aa").
Mar 22 02:35:40 node-4-vm kubelet-wrapper[996]: I0322 02:35:40.349755     996 operation_executor.go:917] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/d90c7bb4-0dab-11e7-9a8a-000d3a2709aa-default-token-2bkr0" (spec.Name: "default-token-2bkr0") pod "d90c7bb4-0dab-11e7-9a8a-000d3a2709aa" (UID: "d90c7bb4-0dab-11e7-9a8a-000d3a2709aa").
Mar 22 02:35:42 node-4-vm kubelet-wrapper[996]: I0322 02:35:42.356358     996 operation_executor.go:917] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/fdfc2d54-d0f2-11e6-b156-000d3a2709aa-default-token-7ffjg" (spec.Name: "default-token-7ffjg") pod "fdfc2d54-d0f2-11e6-b156-000d3a2709aa" (UID: "fdfc2d54-d0f2-11e6-b156-000d3a2709aa").
Mar 22 02:35:42 node-4-vm kubelet-wrapper[996]: I0322 02:35:42.358097     996 operation_executor.go:917] MountVolume.SetUp succeeded for volume "kubernetes.io/configmap/fdfc2d54-d0f2-11e6-b156-000d3a2709aa-config-volume" (spec.Name: "config-volume") pod "fdfc2d54-d0f2-11e6-b156-000d3a2709aa" (UID: "fdfc2d54-d0f2-11e6-b156-000d3a2709aa").
Mar 22 02:36:39 node-4-vm kubelet-wrapper[996]: E0322 02:36:39.085953     996 kubelet_node_status.go:302] Error updating node status, will retry: Operation cannot be fulfilled on nodes "node-4-vm": the object has been modified; please apply your changes to the latest version and try again

That kubelet error suggests the kubelet had been updating the node status correctly, but another client (the controller manager) modified the Node object in the meantime, so the update failed the optimistic-concurrency check and had to be retried. I don't see any other relevant information in the logs of the kubelets, API server, or etcd nodes. Since several nodes were marked NotReady at the same time, it looks like something is wrong in the controller manager itself.

I did not change the defaults: kubelets update their node status every 10s, and the controller manager waits 40s for those updates before considering a node unresponsive.
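
For reference, those two defaults correspond to the following flags (a sketch based on the standard kubelet and kube-controller-manager options, not copied from this cluster's manifests):

    # kubelet (default): how often node status is posted to the API server
    - --node-status-update-frequency=10s
    # kube-controller-manager (default): how long the node controller waits
    # for a status update before marking the node NotReady
    - --node-monitor-grace-period=40s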

Most upvoted comments

@edevil Not exclusive to Azure - we see it on AWS.

As discussed in https://github.com/Azure/acs-engine/issues/863#issuecomment-338576088, the bug has been gone since Saturday night (France time). Can anyone confirm here?

Can anyone tell me the right way to work around this issue? So far this is the only comment I've seen that offers a solution:

This is killing us on Azure. It happens daily to 1-3 nodes for 10-30 minutes, and the really wild part is that there is nothing special in the logs. The kubelet posts regular /status updates that get 200 responses from the apiserver during the whole time, yet when I check the logs or kubectl describe node I get "kubelet stopped sending statuses" or something similar. This lasts for 10-20 minutes, during which everything seems to work on the affected node: pods, network communication, memory, and CPU are all fine.

Since we have redundancy and don't need a quick reaction time, we avoided this issue by making the cluster less sensitive:

    - --node-monitor-grace-period=30m
    - --pod-eviction-timeout=15m

But I am not really sure about the negative effects of setting those two params.
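
For anyone who wants to try this, a minimal sketch of where those two flags could go in a kube-controller-manager static-pod manifest (the file path, image tag, and kubeconfig flag are assumptions; adapt them to your own deployment):

    # /etc/kubernetes/manifests/kube-controller-manager.yaml (path assumed)
    apiVersion: v1
    kind: Pod
    metadata:
      name: kube-controller-manager
      namespace: kube-system
    spec:
      hostNetwork: true
      containers:
      - name: kube-controller-manager
        image: quay.io/coreos/hyperkube:v1.5.4_coreos.0   # image/tag assumed
        command:
        - /hyperkube
        - controller-manager
        - --kubeconfig=/etc/kubernetes/kubeconfig         # assumed
        # Tolerate longer gaps in kubelet heartbeats before marking a node NotReady
        - --node-monitor-grace-period=30m
        # Wait longer before evicting pods from a node that went NotReady
        - --pod-eviction-timeout=15m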

@petergardfjall … this clearly sounds like an Azure problem; we have the same problem in the same datacenter on multiple clusters, starting at the same time. @colemickens @brendanburns, do you have any updates or info on this?

I am opening a ticket with Azure support right now. Stay tuned.

In our case the problem was overloaded nodes and pods without limits:

  • Nodes are working near full capacity.
  • A pod with heavy CPU usage and no CPU limits gets launched.
  • The node is busy serving that particular pod and does not have enough CPU cycles left for the Kubernetes health checks.
  • Eventually the OS kills the Docker process responsible for the heavy usage, and everything goes back to normal.

So what worked for us - limits on every single pod, and no more issues like that.
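
For illustration, a minimal sketch of what such per-container limits look like in a pod spec (the pod name, image, and values are made up, not taken from the commenter's cluster):

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app               # hypothetical pod
    spec:
      containers:
      - name: app
        image: example.com/app:1.0    # hypothetical image
        resources:
          requests:
            cpu: 250m
            memory: 128Mi
          limits:
            cpu: "1"                  # cap CPU so the node keeps headroom for kubelet heartbeats
            memory: 256Mi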

Yes, but this is also seen at scale in situations where the cloud provider isn't the problem, right? So it's a general problem that the heartbeat is hard to reason about, and error handling for cloud providers is definitely part of that.

We have the same problem in Azure.

After setting the node controller logs to -v=4, these messages started showing up in the logs:

Reason:NodeStatusUnknown Message:Kubelet stopped posting node status.

All our pods are being killed, sometimes several times a day…