kubernetes: Pods on node with temporary unknown status never marked ready again
Summary
When a node's kubelet is unable to heartbeat with the apiserver for long enough (longer than the node-monitor-grace-period), the node lifecycle controller sets the node's Ready condition to Unknown and immediately adds the following taint to the node:
- effect: NoSchedule
  key: node.kubernetes.io/unreachable
  timeAdded: "2019-08-03T00:34:24Z"
And immediately marks all pods on the node as NotReady (this is done by MarkAllPodsNotReady in the node lifecycle controller).
5 seconds later (this is the frequency of the node lifecycle controller's node health checking loop), the following taint is added:
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  timeAdded: "2019-08-03T00:34:30Z"
Which evicts most pods from the node. However, in some cases the node heartbeats after the first taint is added but before the second one is added. In that case the pods are not evicted, but they stay NotReady even though all of their containers are ready and the node they are running on is healthy (Ready) again. This state never recovers without manual intervention: either marking the pod ready by hand, or deleting it if there is a controller that will bring it back.
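For illustration, a pod stuck in this state shows a Ready condition of False while every container reports ready. The status fragment below is a sketch with assumed names and timestamps, not output captured from an affected cluster:
status:
  conditions:
  - type: Ready
    status: "False"
    lastTransitionTime: "2019-08-03T00:34:24Z"
  containerStatuses:
  - name: coredns
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2019-08-02T22:10:11Z"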
The expected behavior is that if a pod is scheduled on a healthy node and all of its containers are ready, the pod is marked Ready, regardless of edge cases in the node's lifecycle transitions.
Repro
One way to repro this is to make a node fail to heartbeat for long enough that it goes into Unknown status, but not long enough that the NoExecute taint is added. I did this by breaking communication between the kubelet and the apiserver with some iptables hacks:
./break-network.sh
#!/bin/bash
# ips the kubelet uses to talk to the apiserver
ips=""
for ip in $ips; do
  echo "$ip"
  iptables -I OUTPUT -d "$ip" -m conntrack --ctstate NEW,RELATED,ESTABLISHED -j DROP
done
and a hacky script to restore connectivity, ./restore-network.sh
#!/bin/bash
# ips blocked by break-network.sh, to be restored
ips=""
for ip in $ips; do
  echo "$ip"
  # delete the exact rule that break-network.sh inserted for this ip
  iptables -D OUTPUT -d "$ip" -m conntrack --ctstate NEW,RELATED,ESTABLISHED -j DROP
done
Note that stopping the kubelet, waiting, then starting the kubelet isn’t a good way to repro this because when the kubelet starts up it repairs the broken (not ready) pods.
The timing of this is pretty hard to get right because, unless you tolerate the NoExecute taint, 5 seconds after the node becomes unreachable the NoExecute taint gets added and the pod gets evicted. So I wrote this script that heartbeats on behalf of the node the moment it becomes Unreachable, to simulate the kubelet doing it. After this fake heartbeat, fix the kubelet's network and stop the script, and you'll see the pod never becomes ready again.
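That script isn't reproduced above; the following is a minimal sketch of the same idea, not the original. It assumes kubectl >= 1.24 (for --subresource=status) and takes the node name as its first argument: it watches the node's Ready condition and, the moment it flips to Unknown, patches a fresh heartbeat into the node status so the NoExecute taint is never added.
#!/bin/bash
# fake-heartbeat.sh -- illustrative sketch, not the script from the report
node="$1"
while true; do
  ready=$(kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  if [ "$ready" = "Unknown" ]; then
    now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    # pretend to be the kubelet: flip the Ready condition back and bump its heartbeat time
    kubectl patch node "$node" --subresource=status --type=strategic \
      -p "{\"status\":{\"conditions\":[{\"type\":\"Ready\",\"status\":\"True\",\"lastHeartbeatTime\":\"$now\"}]}}"
  fi
  sleep 1
done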
You can also repro this by simply tolerating the NoExecute taint (see the example below), then causing the heartbeat to the apiserver to fail, then fixing it, and observing that the pod never becomes ready again. I wanted to include the other repro because it's more realistic.
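For reference, the toleration that keeps a test pod on the node through the NoExecute taint looks like this (added to the pod spec); without a tolerationSeconds it tolerates the taint indefinitely:
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute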
Impact
This occurred in production for us and led to a customer outage. In our case, a node was intermittently unable to heartbeat, and the failure cascaded in this way into a full cluster DNS outage across all 3 CoreDNS pods we were running.
Versions
k8s: 1.14.3
About this issue
- State: closed
- Created 5 years ago
- Reactions: 30
- Comments: 55 (19 by maintainers)
Commits related to this issue
- fix issue #80968 Pods on node with temporary unknown status never marked ready again — committed to ZhengRongTan/kubernetes by ZhengRongTan 4 years ago
It looks like we have the same problem with 1.15.3.
Maybe the following two snippets help other people to find their affected pods faster.
In case you want to alert based on this until there is a solution:
Or, if you do not have access to Prometheus, you can use jq:
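The original snippets aren't reproduced above; as an illustration, a jq query along these lines lists Running pods whose Ready condition is False even though every container reports ready (it may also match unrelated not-ready pods, e.g. ones with readiness gates):
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(.status.phase == "Running")
  | select(any(.status.conditions[]?; .type == "Ready" and .status == "False"))
  | select(all(.status.containerStatuses[]?.ready; .))
  | "\(.metadata.namespace)/\(.metadata.name)"'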
I have not been able to reproduce it with the script. Has anyone else had the same experience?
/remove-lifecycle stale
The patch in PR https://github.com/kubernetes/kubernetes/pull/83455 fixes this issue for us.
As a temporary measure, I just hacked this https://github.com/multi-io/kube-pod-update-status to fix pods broken by this in a cluster (without having to restart them all). Use with care.
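Not that tool, but a rough sketch of the same idea, assuming kubectl >= 1.24 for --subresource=status: pipe "namespace/pod" lines (for example from the jq query in the comment above) into a loop that flips the Ready condition back to True. Use with care.
#!/bin/bash
# Illustrative sketch: reads "namespace/pod" lines on stdin and sets each pod's
# Ready condition back to True. Assumes kubectl >= 1.24 (--subresource=status).
now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
while IFS=/ read -r ns name; do
  kubectl patch pod "$name" -n "$ns" --subresource=status --type=strategic \
    -p "{\"status\":{\"conditions\":[{\"type\":\"Ready\",\"status\":\"True\",\"lastTransitionTime\":\"$now\"}]}}"
done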
This occurred again in Kubernetes 1.16.2
@tedyu my guess is that the kubelet should reevaluate pod readiness when the node becomes ready again. Does that sound right?
Reason being, the kube-controller-manager has the logic you pointed out to explicitly MarkAllPodsNotReady when the node becomes unresponsive, but nothing appears to go back and decide whether the pods should become ready again once the node comes back.