kubernetes: Pods on node with temporary unknown status never marked ready again
Summary
When a node's kubelet is unable to heartbeat with the apiserver for long enough (longer than the node-monitor-grace-period), the node lifecycle controller sets the node's Ready condition to Unknown and immediately adds the following taint to the node:
- effect: NoSchedule
  key: node.kubernetes.io/unreachable
  timeAdded: "2019-08-03T00:34:24Z"
And immediately marks all pods on the node as NotReady (this is done by MarkAllPodsNotReady in the node lifecycle controller).
5 seconds later (this is the frequency of the node lifecycle controller's node health checking loop), the following taint is added:
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  timeAdded: "2019-08-03T00:34:30Z"
Which evicts most pods from the node. However, in some cases the node heartbeats after the first taint is added but before the second one is added. In that case the pods are not evicted, but they stay NotReady even though all of their containers are ready and the node they are running on is healthy (Ready) again. This state never recovers without manual intervention: either marking the pod ready by hand, or deleting it if there is a controller that will bring it back.
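For illustration, a pod stuck in this state shows a Ready condition of False while every container reports ready. The status fragment below is a sketch with assumed names and timestamps, not output captured from an affected cluster:
status:
  conditions:
  - type: Ready
    status: "False"
    lastTransitionTime: "2019-08-03T00:34:24Z"
  containerStatuses:
  - name: coredns
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2019-08-02T22:10:11Z"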
The expected behavior is that if a pod is scheduled on a healthy node and all of its containers are ready, the pod is marked Ready, regardless of edge cases in the node's lifecycle transitions.
Repro
One way to repro this is to make a node fail to heartbeat for long enough that it goes into Unknown status, but not long enough that the NoExecute taint is added. I did this by breaking communication between the kubelet and the apiserver with some iptables hacks:
./break-network.sh
#!/bin/bash
# ips the kubelet uses to talk to the apiserver
ips=""
for ip in $ips; do
  echo "$ip"
  iptables -I OUTPUT -d "$ip" -m conntrack --ctstate NEW,RELATED,ESTABLISHED -j DROP
done
and a hacky script to restore connectivity, ./restore-network.sh
#!/bin/bash
# ips blocked by break-network.sh, to be restored
ips=""
for ip in $ips; do
  echo "$ip"
  # delete the exact rule that break-network.sh inserted for this ip
  iptables -D OUTPUT -d "$ip" -m conntrack --ctstate NEW,RELATED,ESTABLISHED -j DROP
done
Note that stopping the kubelet, waiting, then starting the kubelet isn’t a good way to repro this because when the kubelet starts up it repairs the broken (not ready) pods.
The timing of this is pretty hard to get right because, unless you tolerate the NoExecute taint, 5 seconds after the node becomes unreachable the NoExecute taint gets added and the pod gets evicted. So I wrote this script that heartbeats on behalf of the node the moment it becomes Unreachable, to simulate the kubelet doing it. After this fake heartbeat, fix the kubelet's network and stop the script, and you'll see the pod never becomes ready again.
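That script isn't reproduced above; the following is a minimal sketch of the same idea, not the original. It assumes kubectl >= 1.24 (for --subresource=status) and takes the node name as its first argument: it watches the node's Ready condition and, the moment it flips to Unknown, patches a fresh heartbeat into the node status so the NoExecute taint is never added.
#!/bin/bash
# fake-heartbeat.sh -- illustrative sketch, not the script from the report
node="$1"
while true; do
  ready=$(kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  if [ "$ready" = "Unknown" ]; then
    now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    # pretend to be the kubelet: flip the Ready condition back and bump its heartbeat time
    kubectl patch node "$node" --subresource=status --type=strategic \
      -p "{\"status\":{\"conditions\":[{\"type\":\"Ready\",\"status\":\"True\",\"lastHeartbeatTime\":\"$now\"}]}}"
  fi
  sleep 1
done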
You can also repro this by simply tolerating the NoExecute taint (see the example below), then causing the heartbeat to the apiserver to fail, then fixing it, and observing that the pod never becomes ready again. I wanted to include the other repro because it's more realistic.
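For reference, the toleration that keeps a test pod on the node through the NoExecute taint looks like this (added to the pod spec); without a tolerationSeconds it tolerates the taint indefinitely:
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute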
Impact
This occurred in production for us and led to a customer outage. In our case, a node was intermittently unable to heartbeat, and the failure cascaded in this way into a full cluster DNS outage across all 3 CoreDNS pods we were running.
Versions
k8s: 1.14.3
About this issue
- State: closed
- Created 5 years ago
- Reactions: 30
- Comments: 55 (19 by maintainers)
Commits related to this issue
- fix issue #80968 Pods on node with temporary unknown status never marked ready again — committed to ZhengRongTan/kubernetes by ZhengRongTan 4 years ago
It looks like we have the same problem with 1.15.3.
Maybe the following two snippets help other people to find their affected pods faster.
In case you want to alert based on this until there is a solution:
Or, if you do not have access to Prometheus, you can use jq:
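The original snippets aren't reproduced above; as an illustration, a jq query along these lines lists Running pods whose Ready condition is False even though every container reports ready (it may also match unrelated not-ready pods, e.g. ones with readiness gates):
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(.status.phase == "Running")
  | select(any(.status.conditions[]?; .type == "Ready" and .status == "False"))
  | select(all(.status.containerStatuses[]?.ready; .))
  | "\(.metadata.namespace)/\(.metadata.name)"'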
I have not been able to reproduce it with the script. Has anyone else had the same experience?
/remove-lifecycle stale
The patch in PR https://github.com/kubernetes/kubernetes/pull/83455 fixes this issue for us.
As a temporary measure, I just hacked this https://github.com/multi-io/kube-pod-update-status to fix pods broken by this in a cluster (without having to restart them all). Use with care.
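Not that tool, but a rough sketch of the same idea, assuming kubectl >= 1.24 for --subresource=status: pipe "namespace/pod" lines (for example from the jq query in the comment above) into a loop that flips the Ready condition back to True. Use with care.
#!/bin/bash
# Illustrative sketch: reads "namespace/pod" lines on stdin and sets each pod's
# Ready condition back to True. Assumes kubectl >= 1.24 (--subresource=status).
now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
while IFS=/ read -r ns name; do
  kubectl patch pod "$name" -n "$ns" --subresource=status --type=strategic \
    -p "{\"status\":{\"conditions\":[{\"type\":\"Ready\",\"status\":\"True\",\"lastTransitionTime\":\"$now\"}]}}"
done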
This occurred again in Kubernetes 1.16.2
@tedyu my guess is that the kubelet should reevaluate pod readiness when the node becomes ready again. Does that sound right?
Reason being, the kube-controller-manager has the logic you pointed out to explicitly MarkAllPodsNotReady when the node becomes unresponsive, but nothing appears to go back and decide whether the pods should become ready again once the node comes back.