kubernetes: Sometimes Liveness/Readiness Probes fail because of net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
What happened: In my cluster the readiness probes sometimes fail, but the application works fine.
Apr 06 18:15:14 kubenode** kubelet[34236]: I0406 18:15:14.056915 34236 prober.go:111] Readiness probe for "default-nginx-daemonset-4g6b5_default(a3734646-77fd-11ea-ad94-509a4c9f2810):nginx" failed (failure): Get http://172.18.123.127:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
What you expected to happen: Successful Readiness Probe.
How to reproduce it (as minimally and precisely as possible): We have a few clusters with different workloads. We see this issue only in the cluster with a big number of short-lived pods, and not on all of its nodes. We can't reproduce this error on our other clusters (which have the same configuration but a different workload). How did I find the issue? I deployed a daemonset:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: default-nginx-daemonset
  namespace: default
  labels:
    k8s-app: default-nginx
spec:
  selector:
    matchLabels:
      name: default-nginx
  template:
    metadata:
      labels:
        name: default-nginx
    spec:
      tolerations:
      - operator: Exists
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          limits:
            cpu: "1"
            memory: "1Gi"
          requests:
            cpu: "1"
            memory: "1Gi"
        readinessProbe:
          httpGet:
            path: /
            port: 80
Then I started to watch events on all pods of this daemonset. After a while I received the following events:
Warning Unhealthy 110s (x5 over 44m) kubelet, kubenode20 Readiness probe failed: Get http://172.18.122.143:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 11m (x3 over 32m) kubelet, kubenode10 Readiness probe failed: Get http://172.18.65.57:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
....
Those events were on ~50% of the pods of this daemonset.
On the nodes where the pods with failed probes were placed, I collected the kubelet logs. There were errors like:
Apr 06 14:08:35 kubenode20 kubelet[10653]: I0406 14:08:35.464223 10653 prober.go:111] Readiness probe for "default-nginx-daemonset-nkwkf_default(90a3883b-77f3-11ea-ad94-509a4c9f2810):nginx" failed (failure): Get http://172.18.122.143:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I thought that maybe the nginx in the pod really does respond slowly sometimes. To exclude this theory, I created a short script that curls the application in the pod and stores the response time in a file:
while true; do curl http://172.18.122.143:80/ -s -o /dev/null -w "%{time_starttransfer}\n" >> /tmp/measurment.txt; done;
I ran this script for 30 minutes on the node where the pod is placed and got the following:
$ cat /tmp/measurment.txt | sort -u
0.000
0.001
0.002
0.003
0.004
0.005
0.006
0.007
$ cat /tmp/measurment.txt | wc -l
482670
There were 482670 measurements and the longest response time was 0.007 seconds.
In the pod logs there are only messages with response code 200 (from my requests and from the readiness probes):
[06/Apr/2020:14:06:30 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
......
[06/Apr/2020:14:08:35 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
[06/Apr/2020:14:08:35 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
[06/Apr/2020:14:08:35 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
......
[06/Apr/2020:14:08:41 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
This means that some of the probes are successful.
Then I stopped the curl script (because of the big number of logs) and waited until a new failed-probe error appeared in the kubelet logs.
Apr 06 18:15:14 kubenode13 kubelet[34236]: I0406 18:15:14.056915 34236 prober.go:111] Readiness probe for "default-nginx-daemonset-4g6b5_default(a3734646-77fd-11ea-ad94-509a4c9f2810):nginx" failed (failure): Get http://172.18.123.127:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
And in the logs of that nginx pod I didn't find the request generated by this probe:
kubectl logs default-nginx-daemonset-4g6b5 | grep "15:15"
[06/Apr/2020:18:15:00 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:20 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:30 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:40 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:50 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
If I restart the kubelet, the error doesn't disappear. Does anyone have any suggestions about this?
Environment:
- Kubernetes version: 1.12.1
- Cloud provider or hardware configuration: hardware
- OS (e.g. cat /etc/os-release): Ubuntu 16.04
- Kernel (e.g. uname -a): 4.15.0-66-generic
- Install tools:
- Network plugin and version (if this is a network-related bug): calico:v3.1.3
Seems like the problem appears in many different installations - https://github.com/kubernetes/kubernetes/issues/51096
/sig network
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 139
- Comments: 149 (39 by maintainers)
Commits related to this issue
- Allow for probe timeouts to be configurable When setting up a new Teleport enterprise cluster on GCP, I noticed that I needed to set the probe timeouts to get the cluster to be helthy. This seems to ... — committed to hmadison/teleport by hmadison 2 years ago
- Allow for probe timeouts to be configurable When setting up a new Teleport enterprise cluster on GCP, I noticed that I needed to set the probe timeouts to get the cluster to be helthy. This seems to ... — committed to gravitational/teleport by hmadison 2 years ago
- helm: Allow probe timeouts to be configurable (buddy merge of #11176) (#11396) * Allow for probe timeouts to be configurable When setting up a new Teleport enterprise cluster on GCP, I noticed th... — committed to gravitational/teleport by webvictim 2 years ago
- helm: Allow probe timeouts to be configurable (buddy merge of #11176) (#11396) * Allow for probe timeouts to be configurable When setting up a new Teleport enterprise cluster on GCP, I noticed that ... — committed to gravitational/teleport by webvictim 2 years ago
- [v9] helm: Backports (#11728) * helm: Buddy merge for #11493 (#11693) * examples/chart/teleport-cluster/templates/deployment.yaml: set strategy to 'Recreate' if chartMode is 'standalone' (fixes #1... — committed to gravitational/teleport by webvictim 2 years ago
Our production issues are now fully resolved.
Thank you everyone for sharing your solutions!!!
We’re running on latest EKS, K8s version 1.20.
Our solution
1. Replaced httpGet probes with basic exec + curl (see below). The httpGet probes were sometimes not fast enough to check the health status of a pod. This makes k8s "think" that a few pods are unhealthy, so k8s kills them and spins up new ones to replace them. This makes the problem worse, because while a few pods are unhealthy the rest have to bear the load. On top of that, the pod startup is CPU intensive in our case; it takes about 45 seconds (I know - slow). An exec probe runs in an existing running pod/container, so it's very efficient and lightweight.
2. Removed the livenessProbe, tweaked the timings of the readinessProbe to continuously monitor pod health, and used a startupProbe instead of the readinessProbe for startup. I thought the readinessProbe was only used for pod startup, and that once the pod is started and healthy the livenessProbe takes over and monitors its "liveness" status. It turns out I was completely wrong: https://stackoverflow.com/questions/65858309/why-do-i-need-3-different-kind-of-probes-in-kubernetes-startupprobe-readinessp. startupProbe is what I needed to wait for the pod to start up - duh. readinessProbe is used to continuously monitor the pod and make sure it is "ready" to receive HTTP requests (in our case). livenessProbe is only needed when the application is incapable of dying on its own (Erlang, Elixir?). Our app is Rails, so if it dies the container process is killed with a non-zero status code. In this case K8s simply starts a new pod automatically, no need to check "liveness".
Health check probes
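For illustration, a minimal sketch of this kind of setup; the image name, port, /health endpoint, and timings below are assumptions for the example, not the commenter's actual values:

apiVersion: v1
kind: Pod
metadata:
  name: rails-app-probe-example        # hypothetical name
spec:
  containers:
  - name: app
    image: example/rails-app:latest    # hypothetical image; it must contain curl
    ports:
    - containerPort: 3000              # assumed app port
    startupProbe:                      # gates the slow (~45s) startup
      exec:
        command: ["curl", "-fsS", "-o", "/dev/null", "http://localhost:3000/health"]
      periodSeconds: 5
      failureThreshold: 30             # up to 30 * 5s = 150s allowed to start
    readinessProbe:                    # keeps checking the pod is "ready" for traffic
      exec:
        command: ["curl", "-fsS", "-o", "/dev/null", "http://localhost:3000/health"]
      periodSeconds: 10
      timeoutSeconds: 5
    # no livenessProbe: if the app process dies, the container exits with a
    # non-zero status and Kubernetes restarts it anyway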
The -o /dev/null is needed to redirect the page HTML, so that we don't pollute our logs with it.
Conclusion
It’s great to see the system purring peacefully, even under high load:
Any update on this? We have been facing similar issues for a few weeks.
Adding to this: (@cseifert-betterdoc and I both work on the same issue within our K8s / EKS cluster, so the version etc. posted apply to this)
I think we have proof that there is a networking issue between the kubelet and the container.
Our Problem
We have had problems with our cluster, as it restarted containers seemingly at random, giving failing health checks as the reason.
Of course, we started to dig into why our container fails (as we all know, "SQL is not broken"). After a lot of deep and thorough digging into metrics and logs, we came to the conclusion that our container (running a Rails-based web app) is just fine.
What we did
We concluded that our network setup is somehow flawed (which is strange, given that we use EKS for a reason…), so we switched the health check to a simple exec command to see if this mitigates the problem. We left the readiness endpoint alone, so it is still using HTTP.
What we see now
Now, we see readiness probe failures, but no health check failures any more.
In addition, we can see that our app is happily serving HTTP requests, while we see times where the readiness endpoint is not called.
Our conclusion (so far)
Our view on this:
This means that the kubelet, for some reason, is not able to send readiness probe HTTP requests to the container for some seconds, and then starts to get them through later. At the same time, requests coming in from an external load balancer are still served.
As far as we understand, the kubelet process runs on the node, so this looks like the networking for "node-local" requests is unreliable. This looks like a serious issue to us, as the health checks killed our app without good reason, repeatedly.
I understand that this problem looks a lot like people are using K8S for unstable apps and blame the platform instead of doing their homework — but, as far as I can tell, what we see proves that this is not the case for us and that there is a problem.
It would be great to get some feedback on this issue – let me know if there is a misguided thought. This is complex stuff, after all. 😃
We have replaced the httpGet with an exec doing basically the same thing, which so far works without any problems.
Observed the same issue with k8s v1.22.1 (datacenter, RHEL 8.3); setting timeoutSeconds: 10 resolved our issue.
We have reviewed our findings on this, and here is my update:
- The gaps we saw in the readiness calls were not caused by the kubelet; the gaps were due to some unrelated issue with our logging back end.
- We have had no outages any more since switching the health checks to an exec based health check, still hinting at a problem with HTTP related checks.
- Also, we have switched from a namedPort to a simple port, but still see the readiness endpoint failing, so this does not influence the failure of HTTP probes for us.
- We will now switch to exec based readiness probes and post if this helps.
Sorry for the noise, but I hope this helps to clarify the situation for people having this problem.
@Snehil03 - we just moved to use the exec probe to replace the httpGet; here is how it looks for one of our containers. It eliminated the error for us.
Also, I suggest renaming this issue from "Sometimes" to "VERY frequently", which a) reflects reality better and b) might move this issue up the priority list, because I find it kind of annoying that Kubernetes probes have been so flaky and unreliable for such a long time now without anyone making any real progress here.
@SergeyKanzhelev @dchen1107 @derekwaynecarr @mrunalp
It seems really problematic if users are abandoning liveness probes because they are not reliable. We should not ignore this.
Do we need to find a way to give the probers more CPU scheduling time? Higher realtime prio? Maybe a distinct process from kubelet?
We recently upgraded multiple k8s clusters to v1.20 and are now having this issue occur at a much higher rate than previously (on v1.18).
^You can see exactly when we started the upgrades this week. Those are k8s events that match context deadline exceeded.
Yes, facing the same problem with a 1.20.2 cluster.
/triage accepted
Are you sure the application doesn't hit its resource limits?
In my case, the application starts fine, then the container uses more and more resources until it hits the limit. After that the readiness probes fail.
We are experiencing the same issue on v1.21.12. Even if the exec workaround mentioned above works, it comes with the downside that not all application containers have curl installed. Some of them might be distroless, and some might be upstream containers that we don't own and can't easily install another tool like curl into. So it would ultimately still be nice to find out why the kubelet runs into these issues.
What happened: After updating EKS from v1.14.9-eks-f459c0 to v1.15.11-eks-065dce, some of our ReplicaSet and DaemonSet Liveness/Readiness probes started failing with
Liveness probe failed: Get http://xxx.xxx.xxx.xxx:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers
It seems that random pods are affected. I can successfully curl to them from other pods.
All nodes and pods are fine, no CPU/Memory/Storage issues.
Two nodes are running 1.15 and the rest 1.14. There doesn’t seem to be a correlation between node version and the issue.
Nothing obvious in kubectl cluster-info dump or journalctl -r or kube-proxy.log.
Probes are mostly configured like this (truncated):
I will increase the timeouts but it would be good to solve this. What changed in 1.15?
What you expected to happen: Liveness/Readiness probes not to fail.
How to reproduce it (as minimally and precisely as possible): Upgrade EKS 1.14 to 1.15
Environment:
- Kubernetes version (kubectl version): 1.14 / 1.15
- OS (cat /etc/os-release):
I did 1 and 2 XD #115143
For me it worked after increasing livenessProbe.timeoutSeconds to 15 seconds, and the liveness probe errors stopped. It seems like my app takes a lot of time to respond with HTTPS, a slow network, and low CPU resources. Thanks for the suggestion.
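For illustration, the kind of change described above might look like the sketch below; the endpoint, port, and scheme are assumptions, raising timeoutSeconds is the point:

livenessProbe:
  httpGet:
    path: /healthz          # hypothetical endpoint
    port: 8443              # hypothetical HTTPS port
    scheme: HTTPS
  periodSeconds: 10
  timeoutSeconds: 15        # raised from the 1s default so slow responses don't count as failures
  failureThreshold: 3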
On v1.21.2 AKS - same issue. Resolved using @mayanksinghse’s fix increasing timeoutSeconds to 10. Default-backend has timeoutSeconds set to 5 by default. Will attempt to decrease from 10 and look for anything useful.
Odd issue since it's only impacting a single cluster – others with the same config intermittently show the same error but do not go into CrashLoop.
Somebody pointed me to this link: https://cloud.tencent.com/developer/article/1583706 (It’s in Chinese, but most important information is in English)
It's recommended to insert this pod config into your pod template. The most important option is single-request-reopen.
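A minimal sketch of that pod-level DNS tweak, using the standard dnsConfig field of the Pod spec (note that single-request-reopen is a glibc resolver option, so it has no effect in musl-based images such as Alpine):

template:
  spec:
    dnsConfig:
      options:
      - name: single-request-reopen   # retry the DNS query on a fresh socket if a UDP reply goes missing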
kubectl get event --all-namespaces shows some pods failing their readinessProbe from time to time. I identified the cause: the pod spent 5s doing a DNS lookup. Further investigation showed that the node where the failing pod ran lost incoming UDP packets from time to time, so the pod resent the DNS query 5s later, but the readinessProbe timeout is exactly 5s.
After rebooting the k8s node where the failing pod ran, and making sure the coredns pods are not scheduled to these suspicious nodes, the issue disappeared.
I guess the physical machine the k8s node runs on has traffic congestion sometimes, or the Linux kernel has some bug that drops incoming UDP packets.
@thockin Conntrack shows hardly 2 or 3 errors. Memory consumption is 60-65% per node.
Just found that the timeouts were for almost all requests and not just probes. We added Istio lately to check connection stats and understand whether the behaviour was due to the application, but the findings were weird: Istio itself is now failing its readiness probe quite frequently.
157 failures in ~3 hours
Would like to add that Kubernetes is running in our datacenter, and since the iptables version is 1.4.21, --random-fully is not available. But since all machines have the same configuration, we ruled out this possibility.
deployment
pprof shows a considerable amount of time dialing/http probing
I have to compare against exec probes too.
Checking my machine default values:
Resources:
Timers:
3 probes per container every second create 3 TCP connections per second per pod (each has to wait 60 seconds in the TIME-WAIT state, holding the socket), and each consumes an ephemeral port (we have 28231 per tuple) and a conntrack entry. Once the TIME-WAIT timer expires, the conntrack entry and the socket are freed.
In the example above, we create connections at a rate of 3 per pod; with 100 pods that is 300 connections per second, each freed after 60 seconds.
So we'll have a constant number of 59 x 300 = 17700 sockets in TIME-WAIT and conntrack entries just for the probes.
That doesn't seem worrisome if we have 262144 conntrack entries, but it leaves only 28231 - 17700 ephemeral ports for things like DNS and any other hostNetwork stuff.
I've seen this behaviour with on-prem clusters on 1.19 and now with 1.20 on large bare-metal nodes. In our findings so far, it seems that if the pod filesystem is slow, the HTTP probes will fail at intervals. The more PVCs attached to the pod, the more likely something like time df -h will show a value over a few seconds. If there are no PVCs attached, we don't see any HTTP probe failures.
Now facing this mainly for the tiller pod (helm); all other pods are generally working fine. I shut down all the EKS worker nodes and then recreated them, and that seems to solve the issue. Looks like a Kubernetes 1.16 issue to me.
Experiencing this issue in multiple GKE clusters; it just started about 5 days ago. Right now I'm having to drain the bad node and remove it from the pool. Running 1.16.9-gke.6.
@thockin
returns-console-558995dd78-b6bjm 0/1 Running 0 23h
For example, if you look at the get pods output above, the readiness probe of one pod is failing almost every day. A restart fixes it, but we're unable to find the root cause, and I don't see any abnormal numbers for CPU, memory, or thread count. Doing a describe pod only tells me that the readiness probe has failed. How do we debug such a scenario?
Also, this is happening for this particular deployment only. The other pods are running fine.
returns-console-558995dd78-pbjf8 1/1 Running 0 23h
returns-console-558995dd78-plqdr 1/1 Running 0 23h
returns-console-558995dd78-tlb5k 1/1 Running 0 23h
returns-console-558995dd78-tr2kd 1/1 Running 0 23h
returns-console-558995dd78-vfj2n 1/1 Running 0 23h
That's a can of worms; the http1 vs http2 thing is complex. Let's let this soak and revisit once we are able to get some data about the impact of this change.
let me see how far I go with 1
Experiencing this problem in a production-level application. I am running a Node.js application on Kubernetes and the liveness probe keeps killing the application from time to time, usually under high load. There was no problem with httpGet or 2 timeoutSeconds in the probes without high load.
Initial settings
- A /health endpoint which just returns 200 ok.
- httpGet with timeoutSeconds: 2
What I've done
- Adjusted keep-alive settings
- Switched httpGet to the exec: command: curl ... thing
And it is still happening. Now I'm about to remove the liveness probe after finding out this comment. I guess a readiness probe is enough for a Node.js app.
Has anyone ever run into issues where the probes fail occasionally with EOF (what does an EOF imply, no response?)? I have been able to confirm that the service is responsive and the /health endpoint works, but it would fail from time to time with a connection refused.
Thanks
Thank you so much for the help! My problem was solved with a simple exec curl, fantastic!! Thanks to all, and especially to @alex-kovshovik @pskrzyns.
Environment: Amazon EKS, K8s version 1.19, AMI release version 1.19.6-20210628, AMI type AL2_x86_64, instance type c5a.2xlarge
We've been having the same issue since Monday this week and have tried everything. One interesting observation: the frequency of the occurrence is a lot higher when the nodes have significant memory usage… This could be our app leaking memory, but in that case I would expect K8s to let the app use up to 99% of the memory limit and then "Evict" the pod and start a new one. That would have been totally fine behavior, but instead we're getting these probe failures.
Our nodes all have 16 GB of RAM, but Datadog monitoring shows that the memory usage somehow exceeds the total available memory:
If the host OS is swapping memory, this could explain why the application processes requests extremely slowly, slower than the liveness probe timeout value (I increased it from 1 to 3 seconds to compensate).
But why would the nodes be swapping memory? I thought K8s does not allocate pods if there isn’t enough physical memory available?
Hello @eldada, how did you manage to get this issue fixed? Could you paste the entire block you changed? I am using Calico, so I need to check for the changes.
Same here, I’ve noticed the same issue in my GKE cluster. Don’t know when it started, but it happens for many containers which I haven’t updated in more than a year and now they have a lot of probes failing whereas before they never did, so it’s certainly not the container.
I'm getting the same issue. A couple of Nginx+PHP Pods are running on huge instances in parallel with a couple of other small applications. These are staging apps and nodes without constant traffic. I constantly receive notifications that the Nginx+PHP app has restarted… specifically the Nginx container of these Pods. At the same time, other apps running in the same namespace and on the same nodes never restart.
Liveness on an Nginx container looks like this:
UPD: The strange thing is that completely distinct deployments you can see above (staging4, staging5, stagingN - more than 10 deployments) fail at once.
My possible problem might be the missing timeoutSeconds, which defaults to 1s.
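For context, when those fields are omitted the kubelet applies the standard defaults, which written out explicitly look roughly like this (path and port here are only illustrative):

livenessProbe:
  httpGet:
    path: /                  # illustrative
    port: 80                 # illustrative
  initialDelaySeconds: 0     # default
  periodSeconds: 10          # default
  timeoutSeconds: 1          # default: a probe taking longer than 1s counts as a failure
  successThreshold: 1        # default
  failureThreshold: 3        # default: 3 consecutive failures restart the container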
This is the first such report I have seen. There’s nothing obvious about why this would happen.
It's possible kubelet is too busy and starved for CPU, and the probe happened to be the thing that got thrashed. How many pods are on this machine? How busy is it?
It’s possible the node itself is thrashing or something and OOM behavior is weird. Does dmesg show any OOMs?
It’s possible some other failure down deep in kubelet is being translated into this? You could try running kubelet at a higher log level to get more details on what is happening.
A lot of bugs have been fixed since 1.12, so we’d need to try to reproduce this and then try again in a more recent version. Is there any way you can help boil down a simpler reproduction?