kubernetes: Sometimes Liveness/Readiness Probes fail because of net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

What happened: In my cluster the readiness probes sometimes fail, but the application works fine.

Apr 06 18:15:14 kubenode** kubelet[34236]: I0406 18:15:14.056915   34236 prober.go:111] Readiness probe for "default-nginx-daemonset-4g6b5_default(a3734646-77fd-11ea-ad94-509a4c9f2810):nginx" failed (failure): Get http://172.18.123.127:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

What you expected to happen: Successful Readiness Probe.

How to reproduce it (as minimally and precisely as possible): We have a few clusters with different workloads. We see this issue only in the cluster with a large number of short-lived pods, and not on all nodes. We can’t reproduce this error on our other clusters (which have the same configuration but a different workload). How did I find the issue? I deployed a DaemonSet:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: default-nginx-daemonset
  namespace: default
  labels:
    k8s-app: default-nginx
spec:
  selector:
    matchLabels:
      name: default-nginx
  template:
    metadata:
      labels:
        name: default-nginx
    spec:
      tolerations:
      - operator: Exists
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          limits:
            cpu: "1"
            memory: "1Gi"
          requests:
            cpu: "1"
            memory: "1Gi"
        readinessProbe:
          httpGet:
            path: /
            port: 80

Then I started watching the events on all pods of this DaemonSet. After a while I received the following events:

Warning  Unhealthy  110s (x5 over 44m)  kubelet, kubenode20  Readiness probe failed: Get http://172.18.122.143:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  11m (x3 over 32m)  kubelet, kubenode10  Readiness probe failed: Get http://172.18.65.57:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
....

Those events occurred on roughly 50% of the pods of this DaemonSet.

On the nodes where the pods with failed probes were placed, I collected the kubelet logs. They contained errors like:

Apr 06 14:08:35 kubenode20 kubelet[10653]: I0406 14:08:35.464223   10653 prober.go:111] Readiness probe for "default-nginx-daemonset-nkwkf_default(90a3883b-77f3-11ea-ad94-509a4c9f2810):nginx" failed (failure): Get http://172.18.122.143:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

I suspected that the nginx in the pod sometimes really does respond slowly. To rule out this theory, I created a short script that curls the application in the pod and stores the response time in a file:

while true; do curl http://172.18.122.143:80/ -s -o /dev/null -w  "%{time_starttransfer}\n" >> /tmp/measurment.txt; done;

I ran this script for 30 minutes on the node where the pod is placed and got the following:

$ cat /tmp/measurment.txt | sort -u
0.000
0.001
0.002
0.003
0.004
0.005
0.006
0.007

$ cat /tmp/measurment.txt | wc -l
482670

There were 482670 measurements and the longest response time was 0.007 seconds.

The pod logs contain only messages with response code 200 (from my requests and from the readiness probes):

[06/Apr/2020:14:06:30 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
......
[06/Apr/2020:14:08:35 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
[06/Apr/2020:14:08:35 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
[06/Apr/2020:14:08:35 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
......
[06/Apr/2020:14:08:41 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"

This means that some of the probes are successful.

Then I stopped the curl script (because of the large volume of logs) and waited until a new failed-probe error appeared in the kubelet logs:

Apr 06 18:15:14 kubenode13 kubelet[34236]: I0406 18:15:14.056915   34236 prober.go:111] Readiness probe for "default-nginx-daemonset-4g6b5_default(a3734646-77fd-11ea-ad94-509a4c9f2810):nginx" failed (failure): Get http://172.18.123.127:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

But in the logs of that nginx pod I could not find the request generated by this probe:

kubectl logs default-nginx-daemonset-4g6b5 | grep "18:15"
[06/Apr/2020:18:15:00 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:20 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:30 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:40 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:50 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"

If I restart the kubelet, the error does not disappear. Does anyone have any suggestions about this?

Environment:

  • Kubernetes version: 1.12.1
  • Cloud provider or hardware configuration: **hardware
  • OS (e.g: cat /etc/os-release): ubuntu 16.04
  • Kernel (e.g. uname -a): 4.15.0-66-generic
  • Install tools:
  • Network plugin and version (if this is a network-related bug): calico:v3.1.3

Seems like the problem appears in many different installations - https://github.com/kubernetes/kubernetes/issues/51096

/sig network

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 139
  • Comments: 149 (39 by maintainers)

Most upvoted comments

Our production issues are now fully resolved.

Thank you everyone for sharing your solutions!!!

We’re running on latest EKS, K8s version 1.20.

Our solution

  1. Replace httpGet probes with basic exec + curl (see below).
    • This was suggested by a few people above.
    • My guess is that when the nodes are under high load, the kubelet process couldn’t run the httpGet check fast enough to verify the health status of a pod. This makes k8s “think” that a few pods are unhealthy, so k8s kills them and spins up new ones to replace them. This makes the problem worse, because while a few pods are unhealthy the rest have to bear the load. On top of that, pod startup is CPU-intensive in our case; it takes about 45 seconds (I know - slow).
    • exec probe runs in an existing running pod/container, so it’s very efficient and lightweight.
  2. Remove livenessProbe, tweak the timings of readinessProbe to continuously monitor pod health, and use startupProbe instead of readinessProbe for startup.
    • This was my misunderstanding of probes: I used to think that readinessProbe is only used for pod startup and once the pod is started and healthy - livenessProbe takes over and monitors its “liveness” status. It turns out, I was completely wrong: https://stackoverflow.com/questions/65858309/why-do-i-need-3-different-kind-of-probes-in-kubernetes-startupprobe-readinessp
    • It turns out, startupProbe is what I needed to wait for the pod to start up - duh.
    • readinessProbe is used to continuously monitor the pod, and make sure it is “ready” to receive HTTP requests (in our case).
    • livenessProbe is only needed when the application is incapable of dying on its own (Erlang, Elixir?). Our app is Rails, so if it dies, the container process exits with a non-zero status code. In this case K8s simply starts a new pod automatically, no need to check “liveness”.

Health check probes

startupProbe:
  httpGet:
    path: /user/sign_in
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 5
  failureThreshold: 30
  successThreshold: 1
  timeoutSeconds: 1
readinessProbe:
  exec:
    command:
      - "curl"
      - "--fail"
      - "-o"
      - "/dev/null"
      - "http://localhost:3000/user/sign_in"
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1
  timeoutSeconds: 1

The -o /dev/null is needed to discard the page HTML, so that we don’t pollute our logs with it.

Conclusion

It’s great to see the system purring peacefully, even under high load:

[screenshot: system metrics under high load]

Any update on this? We have been facing similar issues for a few weeks.

Adding to this: (@cseifert-betterdoc and I both work on the same issue within our K8s / EKS cluster, so the version etc. posted above apply to this)

I think we have proof that there is a networking issue between the kubelet and the container.

Our Problem

We have had

  • an HTTP-based health check in place; and
  • an HTTP-based readiness endpoint.

We have had problems with our cluster, as it restarted containers seemingly at random, giving failing health checks as the reason.

Of course, we started to dig into why our container fails (as we all know, “SQL is not broken”). After a lot of deep and thorough digging into metrics and logs, we came to the conclusion that our container (running a Rails-based web app) is just fine.

What we did

We concluded that our network setup is somehow flawed (which is strange, given that we use EKS for a reason…), so we switched the health check to a simple exec command to see if this mitigates the problem.

We left the readiness endpoint alone, so this is still using HTTP.

What we see now

Now, we see readiness probe failures, but no health check failures any more.

In addition, we can see that our app is happily serving HTTP requests, while we see times where the readiness endpoint is not called.

[chart: Requests]

Our conclusion (so far)

Our view on this:

  1. we see that the container is reachable and responsive;
  2. at the same time, the readiness check endpoint is NOT being served by the app.

This means that the kubelet, for some reason, is not able to send readiness probe HTTP requests to the container for some seconds, and then starts to get them through later.

At the same time, requests coming in from an external load balancer are still served.

As far as we understand, the kubelet process runs on the node, so this looks like the networking for “node-local” requests is unreliable.

This looks like a serious issue to us, as the health checks killed our app without good reason, repeatedly.


I understand that this problem looks a lot like people are using K8S for unstable apps and blame the platform instead of doing their homework — but, as far as I can tell, what we see proves that this is not the case for us and that there is a problem.

It would be great to get some feedback on this issue – let me know if there is a misguided thought. This is complex stuff, after all. 😃

We have replaced the httpGet with an exec doing basically the same thing, which so far works without any problems.

  readinessProbe:
#    httpGet:
#      path: "/__META__/ready"
#      port: 3000
    exec:
      command:
        - "curl"
        - "--fail"
        - "http://localhost:3000/__META__/ready"

Observed the same issue (k8s v1.22.1, datacenter, RHEL 8.3); timeoutSeconds: 10 resolved it for us.


          initialDelaySeconds: 10
          timeoutSeconds: 10
          periodSeconds: 10
          failureThreshold: 10
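
For context, a complete probe block with these timings might look like the following sketch; the httpGet path and port are placeholders, not taken from that cluster:

readinessProbe:
  httpGet:
    path: /healthz          # hypothetical endpoint
    port: 8080              # hypothetical port
  initialDelaySeconds: 10
  timeoutSeconds: 10        # raised from the 1s default, which is what resolved the failures above
  periodSeconds: 10
  failureThreshold: 10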

We have reviewed our findings on this, and here is my update:

  1. The ready endpoint is called by the kubelet; the gaps were due to an unrelated issue with our logging back end.
  2. We have reviewed all calls and the response times for all readiness checks, and they were well within the expected time range.

We have had no outages any more since switching the health checks to an exec-based check, which still hints at a problem with HTTP-based checks.

Also, we have switched from a named port to a plain port number, but still see the readiness endpoint failing, so this does not influence the failure of HTTP probes for us.

We will now switch to exec based readiness probes and post if this helps.


Sorry for the noise, but I hope this helps to clarify the situation for people having this problem.

Seeing a similar behavior in an AKS cluster (Azure). Kubernetes v1.16.13, OS and kernel Ubuntu 16.04.7 LTS, 4.15.0-1103-azure. Since this is a managed Kubernetes, we don’t have much visibility into the managed part, but I can say we are trying to utilise the nodes well, so we pack them up with a lot of pods (up to about 50) that have many containers, so I might have ~200 containers running on a node. Replacing httpGet with a simple exec running curl resolved it completely!


Hello @eldada, how did you manage to get this issue fixed? Could you paste the entire block that changed? I am using Calico, so I need to check which changes apply.

@Snehil03 - we just moved to use the exec probe to replace the httpGet. Here is how it looks for one of our containers:

    livenessProbe:
      exec:
        command:
        - sh
        - -c
        - curl --fail http://localhost:8000/api/system/ping
      failureThreshold: 4
      initialDelaySeconds: 60
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 1

It eliminated the error for us.

Also, I suggest renaming the topic of this issue from “Sometimes” to “VERY frequently”, which a) reflects reality better and b) might move this issue up the priority list, because I find it kind of annoying that Kubernetes probes have been so flaky and unreliable for such a long time now without anyone making any real progress here.

@SergeyKanzhelev @dchen1107 @derekwaynecarr @mrunalp

It seems really problematic if users are abandoning liveness probes because they are not reliable. We should not ignore this.

Do we need to find a way to give the probers more CPU scheduling time? Higher realtime prio? Maybe a distinct process from kubelet?

We recently upgraded multiple k8s clusters to v1.20 and are now having this issue occur at a much higher rate than previously (on v1.18).

[chart: count of k8s events matching “context deadline exceeded” over time]

^You can see exactly when we started the upgrades this week. Those are k8s events that match context deadline exceeded.

Does anyone know if this issue affects v1.19+ ?

Yes, facing the same problem with a 1.20.2 cluster.

/triage accepted

Are you sure the application doesn’t hit its resource limits?

In my case, the application starts fine, then the container uses more and more resources until it hits the limit. After that the readiness probes fail.

We are experiencing the same issue on v1.21.12. The exec workaround mentioned above may work, but it comes with the downside that not all application containers have curl installed. Some of them might be distroless, and some might be upstream containers that we don’t own and can’t easily install another tool like curl into. So it would ultimately still be nice to find out why the kubelet runs into these issues.
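
For images without curl (distroless or third-party), one fallback some teams use instead of the exec workaround is a tcpSocket probe, which only verifies that the port accepts TCP connections rather than checking application health; a minimal sketch with a hypothetical port:

readinessProbe:
  tcpSocket:
    port: 8080              # hypothetical application port
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

It is a weaker signal than an HTTP check, but it avoids shipping curl inside the image.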

What happened: After updating EKS from v1.14.9-eks-f459c0 to v1.15.11-eks-065dce, some of our ReplicaSets’ and DaemonSets’ Liveness/Readiness probes started failing with Liveness probe failed: Get http://xxx.xxx.xxx.xxx:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

It seems that random pods are affected. I can successfully curl to them from other pods.

All nodes and pods are fine, no CPU/Memory/Storage issues.

Two nodes are running 1.15 and the rest 1.14. There doesn’t seem to be a correlation between node version and the issue.

Nothing obvious in kubectl cluster-info dump or journalctl -r or kube-proxy.log

Probes are mostly configured like this (truncated):

    livenessProbe:
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1

    readinessProbe:
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1

I will increase the timeouts but it would be good to solve this. What changed in 1.15?

What you expected to happen: Liveness/Readiness probes not to fail.

How to reproduce it (as minimally and precisely as possible): Upgrade EKS 1.14 to 1.15

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-16T00:04:31Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-065dce", GitCommit:"065dcecfcd2a91bd68a17ee0b5e895088430bd05", GitTreeState:"clean", BuildDate:"2020-07-16T01:44:47Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: EKS
  • OS (e.g: cat /etc/os-release): 1.14:
ami-048d37e92ce89022e amazon-eks-node-1.14-v20200507
$ uname -a
Linux ip-xxxxxxx.xxxxxx.compute.internal 4.14.177-139.253.amzn2.x86_64 #1 SMP Wed Apr 29 09:56:20 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

1.15

$ ami-0c42d7bd0e31ee2fe amazon-eks-node-1.15-v20200814
$ uname -a
Linux ip-xxxxxx.xxxxxx.compute.internal 4.14.186-146.268.amzn2.x86_64 #1 SMP Tue Jul 14 18:16:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Install tools: AWS Console
  • Network plugin and version (if this is a network-related bug): amazon-k8s-cni:v1.6.3-eksbuild.1

I did 1 and 2 XD #115143

For me it worked after increasing livenessProbe.timeoutSeconds to 15 seconds, and the liveness probe errors stopped. It seems my app takes a lot of time to respond over HTTPS with a slow network and low CPU resources. Thanks for the suggestion.

On v1.21.2 AKS - same issue. Resolved using @mayanksinghse’s fix increasing timeoutSeconds to 10. Default-backend has timeoutSeconds set to 5 by default. Will attempt to decrease from 10 and look for anything useful.

Odd issue, since it’s only impacting a single cluster – others with the same config intermittently show the same error but do not go into CrashLoop.

  • k8s-1.14.8 (Managed by Aliyun cloud provider)
  • CentOS 7.7.1908
  • linux kernel 3.10.0-1062.18.1.el7.x86_64
  • coredns 1.3.1

I guess the physical machine that the k8s node runs on sometimes has traffic congestion, or the Linux kernel has some bug that drops incoming UDP packets.

Somebody pointed me to this link: https://cloud.tencent.com/developer/article/1583706 (it’s in Chinese, but the most important information is in English).

It’s recommended to insert this dnsConfig into your pod template:

dnsConfig:
  options:
  - name: timeout
    value: "2"    # glibc default 5s
  - name: ndots
    value: "2"    # k8s default 5, glibc default 1
  - name: attempts
    value: "3"    # glibc default 2
  - name: single-request-reopen

The most important option is single-request-reopen.
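
For placement, dnsConfig sits at the pod spec level, next to containers; a minimal sketch with a hypothetical pod name and placeholder image:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod          # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:latest      # placeholder image
  dnsConfig:
    options:
    - name: single-request-reopen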

  • k8s-1.14.8 (Managed by Aliyun cloud provider)
  • CentOS 7.7.1908
  • linux kernel 3.10.0-1062.18.1.el7.x86_64
  • coredns 1.3.1

kubectl get event --all-namespaces shows some pods failing to respond to the readinessProbe from time to time:

xxxx        24m         Warning   Unhealthy                 pod/xxxx-7685fc948-qpsrb          Readiness probe failed: Get http://172.26.25.5:8080/xxxx/check: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I identified that the cause was the pod spending 5s on a DNS lookup. Further investigation showed that the node the failing pod ran on lost incoming UDP packets from time to time, so the pod resent the DNS query 5s later, but the readinessProbe timeout is exactly 5s.

After rebooting the k8s node the failing pod ran on, and making sure coredns pods are not scheduled to these suspicious nodes, the issue disappeared.

I guess the physical machine that the k8s node runs on sometimes has traffic congestion, or the Linux kernel has some bug that drops incoming UDP packets.

@thockin Conntrack shows hardly 2 or 3 errors. Memory consumption is 60-65% per node.

Just found that the timeouts were for almost all requests and not just probes. We recently added Istio to check connection stats and understand whether the behaviour was due to the application. But the findings were weird: Istio itself is now failing its readiness probe quite frequently.


157 failures in ~3 hours

I would like to add that Kubernetes is running in our datacenter, and since the iptables version is 1.4.21, --random-fully is not available. But since all machines have the same configuration, we ruled out this possibility.

deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-deployment
  labels:
    app: MyApp
spec:
  replicas: 100
  selector:
    matchLabels:
      app: MyApp
  template:
    metadata:
      labels:
        app: MyApp
    spec:
      containers:
      - name: agnhost
        image: k8s.gcr.io/e2e-test-images/agnhost:2.40
        args:
          - netexec
          - --http-port=80
          - --udp-port=80
        ports:
        - containerPort: 80
        livenessProbe:
          periodSeconds: 1
          httpGet:
            port: 80
        readinessProbe:
          periodSeconds: 1
          httpGet:
            port: 80
        startupProbe:
          periodSeconds: 1
          httpGet:
            port: 80

pprof shows a considerable amount of time dialing/http probing


I have to compare against exec probes too

Checking my machine default values:

Resources:

cat /proc/sys/net/ipv4/ip_local_port_range
32768   60999 

net.netfilter.nf_conntrack_max = 262144

Timers:

net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 3600
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300

3 probes per container every second create 3 TCP connections per second. Each connection has to wait 60 seconds in the TIME-WAIT state holding the socket, and each consumes an ephemeral port (we have 28231 per tuple) and a conntrack entry. Once the TIME-WAIT timer expires, it frees the conntrack entry and the socket.

In the example above, we create connections at a rate of 3 per second per pod; with 100 pods that is 300 connections per second, each freed after 60 seconds.

So we’ll have a constant number of 59 x 300 = 17700 sockets in TIME-WAIT and conntrack entries just for the probes.


conntrack -L | grep TIME_WA | grep  src=10.244.0.1  | wc
conntrack v1.4.6 (conntrack-tools): 23939 flow entries have been shown.
  23842  357630 3649867

That doesn’t seem worrisome if we have 262144 conntrack entries, but it leaves only 28231 - 17700 = 10531 ephemeral ports for things like DNS and any other hostNetwork stuff.
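
As a rough illustration of how much the probe period drives this arithmetic, relaxing the repro manifest above from periodSeconds: 1 to periodSeconds: 10 on each probe cuts the steady-state count by an order of magnitude (a sketch, using the same reasoning as the numbers above):

livenessProbe:
  periodSeconds: 10   # 3 probes per pod every 10s x 100 pods = ~30 new connections per second
  httpGet:
    port: 80
readinessProbe:
  periodSeconds: 10
  httpGet:
    port: 80
startupProbe:
  periodSeconds: 10
  httpGet:
    port: 80
# ~30 connections/s, each held ~60s in TIME-WAIT => ~1800 sockets and conntrack entries,
# versus ~17700 with periodSeconds: 1.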

I’ve seen this behaviour with on-prem clusters on 1.19 and now with 1.20 on large bare-metal nodes. In our findings so far, it seems that if the pod filesystem is slow, the HTTP probes will fail at intervals. The more PVCs attached to the pod, the more likely something like time df -h will show a value over a few seconds. If there are no PVCs attached, we don’t see any HTTP probe failures.

Now facing this mainly for the tiller pod (Helm); all other pods are generally working fine. I shut down all the EKS worker nodes and then recreated them, which seems to have solved the issue. Looks like a Kubernetes 1.16 issue to me.

Experiencing this issue in multiple GKE clusters. It just started about 5 days ago. Right now I’m having to drain the bad node and remove it from the pool. Running 1.16.9-gke.6.

@thockin
returns-console-558995dd78-b6bjm 0/1 Running 0 23h

For example, in the get pods output above, the readiness probe of one pod fails almost every day. A restart fixes it. However, we’re unable to find the root cause, and I don’t see any abnormal numbers for CPU, memory, or thread count. Doing a describe pod only tells me that readiness has failed. How do we debug such a scenario?

Also, this is happening for this particular deployment only. The other pods are running fine.

returns-console-558995dd78-pbjf8 1/1 Running 0 23h
returns-console-558995dd78-plqdr 1/1 Running 0 23h
returns-console-558995dd78-tlb5k 1/1 Running 0 23h
returns-console-558995dd78-tr2kd 1/1 Running 0 23h
returns-console-558995dd78-vfj2n 1/1 Running 0 23h

Will you revisit HTTP reuse?

(Replying to Antonio Ojea, Jan 18, 2023, #89898 (comment): “I have to leave 1 out, the results are very confusing and not deterministic. The way http reuses connections differs between http1 and http2, and the golang stdlib also behaves differently for those protocols. I think that option 2 for probes is the best option, less disruptive, and the results I obtained locally are very promising.”)

That’s a can of worms; the http1 vs http2 thing is complex. Let’s let this soak and revisit once we are able to get some data about the impact of this change.

let me see how far I go with 1

Experiencing this problem in a production-level application. I am running a Node.js application on Kubernetes and the liveness probe keeps killing the application from time to time, usually under high load. There was no problem with httpGet or timeoutSeconds: 2 in the probes without high load.

Initial settings

  • liveness probe, readiness probe both using Node.js app’s /health endpoint which just returns 200 ok.
  • httpGet with timeoutSeconds: 2

What I’ve done

  • check keep-alive settings
  • increase probe timeout from 2 -> 10
  • change probe method: httpGet to exec: command: curl ... thing
  • double the replicas (10 -> 20)

And it is still happening. Now I’m about to remove the liveness probe after finding this comment. I guess a readiness probe is enough for a Node.js app.
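
Putting those changes together, a readiness-only configuration along the lines described above might look like this sketch; the port is a placeholder, while the /health path and the 10s timeout come from the list above:

# No livenessProbe: if the Node.js process crashes, the container exits and gets restarted anyway.
readinessProbe:
  exec:
    command:
      - "curl"
      - "--fail"
      - "http://localhost:3000/health"   # port 3000 is hypothetical; /health per the comment above
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3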

Has anyone ever run into issues where the probes fail occasionally with EOF (what does an EOF imply, no response?)? I have been able to confirm that the service is responsive and the /health endpoint works, but it fails from time to time with a connection refused.

27m         Warning   Unhealthy           pod/blah   Readiness probe failed: Get "http://an_ip_address:8080/health": EOF
42m         Warning   Unhealthy           pod/blah    Liveness probe failed: Get "http://an_ip_address:8080/health": EOF

Thanks

Thank you so much for the help! My problem was solved with a simple exec curl, fantastic!! Thanks to all, and especially to @alex-kovshovik @pskrzyns.

Environment: Amazon EKS, K8s version 1.19, AMI release version 1.19.6-20210628, AMI type AL2_x86_64, instance type c5a.2xlarge

We’ve been having the same issue since Monday this week and have tried everything, but one interesting observation is that the frequency of the occurrence is a lot higher when the nodes have significant memory usage… This could be our app leaking memory, but in that case I would expect K8s to let the app use up to 99% of its memory limit and then “evict” the pod and start a new one. That would’ve been totally fine behavior, but instead we’re getting:

Liveness probe failed: Get “private-url”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Our nodes all have 16 GB of RAM, but Datadog monitoring shows that the memory usage exceeds the total available memory somehow:

[Datadog chart: node memory usage exceeding total available memory]

If the host OS is swapping memory, this could explain why the application processes requests extremely slowly, slower than the liveness probe timeout value (I increased it from 1 to 3 seconds to compensate).

But why would the nodes be swapping memory? I thought K8s does not allocate pods if there isn’t enough physical memory available?
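
One general Kubernetes point that may answer the question above (not something established in this thread): the scheduler places pods based on their resource requests, not their limits, so a node can be overcommitted when limits exceed requests. Setting requests equal to limits (Guaranteed QoS) is one way to rule that out; a sketch with hypothetical values:

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"    # the scheduler reserves this much per replica
  limits:
    cpu: "500m"
    memory: "1Gi"    # equal to requests => Guaranteed QoS (when set for every container), no overcommit from this pod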

Same here, I’ve noticed the same issue in my GKE cluster. I don’t know when it started, but it happens for many containers which I haven’t updated in more than a year; they now have a lot of probes failing whereas before they never did, so it’s certainly not the container.

I’m getting the same issue. A couple of Nginx+PHP pods run on huge instances in parallel with a couple of other small applications. These are staging apps and nodes without constant traffic. I constantly receive notifications that these Nginx+PHP apps have restarted… specifically the Nginx container of these pods. At the same time, other apps running in the same namespace and nodes never restart.

k -n staging get events
...
22m         Warning   Unhealthy                  pod/myapp-staging3-59f75b5d49-5tscw    Liveness probe failed: Get http://100.100.161.197:80/list/en/health/ping: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
23m         Warning   Unhealthy                  pod/myapp-staging4-7589945cc6-2nf4s    Liveness probe failed: Get http://100.100.161.198:80/list/en/health/ping: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
11m         Warning   Unhealthy                  pod/myapp-staging5-84fb4494db-5dvph    Liveness probe failed: Get http://100.104.124.220:80/list/en/health/ping: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
...

Liveness on an Nginx container looks like this:

  liveness:
    initialDelaySeconds: 10
    periodSeconds: 10
    failureThreshold: 3
    httpGet:
      path: /list/en/health/ping
      port: 80

UPD: The strange thing is that completely distinct deployments (as you can see above: staging4, staging5, stagingN; more than 10 deployments) fail at once.

My problem might be the missing timeoutSeconds, which defaults to 1s.
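
For reference, the same liveness block with the timeout spelled out might look like the sketch below; the value of 10 is only an example, not a recommendation from this thread:

  liveness:
    initialDelaySeconds: 10
    periodSeconds: 10
    failureThreshold: 3
    timeoutSeconds: 10       # explicit; the default is 1s
    httpGet:
      path: /list/en/health/ping
      port: 80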

This is the first such report I have seen. There’s nothing obvious about why this would happen.

It’s possible kubelet is too busy and starved for CPU, and the probe happened to be the thing that got thrashed. How many pods are on this machine? How busy is it?

It’s possible the node itself is thrashing or something and OOM behavior is weird. Does dmesg show any OOMs?

It’s possible some other failure down deep in kubelet is being translated into this? You could try running kubelet at a higher log level to get more details on what is happening.

A lot of bugs have been fixed since 1.12, so we’d need to try to reproduce this and then try again in a more recent version. Is there any way you can help boil down a simpler reproduction?