istio: Service stops receiving readiness probe requests from Kubernetes
Bug description
I have about 20 pods running on a Kubernetes cluster. After about 3 days, they begin to refuse readiness probe HTTP requests.
This is an example of “kubectl describe pod” output:
Warning Unhealthy 4m41s (x104971 over 4d21h) kubelet, digi (combined from similar events): Readiness probe failed: Get http://10.1.1.139:8080/v1/readiness: read tcp 10.1.1.1:35178->10.1.1.139:8080: read: connection reset by peer
Output from the same pod a few minutes later:
Warning Unhealthy 2m32s (x105570 over 4d21h) kubelet, digi (combined from similar events): Readiness probe failed: Get http://10.1.1.139:8080/v1/readiness: read tcp 10.1.1.1:35028->10.1.1.139:8080: read: connection reset by peer
It seems the process inside the container is not aware of any issue; the logs look completely normal. If I open a shell into the container (e.g. “kubectl exec -it”), I am able to perform an HTTP call to the readiness URL (curl on localhost), so the issue occurs only when making requests from outside the pod, which makes me think it is somehow related to networking --> Istio.
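To be concrete, the in-pod check that succeeds looks like this (the pod name is a placeholder):

kubectl exec -it <pod-name> -- curl -v http://localhost:8080/v1/readiness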
Making the readiness request from another pod fails as well, with the following output:
curl 10.1.1.139:8080/v1/readiness -v
*   Trying 10.1.1.139...
* TCP_NODELAY set
* Connected to 10.1.1.139 (10.1.1.139) port 8080 (#0)
> GET /v1/readiness HTTP/1.1
> Host: 10.1.1.139:8080
> User-Agent: curl/7.58.0
> Accept: */*
< HTTP/1.1 503 Service Unavailable
< content-length: 95
< content-type: text/plain
< date: Sun, 28 Jul 2019 10:48:37 GMT
< server: envoy
* Connection #0 to host 10.1.1.139 left intact
upstream connect error or disconnect/reset before headers. reset reason: connection termination
There are no restarts, neither on my pods nor on the Istio pods.
Deleting a pod (which causes Kubernetes to recreate it) solves the issue for about the next 3 days. After that, the issue starts over again.
Mutual TLS is not enabled.
Affected product area (please put an X in all that apply)
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior: Readiness probes work as intended.
Steps to reproduce the bug: I am not doing anything special… just running a pod on Kubernetes with an HTTP readiness probe enabled (a minimal sketch follows below).
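For illustration only, a minimal pod spec matching this setup might look like the following; the pod name, image, and probe period are hypothetical placeholders, with the probe path taken from the errors above:

apiVersion: v1
kind: Pod
metadata:
  name: example-app              # hypothetical name
  labels:
    app: example-app
spec:
  containers:
  - name: app
    image: example/app:latest    # placeholder image
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:                   # the kubelet probes this endpoint over HTTP
        path: /v1/readiness
        port: 8080
      periodSeconds: 5           # each probe is a new, short-lived connection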
Version (include the output of “istioctl version --remote” and “kubectl version”)
I was able to reproduce this issue in the following platforms / versions:
Istio: 1.1.1 / 1.1.8 / 1.2.2
Kubernetes: 1.12 / 1.13 / 1.15
Platform: AWS EKS / local Microk8s
How was Istio installed? Using the “Helm template” option.
Environment where the bug was observed (cloud vendor, OS, etc.): on cloud: AWS EKS; on local: Ubuntu 18.04.02.
About this issue
- State: closed
- Created 5 years ago
- Comments: 20 (5 by maintainers)
The TL;DR is that treating HTTP as TCP does not work when connections are short-lived and created frequently.
I successfully reproduced this: indeed, with an HTTP service port treated as TCP, the pod ends up with many sockets in TIME_WAIT state. I then did some study of networking basics. With many short-lived connections, the side that closes each connection has to wait for a while before the socket can be reclaimed, and during that wait the socket sits in TIME_WAIT state. The HTTP probe requests from the kubelet match these criteria, and eventually enough TIME_WAIT sockets accumulate in the node network to exceed the system limit, making it impossible to create new sockets for a while.
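If this is the failure mode, it should be observable on the affected node. A rough check (assuming a Linux node; these are the standard Linux sysctl names) could be:

# Count sockets currently stuck in TIME_WAIT on the node
ss -tan state time-wait | wc -l

# Ephemeral port range: bounds concurrent connections per source IP
sysctl net.ipv4.ip_local_port_range

# Kernel-wide cap on TIME_WAIT sockets
sysctl net.ipv4.tcp_max_tw_buckets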
Without taking a look at the Envoy configuration dump and the Pod/Deployment YAML, I can only make a rough guess.
In a past incident, a customer reported many CLOSE_WAIT sockets hanging on the app pod. I think that if we ensure the inbound cluster is configured with http_connection_manager, that might solve the issue. Can you check that the corresponding service port is named http.* per the Istio port naming convention, and that the inbound serving listeners therefore end up using the HTTP connection manager rather than the TCP proxy filter?
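Concretely, the naming convention means the Service port name must start with “http” so that Istio treats the traffic as HTTP rather than raw TCP. A hypothetical Service (names and ports are placeholders) would look like:

apiVersion: v1
kind: Service
metadata:
  name: example-app              # hypothetical name
spec:
  selector:
    app: example-app
  ports:
  - name: http-api               # the "http" prefix makes Istio apply HTTP handling
    port: 8080
    targetPort: 8080

One way to then verify which filter the inbound listener uses is “istioctl proxy-config listeners <pod-name>”, checking for the HTTP connection manager rather than the TCP proxy.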