istio: upstream connect error or disconnect/reset before headers. reset reason: connection termination

Bug description

Some time after I deploy a service, I start getting the error

upstream connect error or disconnect/reset before headers. reset reason: connection termination

The error occurs at random, but I have found that whenever it does occur, the connection is lost right after this line appears in the sidecar proxy log:

kubectl logs -n namespace $POD -c istio-proxy

2020-09-24T17:12:35.507004Z     warning envoy config    [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamAggregatedResources gRPC config stream closed: 13, 

After the error starts and I can no longer reach my Istio endpoint/service, roughly 30 minutes later I get this line again:

2020-09-24T17:43:49.473767Z     warning envoy config    [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamAggregatedResources gRPC config stream closed: 13,

and the connection is restored, so I can reach the service again. It does not happen all the time, but when it does, it takes almost exactly 30 minutes for the connection to be restored.

I ran a curl test every 30 seconds; the outage starts at 2020-09-24-17-12 and ends at 2020-09-24-17-43:
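
The check was roughly equivalent to the loop below (the prediction URL and request body here are placeholders, since the actual checker script is not attached):

while true; do
  date +%Y-%m-%d-%H-%M
  # hypothetical in-mesh prediction endpoint; substitute your own service URL and payload
  curl -s -H 'Content-Type: application/json' \
    -d '{"data":{"ndarray":[[1.0]]}}' \
    http://tracing-and-logging-rest-tracing.emtech.svc.cluster.local:8000/api/v1.0/predictions
  echo
  printf '#%.0s' {1..80}; echo
  printf '#%.0s' {1..80}; echo
  sleep 30
done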

tracing-rest-checker-pg8hp tracing-rest-checker 2020-09-24-17-11
tracing-rest-checker-pg8hp tracing-rest-checker {"data":{"names":["proba"],"ndarray":[[0.39731466202150834]]},"meta":{}}
tracing-rest-checker-pg8hp tracing-rest-checker
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker 2020-09-24-17-12
tracing-rest-checker-pg8hp tracing-rest-checker {"data":{"names":["proba"],"ndarray":[[0.39731466202150834]]},"meta":{}}
tracing-rest-checker-pg8hp tracing-rest-checker
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker 2020-09-24-17-12
tracing-rest-checker-pg8hp tracing-rest-checker upstream connect error or disconnect/reset before headers. reset reason: connection termination
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker 2020-09-24-17-13
tracing-rest-checker-pg8hp tracing-rest-checker upstream connect error or disconnect/reset before headers. reset reason: connection termination
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
...
...
...
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker 2020-09-24-17-42
tracing-rest-checker-pg8hp tracing-rest-checker upstream connect error or disconnect/reset before headers. reset reason: connection termination
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker 2020-09-24-17-43
tracing-rest-checker-pg8hp tracing-rest-checker upstream connect error or disconnect/reset before headers. reset reason: connection termination
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker 2020-09-24-17-43
tracing-rest-checker-pg8hp tracing-rest-checker upstream connect error or disconnect/reset before headers. reset reason: connection termination
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker 2020-09-24-17-44
tracing-rest-checker-pg8hp tracing-rest-checker {"data":{"names":["proba"],"ndarray":[[0.39731466202150834]]},"meta":{}}
tracing-rest-checker-pg8hp tracing-rest-checker
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################
tracing-rest-checker-pg8hp tracing-rest-checker 2020-09-24-17-44
tracing-rest-checker-pg8hp tracing-rest-checker {"data":{"names":["proba"],"ndarray":[[0.39731466202150834]]},"meta":{}}
tracing-rest-checker-pg8hp tracing-rest-checker
tracing-rest-checker-pg8hp tracing-rest-checker ################################################################################

Based on the suggestion, I also captured the proxy config dump before and after the outage with:

while true; do
  kubectl exec -n istio-system $(kubectl get pods -n istio-system -l app=istiod -o name) -- \
    curl 'localhost:8080/debug/config_dump?proxyID=tracing-and-logging-rest-tracing-0-model1-865b5596fd-jl2vp.emtech' \
    > after/after_config_dump.txt-$(date -d '1 hour ago' +%Y-%m-%d-%H-%M)
  sleep 300
done

Detailed logs are attached:

istio-bug.zip

[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[x] User Experience
[ ] Developer Infrastructure

Expected behavior

To be able to reach the endpoint at all times.

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version --short, and helm version if you used Helm)

$ istioctl version --remote
client version: 1.6.8
control plane version: 1.6.8
data plane version: 1.6.8 (7 proxies)

$ kubectl version --short
Client Version: v1.18.6
Server Version: v1.16.13

How was Istio installed? istioctl install

Environment where the bug was observed (cloud vendor, OS, etc.)

I have seen this bug on:

  • AWS EKS 1.17
  • Kind 1.17 cluster (on-prem)
  • Azure AKS 1.16

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 17
  • Comments: 39 (5 by maintainers)

Most upvoted comments

I just restarted my computer and it fixed it

We have faced this issue too, and I’ve found the root cause in our case. I hope this helps you. TL;DR: the root cause in our case was that the request time exceeded the socket timeout of the HTTP server, specifically a Node.js HTTP server. Increasing the socket timeout of the Node.js HTTP server solves it. You can find a similar setting on your HTTP server, or switch to an event-based approach instead of a long HTTP request.

Context: we have two services, service-a and service-b. They are NestJS projects running on Node 12. service-a makes HTTP requests to service-b.

service-a got a 503 with the error upstream connect error or disconnect/reset before headers. reset reason: connection termination, but service-b kept processing the request. It happened randomly. After tracing the logs, we saw that the error occurred when the request from service-a to service-b took too long. I tried configuring the Istio timeout, but it didn’t help.

I created two new sample services locally without Istio and set service-b to respond slowly. I got Error: socket hang up (code ECONNRESET) after 2 minutes, which made me think the root cause was the HTTP server, not Istio. Focusing on the Node.js error, I found that the Node HTTP server disconnects the socket after a certain amount of time, and that amount of time can be set via the timeout option. I increased the timeout value of service-b and it worked. You can find a similar setting on your HTTP server.

import { NestFactory } from '@nestjs/core';
import { Server } from 'http';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3000);
  // Raise the Node HTTP server's socket timeout above the default 2 minutes.
  const httpServer: Server = app.getHttpServer();
  httpServer.setTimeout(180000); // 3 minutes
}
bootstrap();

REMEMBER to also increase the socket timeout of service-a if the flow looks like the image below, because the HTTP request to service-a can be terminated for the same reason.

[Screenshot from 2022-07-17 12-25-48]

Conclusion:

  1. Increasing the socket timeout is a temporary fix. You should consider an event-based approach, or save the request state and use it to retry the request.
  2. The Istio timeout does work; it takes effect when Istio’s timeout is smaller than the HTTP server’s socket timeout (see the sketch after this list).
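
For illustration, a route-level Istio timeout could be set with something like the VirtualService below (a sketch only; the service name and timeout value are placeholders for your own mesh):

kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
  - service-b                # placeholder in-mesh host
  http:
  - route:
    - destination:
        host: service-b
    timeout: 300s            # only takes effect if smaller than the server's own socket timeout
EOF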

@howardjohn @ramaraochavali Can you please re-open this issue?

It has now been 2 years and many people are facing this issue in production. Can you please prioritize it, as it has serious implications for connections?

There have been many reports, and if further info is needed, I am sure we can all provide it.

Thanks!

Hey folks - I understand the concern when seeing errors like this. However, this issue is not going to be a helpful path forward. This is a very generic error message that basically means “the request did not work”. Without specific context (in the form of istioctl bug-report output, logs, configs, etc.), it’s impossible to make forward progress.

So I am not trying to dismiss this entirely - but please open new issues with details about your case so we can help.
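
In recent istioctl versions, a report archive can be generated with the command below (exact flags vary by version, so treat this as a sketch):

# collects mesh-wide logs and configuration into an archive you can attach to a new issue
istioctl bug-report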

Also see https://istio.io/latest/blog/2021/upcoming-networking-changes/ since that is a fairly basic one that could trigger this.

Thanks!

@puneetloya @a8j8i8t8 this should be the right issue, as I saw you observed this too.