envoy: Envoy intermittently responds with 503 UC (upstream_reset_before_response_started{connection_termination})

Title: Envoy intermittently responds with 503 UC (upstream_reset_before_response_started{connection_termination})

Description:


We run a fleet of Envoy servers as an edge proxy for a number of backend services. We configure these Envoy servers using an in-house xDS gRPC server; see the example configuration dump attached to this issue.

We started noticing intermittent 503 responses with the "UC" response flag and upstream_reset_before_response_started{connection_termination} in the response code details. These responses appear to be produced by Envoy itself rather than returned by the backend services. We have observed this issue while running Envoy v1.16 and v1.17.

We looked at some previously reported similar GitHub issues, such as:

We tried changing some configuration settings, namely:

  • enabling TCP keep-alives
  • adjusting idle timeouts (e.g. using the default values, disabling the idle timeout)

None of the changes mentioned above had any effect on the behavior of our Envoy fleet.
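
For reference, this is roughly the kind of cluster-level override we experimented with. It is only a minimal sketch: the cluster name and values are illustrative, and the field names are taken from the v3 Cluster API as it existed around v1.16/v1.17 (the upstream idle timeout later moved under typed_extension_protocol_options).

clusters:
- name: some_backend                 # illustrative cluster name
  # ... connect_timeout, type, load_assignment, etc. omitted ...
  upstream_connection_options:
    tcp_keepalive:                   # enable TCP keep-alives on upstream connections
      keepalive_time: 30             # idle seconds before the first probe
      keepalive_interval: 10         # seconds between probes
      keepalive_probes: 3
  common_http_protocol_options:
    idle_timeout: 300s               # upstream idle timeout (defaults to 1h; 0s disables it)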

Repro steps: We have no repeatable method for reproducing these errors. They are very sporadic, i.e. 1-2 per 100,000 requests, and are spread across various clusters. Some of these clusters receive a large and constant volume of requests, so it is very unlikely that their upstream connections ever become idle.

Config: See the tarball attached to this issue.

Logs: This issue occurs in our production environment, where we do not have debug-level logging enabled.

Can you please advise:

  • Why does Envoy terminate these connections?
  • Has this problem been reported by other Envoy users?
  • Is it possible that requests are not distributed evenly across all backend servers, which could lead to Envoy resetting upstream connections?
  • What can we do on our side to eliminate this problem?

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 26
  • Comments: 29 (5 by maintainers)

Most upvoted comments

It’s been a while since I last dealt with this, but my case ultimately came down to mismatched timeout settings: Envoy had a higher default idle timeout than one of the backend applications in an Istio mesh. Envoy would therefore try to keep the connection open longer than the backend application allowed, and the backend severed the connection once its own timeout was reached. The solution was to add DestinationRule resources that override the default idle timeout on the Envoy side. You could, of course, also adjust the backend application’s settings if that is practical/appropriate.

In case it’s useful: I figured this out by running tcpdump and noticing that the 503s were returned after exactly n seconds in every instance. From there it was just a matter of checking the backend’s and Envoy’s timeout values, confirming the match, and ultimately fixing the issue.
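
For what it’s worth, the override looked roughly like the following. This is a minimal sketch: the resource name, host, and timeout value are illustrative, and the value should be chosen to sit below the backend application’s own idle/keep-alive timeout.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-idle-timeout                 # illustrative name
spec:
  host: backend.default.svc.cluster.local    # illustrative backend service
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 25s                     # keep below the backend's own timeout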

Hi, we are also facing this 503 UC issue during high load in production, and can reproduce it in our load-testing environment at a rate of about 13 per 100k requests.

  • Envoy v1.20.7
  • Downstream and upstream are both Spring Boot Java apps

We are going to implement a retry policy as suggested above, but maybe this information can help with finding the root cause or an alternative solution.

Logs

No tcpdump has been captured so far.

[2022-12-02 18:08:52.760][23][debug][http] [source/common/http/conn_manager_impl.cc:274] [C4381] new stream
[2022-12-02 18:08:52.760][23][debug][http] [source/common/http/conn_manager_impl.cc:867] [C4381][S2002065306070302618] request headers complete (end_stream=false):
':authority', 'xxx:8080'
':path', '/api/v1/xxx'
':method', 'POST'
'accept', 'application/json'
'content-type', 'application/json'
'authorization', 'Bearer xxx'
'tenant', '76356188-e2de-4e7f-9628-3b637dab3960'
'x-request-id', '69b7c8f56d0dce335c07c6e6f4b01e12'
'x-b3-traceid', 'a63dd122acf1982e'
'x-b3-spanid', '620c330fe2028275'
'x-b3-parentspanid', '07dec65c20d7d27e'
'x-b3-sampled', '1'
'accept-encoding', 'gzip, deflate'
'content-length', '3301'
'connection', 'Keep-Alive'
'user-agent', 'Apache-HttpClient/4.5.13 (Java/17.0.4.1)'

[2022-12-02 18:08:52.760][23][debug][router] [source/common/router/router.cc:457] [C4381][S2002065306070302618] cluster 'xxx' match for URL '/api/v1/xxx'
[2022-12-02 18:08:52.761][23][debug][router] [source/common/router/router.cc:673] [C4381][S2002065306070302618] router decoding headers:
':authority', 'xxx:8080'
':path', '/api/v1/xxx'
':method', 'POST'
':scheme', 'http'
'accept', 'application/json'
'content-type', 'application/json'
'authorization', 'Bearer xxx'
'tenant', '76356188-e2de-4e7f-9628-3b637dab3960'
'x-request-id', '69b7c8f56d0dce335c07c6e6f4b01e12'
'x-b3-traceid', 'a63dd122acf1982e'
'x-b3-spanid', '620c330fe2028275'
'x-b3-parentspanid', '07dec65c20d7d27e'
'x-b3-sampled', '1'
'accept-encoding', 'gzip, deflate'
'content-length', '3301'
'user-agent', 'Apache-HttpClient/4.5.13 (Java/17.0.4.1)'
'x-forwarded-proto', 'http'
'x-envoy-expected-rq-timeout-ms', '120000'

[2022-12-02 18:08:52.761][23][debug][pool] [source/common/http/conn_pool_base.cc:74] queueing stream due to no available connections
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:256] trying to create new connection
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:144] creating a new connection
[2022-12-02 18:08:52.761][23][debug][client] [source/common/http/codec_client.cc:60] [C4599] connecting
[2022-12-02 18:08:52.761][23][debug][connection] [source/common/network/connection_impl.cc:890] [C4599] connecting to 10.101.11.122:8080
[2022-12-02 18:08:52.761][23][debug][connection] [source/common/network/connection_impl.cc:909] [C4599] connection in progress
[2022-12-02 18:08:52.761][23][debug][http] [source/common/http/filter_manager.cc:841] [C4381][S2002065306070302618] request end stream
[2022-12-02 18:08:52.761][23][debug][connection] [source/common/network/connection_impl.cc:672] [C4599] connected
[2022-12-02 18:08:52.761][23][debug][client] [source/common/http/codec_client.cc:88] [C4599] connected
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:294] [C4599] attaching to next stream
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:177] [C4599] creating stream
[2022-12-02 18:08:52.761][23][debug][router] [source/common/router/upstream_request.cc:416] [C4381][S2002065306070302618] pool ready
[2022-12-02 18:08:52.936][25][debug][connection] [source/common/network/connection_impl.cc:640] [C4578] remote close
[2022-12-02 18:08:52.936][25][debug][connection] [source/common/network/connection_impl.cc:249] [C4578] closing socket: 0
[2022-12-02 18:08:52.936][25][debug][client] [source/common/http/codec_client.cc:106] [C4578] disconnect. resetting 1 pending requests
[2022-12-02 18:08:52.936][25][debug][client] [source/common/http/codec_client.cc:142] [C4578] request reset
[2022-12-02 18:08:52.936][25][debug][router] [source/common/router/router.cc:1130] [C4250][S14424083748194906660] upstream reset: reset reason: connection termination, transport failure reason:
[2022-12-02 18:08:52.936][28][debug][connection] [source/common/network/connection_impl.cc:640] [C4587] remote close
[2022-12-02 18:08:52.936][28][debug][connection] [source/common/network/connection_impl.cc:249] [C4587] closing socket: 0
[2022-12-02 18:08:52.936][23][debug][connection] [source/common/network/connection_impl.cc:640] [C4561] remote close
[2022-12-02 18:08:52.936][23][debug][connection] [source/common/network/connection_impl.cc:249] [C4561] closing socket: 0
[2022-12-02 18:08:52.936][28][debug][client] [source/common/http/codec_client.cc:106] [C4587] disconnect. resetting 1 pending requests
[2022-12-02 18:08:52.936][28][debug][client] [source/common/http/codec_client.cc:142] [C4587] request reset
[2022-12-02 18:08:52.936][28][debug][router] [source/common/router/router.cc:1130] [C4166][S8149778245575148947] upstream reset: reset reason: connection termination, transport failure reason:
[2022-12-02 18:08:52.936][25][debug][http] [source/common/http/filter_manager.cc:953] [C4250][S14424083748194906660] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2022-12-02 18:08:52.936][23][debug][client] [source/common/http/codec_client.cc:106] [C4561] disconnect. resetting 1 pending requests
[2022-12-02 18:08:52.936][23][debug][client] [source/common/http/codec_client.cc:142] [C4561] request reset
[2022-12-02 18:08:52.936][23][debug][router] [source/common/router/router.cc:1130] [C4242][S16814319887871412706] upstream reset: reset reason: connection termination, transport failure reason:
[2022-12-02 18:08:52.936][25][debug][http] [source/common/http/conn_manager_impl.cc:1467] [C4250][S14424083748194906660] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Fri, 02 Dec 2022 18:08:52 GMT'

[2022-12-02 18:08:52.936][23][debug][http] [source/common/http/filter_manager.cc:953] [C4242][S16814319887871412706] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2022-12-02 18:08:52.936][28][debug][http] [source/common/http/filter_manager.cc:953] [C4166][S8149778245575148947] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2022-12-02 18:08:52.936][23][debug][http] [source/common/http/conn_manager_impl.cc:1467] [C4242][S16814319887871412706] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Fri, 02 Dec 2022 18:08:52 GMT'

Observation

It looks like the connection sits idle in the connection pool, and sometimes it stays there untouched for 60s; when it is then reused, the request fails with this error. The question is: where is this 60s timeout configured? On the upstream side? I was not able to find it there yet…
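
If that 60s does turn out to be the upstream’s HTTP keep-alive timeout (for Spring Boot’s embedded Tomcat the keep-alive timeout defaults to the connection timeout, which is commonly 60s, though that is worth verifying for your setup), one option is to keep Envoy’s upstream connection idle timeout below it so Envoy closes pooled connections before the upstream does. A minimal sketch using the v1.20-era cluster API; the cluster name and timeout value are illustrative:

clusters:
- name: xxx                           # the upstream cluster from the log above
  # ...
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      explicit_http_config:
        http_protocol_options: {}     # plain HTTP/1.1 to the upstream
      common_http_protocol_options:
        idle_timeout: 30s             # keep below the suspected 60s upstream keep-alive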

We just followed the Envoy documentation on retry policies:

match:
  prefix: /some/url/prefix
route:
  retry_policy:
    retry_on: reset

You can also specify num_retries, which defaults to 1; that is what we wanted, so we didn’t add it explicitly to the configuration.
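
For completeness, the same route entry with those defaults spelled out might look like this (num_retries: 1 is the documented default; per_try_timeout is optional and the value shown is purely illustrative):

match:
  prefix: /some/url/prefix
route:
  retry_policy:
    retry_on: reset
    num_retries: 1          # the default, shown only for clarity
    per_try_timeout: 5s     # optional upper bound per attempt; illustrative value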

Yeah, so if I had to bet, I’d guess your upstream is sending a response without a Connection: close header and then closing the connection. A tcpdump capture will tell you if this is indeed what’s happening (i.e. whether the upstream is sending that TCP FIN) and confirm or deny that you’re hitting the race I mentioned. Unfortunately there’s not a ton Envoy can do here today - generally we encourage folks to make sure that if the upstream is closing HTTP/1.1 connections, it uses the Connection: close header (which will cause Envoy to not reuse the connection and avoid the race entirely).
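
If it helps, a capture along these lines should show whether the upstream sends its FIN while Envoy still considers the connection reusable. 10.101.11.122:8080 is the upstream address from the debug log earlier in the thread; substitute your own, and run it on the Envoy host or inside its network namespace:

# write the raw capture to a file for later inspection in Wireshark/tshark
tcpdump -i any -nn -w upstream-8080.pcap 'host 10.101.11.122 and tcp port 8080'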

On Tue, Feb 9, 2021 at 12:28 PM aqua777 notifications@github.com wrote:

@alyssawilk AFAIK, these requests are indeed using HTTP/1.1. By upstream disconnect, do you mean that in your interpretation the connection is being reset by the upstream side?

What would you recommend for capturing a packet trace? Something like tcpdump, or do you have other tools in mind?
