envoy: Debugging 503s

Title: Debugging 503s

Description: We have a service that is returning 503s for about 1% of total traffic. The target service has 100 replicas and the calling service has 50 replicas. We have updated the circuit_breakers configuration as below:

circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 10000
      max_pending_requests: 10000
      max_requests: 10000
      max_retries: 3

The traffic is around 10,000 requests/second and the latency is around 200ms, which works out to roughly 2,000 concurrent requests in flight, so I would expect this configuration to be sufficient for this traffic.

From the stats, there is no upstream_rq_pending_overflow. We do have upstream_rq_pending_failure_eject and upstream_cx_connect_timeout. I can understand upstream_cx_connect_timeout, since we have a connect timeout of 0.25s. But what could be the other reasons for upstream_rq_pending_failure_eject? Also, any suggestions for debugging these 503s would be really helpful.
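For reference, this is roughly where that connect timeout lives in our cluster config (the cluster name here is just a placeholder):

clusters:
  - name: target_service      # placeholder name
    connect_timeout: 0.25s    # the value mentioned above; upstream_cx_connect_timeout
                              # increments whenever a TCP connect takes longer than this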

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 17 (5 by maintainers)

Most upvoted comments

For others that may be facing similar symptoms (low, but constant rate of HTTP 503’s), be sure to check out the idle_timeout_ms setting introduced in Ambassador 0.60+. https://www.getambassador.io/reference/timeouts/#idle-timeout-idle_timeout_ms

For example, we had a node.js application that appeared to have an idle timeout set to 5 seconds, but Ambassador’s default idle timeout is set to 5 minutes. After setting this to 4 seconds (idle_timeout_ms: 4000) for this service, our HTTP 503’s went away. Ensure that the idle timeouts on all proxies “in front of” your services are shorter than the ones behind them.
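Roughly, the Mapping attribute we set looks like this (apiVersion/kind as in Ambassador 0.60, with a placeholder service name and prefix):

apiVersion: ambassador/v1
kind: Mapping
name: node-app-mapping        # placeholder
prefix: /node-app/            # placeholder
service: node-app             # placeholder
idle_timeout_ms: 4000         # must be shorter than the upstream's own idle timeout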

@dnivra26 @Bplotka You can enable HTTP access logging to get the response_flags, which typically lists the reason for the error code. https://www.envoyproxy.io/docs/envoy/latest/configuration/access_log#format-rules
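A minimal file access log sketch using the current v3 field names (this issue predates v3, so older configs will look different); %RESPONSE_FLAGS% is the part that carries the reason, e.g. UF, UC, URX:

access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /dev/stdout
      log_format:
        text_format_source:
          # custom format; the default format string also includes %RESPONSE_FLAGS%
          inline_string: "[%START_TIME%] %REQ(:METHOD)% %REQ(:PATH)% %RESPONSE_CODE% %RESPONSE_FLAGS% %UPSTREAM_HOST%\n"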

We have a similar question on our side. We do NOT have any circuit breakers set explicitly and we have pretty low traffic, however we are occasionally getting 503s from gRPC requests between two Envoys.

My application sees: rpc error: code = FailedPrecondition desc = transport: received the unexpected content-type "text/plain"

I can see on “sidecar” envoy:

[2018-07-03 15:26:43.991][14][debug][router] source/common/router/router.cc:298] [C151754][S2751502683535461939]   'x-request-id':'ab61454a-cdfd-4b4a-a351-c5f9182484f0'
[2018-07-03 15:26:43.991][14][debug][pool] source/common/http/http2/conn_pool.cc:97] [C151755] creating stream
[2018-07-03 15:26:43.991][14][debug][router] source/common/router/router.cc:966] [C151754][S2751502683535461939] pool ready
[2018-07-03 15:26:43.991][14][debug][http] source/common/http/conn_manager_impl.cc:790] [C151754][S2751502683535461939] request end stream
[2018-07-03 15:26:43.992][14][debug][router] source/common/router/router.cc:590] [C151757][S15483029364081032919] upstream headers complete: end_stream=false
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:963] [C151757][S15483029364081032919] encoding headers via codec (end_stream=false):
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   ':status':'503'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'content-length':'19'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'content-type':'text/plain'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'date':'Tue, 03 Jul 2018 15:26:43 GMT'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'server':'envoy'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'x-envoy-upstream-service-time':'0'
[2018-07-03 15:26:43.992][14][debug][client] source/common/http/codec_client.cc:94] [C151755] response complete
[2018-07-03 15:26:43.992][14][debug][pool] source/common/http/http2/conn_pool.cc:189] [C151755] destroying stream: 1 remaining
[2018-07-03 15:26:43.992][14][debug][http2] source/common/http/http2/codec_impl.cc:517] [C151757] stream closed: 0
[2018-07-03 15:26:43.992][14][debug][http2] source/common/http/http2/codec_impl.cc:517] [C151755] stream closed: 0

I cannot see anything corresponding on the target Envoy (next hop). It looks like either the 503 was returned without any log, or the sidecar Envoy responded with 503 immediately?

Cannot see anything in metrics described here: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/circuit_breaking

…apart from a drop in rate(envoy_cluster_upstream_rq_total[1m]) because of the 1 missed request (my app is doing 2 requests every 5s).

This is interesting though: 'x-envoy-upstream-service-time':'0'

Any clue what am I hitting here?

I have seen 503s returned by envoy when upstream silently closes an idle connection but envoy is not aware of this event and tries to send a new request on this connection. Configuring retry_policy at the route level or configuring keepalives for the upstream cluster helped.
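A sketch of what such a route-level retry policy can look like (v3 field names, placeholder cluster name); "reset" covers the case where the upstream closed the pooled connection before replying:

route:
  cluster: target_service                 # placeholder cluster name
  retry_policy:
    retry_on: "connect-failure,reset"     # retry requests that hit a stale or refused connection
    num_retries: 2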

This might be related…

Envoy does connection pooling to the upstream. So "queueing request due to no available connections" only means that there is no established connection to re-use, and hence Envoy will initiate a new connection to the upstream. This should not cause any issue. For the 503s, I did recommend a few settings before.

More logs below. Regarding "queueing request due to no available connections": from the Envoy stats, active connections did not even cross ~400, and the default circuit-breaking max connections is 1024. What could be the reasons for Envoy saying there are no available connections?

[2018-07-07 23:12:40.396][130][debug][pool] source/common/http/http1/conn_pool.cc:88] [C36833] using existing connection
[2018-07-07 23:12:40.397][130][debug][pool] source/common/http/http1/conn_pool.cc:206] [C36833] response complete
[2018-07-07 23:12:40.397][130][debug][pool] source/common/http/http1/conn_pool.cc:244] [C36833] moving to ready
[2018-07-07 23:12:40.398][132][debug][pool] source/common/http/http1/conn_pool.cc:88] [C35752] using existing connection
[2018-07-07 23:12:40.405][130][debug][pool] source/common/http/http1/conn_pool.cc:79] creating a new connection
[2018-07-07 23:12:40.405][130][debug][pool] source/common/http/http1/conn_pool.cc:105] queueing request due to no available connections
[2018-07-07 23:12:40.406][130][debug][pool] source/common/http/http1/conn_pool.cc:249] [C36950] attaching to next request
[2018-07-07 23:12:40.415][130][debug][pool] source/common/http/http1/conn_pool.cc:206] [C36950] response complete
[2018-07-07 23:12:40.415][130][debug][pool] source/common/http/http1/conn_pool.cc:244] [C36950] moving to ready
[2018-07-07 23:12:40.415][132][debug][pool] source/common/http/http1/conn_pool.cc:206] [C35752] response complete
[2018-07-07 23:12:40.415][132][debug][pool] source/common/http/http1/conn_pool.cc:244] [C35752] moving to ready
[2018-07-07 23:12:40.459][132][debug][pool] source/common/http/http1/conn_pool.cc:88] [C35752] using existing connection

@dnivra26 Might be this: https://github.com/envoyproxy/envoy/issues/2715. Try turning on keepalives and check if that helps.
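A minimal sketch of enabling TCP keepalives on the upstream cluster (cluster name and probe timings are placeholders), so Envoy notices connections the upstream has silently dropped:

clusters:
  - name: target_service              # placeholder cluster name
    upstream_connection_options:
      tcp_keepalive:
        keepalive_time: 30            # seconds of idleness before probes start
        keepalive_interval: 5         # seconds between probes
        keepalive_probes: 3           # failed probes before the connection is dropped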