envoy: Debugging 503s
Description: We have a service that is returning 503s for about 1% of total traffic. The target service has 100 replicas and the calling service has 50 replicas. We have updated the circuit_breakers configuration as below:
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 10000
    max_pending_requests: 10000
    max_requests: 10000
    max_retries: 3
The traffic is around 10,000 requests/second and the latency is around 200 ms, so I would guess this configuration is sufficient to handle the traffic.
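As a sanity check on that guess, Little's Law gives the expected number of in-flight requests from the rate and latency figures stated above:

```python
# Little's Law: average in-flight requests = arrival rate x average latency.
rate_rps = 10_000   # ~10,000 requests/second (from the question)
latency_s = 0.200   # ~200 ms average latency (from the question)

in_flight = rate_rps * latency_s
print(in_flight)  # 2000.0
```

So in steady state there should be roughly 2,000 concurrent requests, comfortably below the 10000 max_requests threshold, which matches the absence of upstream_rq_pending_overflow.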
From the stats, there is no upstream_rq_pending_overflow. We do see upstream_rq_pending_failure_eject and upstream_cx_connect_timeout. I can understand upstream_cx_connect_timeout, since we have a connection timeout of 0.25s, but what could be the other reasons for upstream_rq_pending_failure_eject? Any suggestions for debugging these 503 issues would also be really helpful.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 17 (5 by maintainers)
For others that may be facing similar symptoms (low, but constant rate of HTTP 503’s), be sure to check out the
idle_timeout_ms
setting introduced in Ambassador 0.60+: https://www.getambassador.io/reference/timeouts/#idle-timeout-idle_timeout_ms For example, we had a Node.js application that appeared to have an idle timeout of 5 seconds, while Ambassador’s default idle timeout is 5 minutes. After setting this to 4 seconds
idle_timeout_ms: 4000
for this service, our HTTP 503s went away. Ensure that the idle timeouts on all proxies “in front of” your services are shorter than the idle timeouts of the services behind them.

@dnivra26 @Bplotka You can enable HTTP access logging to get the
response_flags
which typically lists the reason for the error code: https://www.envoyproxy.io/docs/envoy/latest/configuration/access_log#format-rules

We have a similar question on our side. We do NOT have any
circuit breakers
set explicitly and we have pretty low traffic, yet we are occasionally getting 503s from gRPC requests between two Envoys. My application sees:
rpc error: code = FailedPrecondition desc = transport: received the unexpected content-type "text/plain"
I can see on “sidecar” envoy:
I cannot see anything corresponding on the target Envoy (next hop). It looks like either the 503 was returned without any log, or the sidecar Envoy responded with a 503 immediately?
Cannot see anything in metrics described here: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/circuit_breaking
…apart from a drop in
rate(envoy_cluster_upstream_rq_total[1m])
because of the one missed request (my app is doing 2 requests every 5s).

This is interesting, though:
'x-envoy-upstream-service-time':'0'
Any clue as to what I am hitting here?
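To tie this back to the earlier suggestion: capturing %RESPONSE_FLAGS% in the access log usually pinpoints which failure mode produced the 503 (e.g. UF for upstream connection failure, UO for circuit-breaker overflow, UT for upstream request timeout). A minimal file access log sketch using the current v3 API (the output path and exact placement under your HTTP connection manager are assumptions):

```yaml
access_log:
- name: envoy.access_loggers.file
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: /dev/stdout
    log_format:
      text_format_source:
        inline_string: >-
          [%START_TIME%] "%REQ(:METHOD)% %REQ(:PATH)%"
          %RESPONSE_CODE% %RESPONSE_FLAGS% %UPSTREAM_HOST%
```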
This might be related…
Envoy does connection pooling to the upstream, so
queueing request due to no available connections
only means that there were no established connections to re-use, and hence Envoy will initiate a new connection to the upstream. This should not cause any issue. For 503s,
I did recommend a few settings before.

More logs: queueing request due to no available connections. From Envoy stats, active connections did not even cross ~400, and the default circuit-breaker max_connections is 1024. What could be the reasons for Envoy saying there are no available connections?
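Since the intermittent failures reported in this thread involve gRPC between two Envoys, it is also worth double-checking that the upstream cluster is configured for HTTP/2, so a single long-lived connection gets multiplexed rather than the pool churning new connections. A sketch using the older shorthand field (cluster name and discovery type are assumptions):

```yaml
clusters:
- name: upstream_service   # assumed name
  connect_timeout: 0.25s
  type: STRICT_DNS         # assumed discovery type
  lb_policy: ROUND_ROBIN
  # gRPC requires end-to-end HTTP/2; without this, Envoy speaks
  # HTTP/1.1 to the upstream by default.
  http2_protocol_options: {}
```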
@dnivra26 Might be this: https://github.com/envoyproxy/envoy/issues/2715. Try turning on
keepalives
and check if that helps.
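For reference, upstream TCP keepalives can be enabled per cluster; a sketch of the relevant stanza (the cluster name and probe timings are assumptions to tune for your network):

```yaml
clusters:
- name: upstream_service   # assumed name
  connect_timeout: 0.25s
  upstream_connection_options:
    tcp_keepalive:
      keepalive_probes: 3     # failed probes before the connection is dropped
      keepalive_time: 30      # seconds of idle time before the first probe
      keepalive_interval: 10  # seconds between probes
```

This helps Envoy notice dead upstream connections (e.g. silently dropped by a load balancer or firewall) before a real request fails on them.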