istio: tiny percentage of 503s on a cluster with no change, at low qps

Running 100 qps through the ingress in the perf setup, I get a small percentage of 503s (only on multi-hop paths; maybe a timeout issue?). It also only happens with >= 20 connections or so.

35.199.14.248/fortio2/fortio/fetch/echosrv1:8080/echo

Starting at 100 qps with 30 thread(s) [gomax 2] : exactly 3000, 100 calls each (total 3000 + 0)
Ended after 30.071664869s : 3000 calls. qps=99.762
Sleep times : count 2970 avg 0.26524946 +/- 0.01577 min 0.192739205 max 0.292777981 sum 787.790887
Aggregated Function Time : count 3000 avg 0.037497644 +/- 0.01582 min 0.009672528 max 0.10991789 sum 112.492931
# range, mid point, percentile, count
>= 0.00967253 <= 0.01 , 0.00983626 , 0.03, 1
> 0.01 <= 0.012 , 0.011 , 0.17, 4
> 0.012 <= 0.014 , 0.013 , 0.70, 16
> 0.014 <= 0.016 , 0.015 , 2.63, 58
> 0.016 <= 0.018 , 0.017 , 5.33, 81
> 0.018 <= 0.02 , 0.019 , 8.17, 85
> 0.02 <= 0.025 , 0.0225 , 19.70, 346
> 0.025 <= 0.03 , 0.0275 , 33.83, 424
> 0.03 <= 0.035 , 0.0325 , 52.87, 571
> 0.035 <= 0.04 , 0.0375 , 67.87, 450
> 0.04 <= 0.045 , 0.0425 , 76.67, 264
> 0.045 <= 0.05 , 0.0475 , 82.83, 185
> 0.05 <= 0.06 , 0.055 , 91.00, 245
> 0.06 <= 0.07 , 0.065 , 94.97, 119
> 0.07 <= 0.08 , 0.075 , 98.10, 94
> 0.08 <= 0.09 , 0.085 , 98.87, 23
> 0.09 <= 0.1 , 0.095 , 99.60, 22
> 0.1 <= 0.109918 , 0.104959 , 100.00, 12
# target 50% 0.0342469
# target 75% 0.044053
# target 99% 0.0918182
# target 99.9% 0.107438
Code 200 : 2995 (99.8 %)
Code 503 : 5 (0.2 %)
Response Header Sizes : count 3000 avg 165.72233 +/- 6.772 min 0 max 167 sum 497167
Response Body/Total Sizes : count 3000 avg 166.09833 +/- 2.432 min 165 max 226 sum 498295
Saved result to data/2018-02-08-055310_Fortio.json
All done 3000 calls 37.498 ms avg, 99.8 qps

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 87 (76 by maintainers)

Most upvoted comments

I have run into this issue as well.

[xxx] [2018-10-02T16:12:44.409Z] "GET /glb HTTP/1.1" 503 UC 0 57 1 - "-" "GoogleHC/1.0" "ecb1c797-a58f-9c53-be4f-857bc2884456" "10.100.0.100" "127.0.0.1:3000"

It appears to affect apps running Node/Express but not Play. curl -v indicates that the Node app responds with Connection: keep-alive, while the Play app does not send that response header. The issue seems to go away after setting Connection: close in the response.
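For reference, the workaround looks roughly like this (a minimal sketch of an Express app exposing the /glb route on port 3000 from the log above; the real service obviously does more):

const express = require('express');
const app = express();

// Workaround: disable keep-alive on responses so Envoy opens a fresh
// upstream connection per request instead of reusing one the app may close.
app.use((req, res, next) => {
  res.set('Connection', 'close');
  next();
});

app.get('/glb', (req, res) => res.send('ok'));

app.listen(3000);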

  • Is this the recommended approach?
  • Couldn't Envoy retry these cases?
  • What is the performance implication of not using keep-alive connections?
  • Should this be in the Istio and/or Envoy docs?

Hey,

I just ran into what seemed to be the same issue. Our system is running Ambassador with Istio 1.0.5 (mTLS enabled) and a Node/Express server serving some static files. Usually that service gets hit with a few requests in a short burst, and sometimes one of those requests fails with a 503 UC. I've added Connection: close to the response header and it seems to fix the 503 responses, though I have the same questions as @jsw regarding the recommended approach and so forth.

I'd highly appreciate it if someone could take the time to clarify this; maybe it should also be documented somewhere?
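For the static-file case, the same header can be set through express.static's options; a minimal sketch assuming a plain Express setup (the 'public' directory and the port are just placeholders):

const express = require('express');
const app = express();

// Same workaround as above, applied to static file serving:
// express.static lets us set headers on every response it sends.
app.use(express.static('public', {
  setHeaders: (res) => {
    res.set('Connection', 'close');
  },
}));

app.listen(3000);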

A trace log for a 503 upstream connection:

[2018-03-03 00:11:36.931][20][debug][client] external/envoy/source/common/http/codec_client.cc:23] [C581] connecting
[2018-03-03 00:11:36.931][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:559] [C581] connecting to 127.0.0.1:8080
[2018-03-03 00:11:36.932][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:568] [C581] connection in progress
[2018-03-03 00:11:37.178][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:377] [C581] socket event: 2
[2018-03-03 00:11:37.178][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:445] [C581] write ready
[2018-03-03 00:11:37.180][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:455] [C581] connected
[2018-03-03 00:11:37.180][20][debug][client] external/envoy/source/common/http/codec_client.cc:52] [C581] connected
[2018-03-03 00:11:37.181][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:225] [C581] attaching to next request
[2018-03-03 00:11:37.184][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:314] [C581] writing 506 bytes, end_stream false
[2018-03-03 00:11:37.184][20][trace][connection] external/envoy/source/common/network/raw_buffer_socket.cc:67] [C581] write returns: 506
[2018-03-03 00:11:37.459][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:377] [C581] socket event: 2
[2018-03-03 00:11:37.459][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:445] [C581] write ready
[2018-03-03 00:11:37.487][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:377] [C581] socket event: 3
[2018-03-03 00:11:37.487][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:445] [C581] write ready
[2018-03-03 00:11:37.487][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:415] [C581] read ready
[2018-03-03 00:11:37.488][20][trace][connection] external/envoy/source/common/network/raw_buffer_socket.cc:25] [C581] read returns: 166
[2018-03-03 00:11:37.488][20][trace][connection] external/envoy/source/common/network/raw_buffer_socket.cc:25] [C581] read returns: 0
[2018-03-03 00:11:37.488][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:298] [C581] parsing 166 bytes
[2018-03-03 00:11:37.489][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=content-length value=0
[2018-03-03 00:11:37.489][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=date value=Sat, 03 Mar 2018 00:11:36 GMT
[2018-03-03 00:11:37.489][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=content-type value=text/plain; charset=utf-8
[2018-03-03 00:11:37.490][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=x-envoy-upstream-service-time value=49
[2018-03-03 00:11:37.490][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:354] [C581] headers complete
[2018-03-03 00:11:37.491][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:285] [C581] completed header: key=server value=envoy
[2018-03-03 00:11:37.497][20][debug][client] external/envoy/source/common/http/codec_client.cc:81] [C581] response complete
[2018-03-03 00:11:37.501][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:200] [C581] response complete
[2018-03-03 00:11:37.502][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:225] [C581] attaching to next request
[2018-03-03 00:11:37.503][20][trace][connection] external/envoy/source/common/network/connection_impl.cc:314] [C581] writing 509 bytes, end_stream false
[2018-03-03 00:11:37.504][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:315] [C581] parsed 166 bytes
[2018-03-03 00:11:37.504][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:439] [C581] remote close
[2018-03-03 00:11:37.505][20][debug][connection] external/envoy/source/common/network/connection_impl.cc:134] [C581] closing socket: 0
[2018-03-03 00:11:37.505][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:298] [C581] parsing 0 bytes
[2018-03-03 00:11:37.505][20][trace][http] external/envoy/source/common/http/http1/codec_impl.cc:315] [C581] parsed 0 bytes
[2018-03-03 00:11:37.506][20][debug][client] external/envoy/source/common/http/codec_client.cc:70] [C581] disconnect. resetting 1 pending requests
[2018-03-03 00:11:37.506][20][debug][client] external/envoy/source/common/http/codec_client.cc:91] [C581] request reset
[2018-03-03 00:11:37.520][20][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:115] [C581] client disconnected

This looks weird because Envoy read 166 bytes and parsed them, but then ended up resetting 1 pending request. I will follow the source around codec_impl later.

I reproduced this in my environment with 0.6.0. I don't think it is related to a connection timeout: in the fortio results the max response time was around 140 ms, which is way lower than 1 s.

In the ingress access log (see the Envoy access log format documentation for the field meanings):

[2018-03-02T01:56:49.871Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 - 0 57 42 38 "10.4.0.1" "istio/fortio-0.7.3-pre" "e539e54b-e239-9f44-a3fa-54f6a6d1c83d" "35.197.53.122" "10.4.0.10:8080"

The response flag is "-", which means the 503 came from the upstream.

In the deployment1 sidecar access log:

[2018-03-02T01:56:49.896Z] "GET /fortio1/fortio/fetch/echosrv2:8080/echo HTTP/1.1" 503 UC 0 57 14 - "10.4.0.1" "istio/fortio-0.7.3-pre" "e539e54b-e239-9f44-a3fa-54f6a6d1c83d" "35.197.53.122" "127.0.0.1:8080"

The response flag is UC, which means "upstream connection termination in addition to 503 response code".

The 503 is coming from the sidecar, and it seems the upstream of the sidecar (echosrv) is terminating the connection.

In the sidecar Envoy stats (I got 3 503s in total):

cluster.in.8080.upstream_cx_destroy_remote_with_active_rq: 3
cluster.in.8080.upstream_cx_destroy_with_active_rq: 3

This means echosrv is terminating the connection while a request is still active. Any idea why this happens?
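For what it's worth, one way an upstream can end up doing this (purely an illustration with a Node server like the ones discussed above, not necessarily what echosrv does): a keep-alive server closes idle connections after a timeout, and if the proxy reuses the connection just as that timeout fires, the in-flight request is reset and surfaces as a 503 UC. A sketch that makes the race easy to hit:

const http = require('http');

// Hypothetical upstream: answers requests and keeps connections alive,
// but closes them after a short idle period.
const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('ok\n');
});

// Node's default keepAliveTimeout is 5000 ms; shrinking it makes the
// close-vs-reuse race much easier to hit under low, bursty traffic --
// the situation counted by upstream_cx_destroy_remote_with_active_rq above.
server.keepAliveTimeout = 100;

server.listen(8080);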

@ldemailly In Istio 0.8 I hit the same error: for a tiny percentage of requests the sidecar Envoy returns 503 (UC), without any rule applied. The log is below:

[2018-07-06T06:11:12.983Z] "POST /api/ HTTP/1.1" 503 UC 16687 57 0 - "-" "Mozilla/5.0 (pc-x86_64-linux-gnu) Siege/4.0.4" "41344284-c914-9db1-a30a-983e74455227" "10.20.10.20:8888" "127.0.0.1:8888"

I saw the discussion here and in the Envoy reference, and it seems the problem has not been solved. Is that right?

The RemoteClose event right after onData will trigger it.

Thanks a lot @ldemailly @lizan, this is quite a finding!

Now I believe this is an issue in Envoy; I filed https://github.com/envoyproxy/envoy/issues/2715.

@PiotrSikora the hope was that, while you have a Pilot-less setup to investigate the mTLS perf issue, you could check whether you also see 503s during your testing (which would rule out Pilot / bad config) and confirm andra's determination.