istio: 503 "upstream connect error or disconnect/reset before headers" in 1.1 with low traffic
Describe the bug
Many services in our environment are experiencing a handful (~1% or so) of 503 errors with the response “upstream connect error or disconnect/reset before headers”. I have rolled Istio out to two environments and they both experience similar issues.
These environments do not have any Istio gateways; it’s just service-to-service communication. However, I did replicate the issue using curl through the k8s ingress (outside the mesh) to a mesh service, so I assume it’s isolated to the destination’s sidecar, which is what returns the 503.
Digging through some older issues, this appears to be related to a keepalive timeout, which fits my experience: it’s usually the first curl after a long idle period that fails. In 1.1, a default retry was added that I believe was supposed to handle this case, but it doesn’t appear to be kicking in for me.
Is there anything that needs to be enabled for the default retry to take effect? Policy checks, for example. I also don’t have Virtual Services or Destination Rules added yet, but I was told those aren’t needed for it either. I did do a quick test by adding one and it didn’t help.
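For illustration, here is a minimal sketch of the kind of explicit retry policy I was experimenting with in a VirtualService; the my-service host/namespace and the chosen retryOn conditions are placeholders of mine, not the exact resource from my test:
# Hypothetical VirtualService adding explicit retries for my-service.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service-retries   # placeholder name
spec:
  hosts:
  - my-service.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: my-service.default.svc.cluster.local
    retries:
      attempts: 3
      perTryTimeout: 2s
      # retry conditions documented by Envoy's router filter
      retryOn: connect-failure,refused-stream,gateway-error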
Here’s a curl example (through k8s ingress) that I tried this morning after being idle all night:
~ curl -v https://my-service.stg.example.com/v1/available-inventory/201904091406 https://my-service.stg.example.com/v1/available-inventory/201904091406 5m
* Trying 10.20.23.98...
* TCP_NODELAY set
* Connected to my-service.stg.example.com (10.20.23.98) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
* CAfile: /etc/ssl/cert.pem
CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use h2
* Server certificate:
* subject: C=US; ST=District of Columbia; L=Washington; O=MyCompany, LLC; OU=Engineering; CN=*.stg.example.com
* start date: Jan 17 00:00:00 2017 GMT
* expire date: Mar 25 12:00:00 2020 GMT
* subjectAltName: host "my-service.stg.example.com" matched cert's "*.stg.example.com"
* issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
* SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fc42400d400)
> GET /v1/available-inventory/201904091406 HTTP/2
> Host: my-service.stg.example.com
> User-Agent: curl/7.54.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 503
< server: nginx/1.15.8
< date: Wed, 10 Apr 2019 13:36:44 GMT
< content-type: text/plain
< content-length: 57
< x-envoy-decorator-operation: my-service.default.svc.cluster.local:80/*
< strict-transport-security: max-age=15724800; includeSubDomains
<
* Connection #0 to host my-service.stg.example.com left intact
upstream connect error or disconnect/reset before headers
* Found bundle for host my-service.stg.example.com: 0x7fc423c1d2e0 [can multiplex]
* Re-using existing connection! (#0) with host my-service.stg.example.com
* Connected to my-service.stg.example.com (10.20.23.98) port 443 (#0)
* Using Stream ID: 3 (easy handle 0x7fc42400d400)
> GET /v1/available-inventory/201904091406 HTTP/2
> Host: my-service.stg.example.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/2 200
< server: nginx/1.15.8
< date: Wed, 10 Apr 2019 13:36:44 GMT
< content-type: application/json; charset=utf-8
< content-length: 18
< x-powered-by: Express
< etag: W/"12-0TGR6gSl8vAg7WE2Lc4UXbiP8BI"
< x-envoy-upstream-service-time: 4
< x-envoy-decorator-operation: my-service.default.svc.cluster.local:80/*
< strict-transport-security: max-age=15724800; includeSubDomains
<
* Connection #0 to host my-service.stg.example.com left intact
{"inventory":null}
Expected behavior
503s with “upstream connect error or disconnect/reset before headers” are retried by default.
Steps to reproduce the bug
Not completely sure, but some key information:
- No Virtual Services or Destination Rules
- 560 pods in 200 services
- Mostly Node.js services (which I understand have keepalive-related defaults that come up in some other Istio issues)
Version
Istio 1.1.2, K8s 1.11.8
Installation
Helm template / kubectl apply; nothing of note in the Helm value overrides: just setting a minimum of 2 replicas for the Istio pods, rewriting HTTP probes, enabling mTLS (permissive), and enabling Grafana/Kiali.
Environment
AWS / Kops
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 15
- Comments: 35 (30 by maintainers)
Considering you’re able to consistently repro your 503, I would personally turn on sidecar debug logs on both your source and destination by port-forwarding 15000 and running
curl -XPOST http://127.0.0.1:15000/logging?level=debug
then tailing the logs (-c istio-proxy) for both sidecars, and then making your request. The logs are super verbose but should give some insight.
Thanks for the tips. I wouldn’t say I can consistently repro it. It’s common enough to annoy users but just infrequent enough to be difficult to catch. I’ll give it a shot though.
I’m noticing now that, no matter how hard I try, when the cluster has virtually no traffic at night it’s very difficult to replicate, while during the day I could replicate it by hitting a /ping endpoint that just responds with a 200 OK.
Something I’m just now realizing is that the services that have the problem are all services that can take quite a while to respond. I’m wondering: after a request is kicked off that takes 40s to respond (and Istio times out at the 15s default), are there any scenarios where subsequent requests would hit this issue? It was relatively common during business hours with the /ping endpoint, likely while other users were hitting the longer-running endpoints. Just to be clear, these long-running requests are not a problem when they don’t go through Istio. The 15s timeout on longer requests is a problem of its own, but that one is easy enough to solve.
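For what it’s worth, raising that per-route timeout is just a VirtualService setting; a minimal sketch, assuming a hypothetical slow-service host and a 60s budget rather than anything from my actual environment:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: slow-service-timeout   # hypothetical name
spec:
  hosts:
  - slow-service.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: slow-service.default.svc.cluster.local
    # raise the per-route timeout above the service's worst-case response time
    timeout: 60s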
Did you add this yourself? For these to work you’ll need to pass the x-envoy-retry-on header on your egress requests: https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/router_filter#x-envoy-retry-on
In most cases, these 503s seem to happen when an upstream service closes a connection but Envoy still thinks the connection is reusable and puts it back in the connection pool. Idle timeout was implemented to address this: you set it in your cluster definition (a DestinationRule in the case of Istio) to a value lower than what the server typically uses to close connections, so the connection pool removes the connection before the upstream server closes it. This of course comes with the overhead of having to re-establish a new connection for future requests if a connection expires after being idle for T seconds, which can be especially noticeable for TLS connections; so there’s a trade-off here.
To see what’s causing the 503s in your specific case you’ll have to run a tcpdump and look at the captured TCP traffic, but based on the symptoms you described (it occurring after being idle for some time) it’s likely a keepalive issue and not a connect-failure case.
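A minimal sketch of what that could look like as a DestinationRule, assuming an Istio version that already exposes the HTTP connection-pool idleTimeout field (it was still landing around the 1.1.x releases discussed here) and a hypothetical my-service host:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-idle-timeout   # hypothetical name
spec:
  host: my-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        # keep this below the upstream server's own keepalive timeout
        idleTimeout: 30s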
I think we are experiencing this with Istio 1.4.4. If we don’t set maxRequestsPerConnection to 1, we get 503s all the time. This feature still makes sense for us, so I’d like to keep it on, but with that many errors our hands are tied.
It looks like others are seeing the same thing. Yeah, let it run for a few days and see how it goes.
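For context, that setting lives in the same connection-pool block of a DestinationRule; a minimal sketch with a hypothetical host name (setting it to 1 effectively disables connection reuse, which is why it masks the issue at the cost of a new upstream connection per request):
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-no-reuse   # hypothetical name
spec:
  host: my-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        # force a fresh upstream connection for every request
        maxRequestsPerConnection: 1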
Continuing the remainder of this topic in https://github.com/envoyproxy/envoy/issues/6190 for now then. Thanks for all the help @arianmotamedi. It also looks like 1.1.3 is tagged, so we should have a release with the timeout setting pretty soon!