istio: Mysterious 503's on long running requests

Bug description We are seeing intermittent 503s on long-running requests. We have an endpoint that takes around 4 minutes to respond, and it can fail at any point within that window — we’ve seen failures as early as 6 seconds in. Sometimes it completes without issue.

The flow looks like this:

[ingress-nginx-internal] -> [search-one]

It always fails in the same way:

  • DC flag from the destination search-one envoy
  • UC flag from the source ingress-nginx-internal envoy
```
sort_desc(
  sum(increase(istio_requests_total{destination_app="search-one", response_code!~"2.*"}[1h]))
    by (reporter, response_flags, source_app, destination_app)
  > 0
)
```

```
{destination_app="search-one",reporter="destination",response_flags="DC",source_app="ingress-nginx-internal"} | 3.0252100840336134
{destination_app="search-one",reporter="source",response_flags="UC",source_app="ingress-nginx-internal"}      | 3.0252100840336134
```
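For reference, those two flags have documented meanings in Envoy's access logs; a minimal lookup sketch in Python, with descriptions paraphrased from Envoy's response-flag documentation (the `explain` helper is just for illustration):

```python
# Envoy access-log response flags seen in the metrics above
# (descriptions paraphrased from Envoy's documentation).
RESPONSE_FLAGS = {
    "DC": "Downstream connection termination",
    "UC": "Upstream connection termination (in addition to a 503 response code)",
}

def explain(flag: str) -> str:
    """Return a human-readable description of an Envoy response flag."""
    return RESPONSE_FLAGS.get(flag, f"unknown flag: {flag}")

for f in ("DC", "UC"):
    print(f"{f}: {explain(f)}")
```

Read together, the pair suggests the connection between the two sidecars is being torn down mid-request: the destination-side proxy sees its downstream connection drop (DC) while the source-side proxy sees its upstream drop (UC) — each proxy blames the other side.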

We’re at a bit of a loss at the moment, and wondering if you’ve seen anything like this before.

If we curl the application locally on the pod (bypassing envoy), we do not see a failure.

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior No 503s on these long-running requests.

Steps to reproduce the bug It’s difficult to give exact reproduction steps as the failure is inconsistent. More than happy to screen-share and live-debug.

Version (include the output of istioctl version --remote and kubectl version) 1.3.3

How was Istio installed? Helm

Environment where bug was observed (cloud vendor, OS, etc) GKE 1.14

Additionally, please consider attaching a cluster state archive (dump file) to this issue.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 24 (19 by maintainers)

Most upvoted comments

There was no need to re-open a new issue… The Istio folks are watching this one. You’re starting again from zero on the new issue…

There was a previous issue where keepaliveMaxServerConnectionAge triggered a listener re-creation in 1.3.1 and 1.3.2, which was supposedly resolved in 1.3.3.

This is almost the same issue, but here only the virtualInbound listener seems to change and trigger a reload. I’m not able to test right now, but to confirm this we need the hash of the virtualInbound listener sent by Pilot. The change does not happen on every keepaliveMaxServerConnectionAge interval, but sometimes only after 2 or 3. What is changing in Pilot every once in a while?
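Not the commenter's exact method, but one way to check whether the virtualInbound listener actually changes is to hash its config across successive Envoy config dumps. A sketch, assuming you have saved `config_dump` JSON from the sidecar admin endpoint — field names vary across Envoy versions, and the sample structure below is illustrative only:

```python
import hashlib
import json

def listener_hash(config_dump: dict, name: str = "virtualInbound") -> str:
    """Return a stable hash of the named listener's config, or raise if absent."""
    for cfg in config_dump.get("configs", []):
        for wrapper in cfg.get("dynamic_listeners", []):
            listener = wrapper.get("active_state", {}).get("listener", {})
            if listener.get("name") == name:
                # Canonicalize so the hash only changes when the config does.
                canonical = json.dumps(listener, sort_keys=True)
                return hashlib.sha256(canonical.encode()).hexdigest()
    raise KeyError(f"listener {name!r} not found")

# Illustrative fragment; real dumps come from the Envoy admin API, e.g.
# `kubectl exec <pod> -c istio-proxy -- curl -s localhost:15000/config_dump`.
sample = {
    "configs": [
        {
            "dynamic_listeners": [
                {"active_state": {"listener": {"name": "virtualInbound", "address": {}}}}
            ]
        }
    ]
}

h1 = listener_hash(sample)
h2 = listener_hash(sample)
print(h1 == h2)  # identical configs hash identically
```

Polling this on an interval and logging the hash would show whether the listener is re-created every keepaliveMaxServerConnectionAge or only on some cycles.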

Hmmm, we don’t have keepaliveMaxServerConnectionAge set. I set it explicitly to 24h and still see this, so I think we’re seeing the same symptom but a different root cause.
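For anyone else checking this knob: in the 1.3-era Helm charts it is exposed as a Pilot value. The key location below is an assumption — verify against your chart version before using it:

```yaml
# values.yaml override for the istio Helm chart (1.3-era layout; the exact
# key path is an assumption here -- check your chart's values)
pilot:
  keepaliveMaxServerConnectionAge: 24h   # value used in the comment above
```

Apply with `helm upgrade` and your usual values files, then confirm the setting landed by inspecting the pilot deployment's args.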