istio: Mysterious 503's on long running requests
Bug description
We are seeing intermittent 503s on long-running requests. We have an endpoint that takes around 4 minutes to respond, and the request can fail at any point within that window; we've seen failures as early as 6 s in. Sometimes it completes without issue.
The flow looks like this:
[ingress-nginx-internal]
-> [search-one]
It always fails in the same way:

- `DC` flag from the destination (`search-one`) envoy
- `UC` flag from the source (`ingress-nginx-internal`) envoy

```
sort_desc(sum(increase(istio_requests_total{destination_app="search-one", response_code!~"2.*"}[1h])) by (reporter, response_flags, source_app, destination_app) > 0)
```

```
{destination_app="search-one",reporter="destination",response_flags="DC",source_app="ingress-nginx-internal"} | 3.0252100840336134
{destination_app="search-one",reporter="source",response_flags="UC",source_app="ingress-nginx-internal"} | 3.0252100840336134
```
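For readers unfamiliar with these flags: they are Envoy response flags, and the pairing is telling — the destination-side proxy blames its downstream while the source-side proxy blames its upstream, pointing at the hop between the two proxies. A minimal lookup sketch (flag meanings taken from Envoy's access-log documentation; the helper name is mine):

```python
# Minimal decoder for the Envoy response flags seen in the query above.
# Descriptions per Envoy's access-log documentation.
RESPONSE_FLAGS = {
    "DC": "Downstream connection termination",
    "UC": "Upstream connection termination",
    "UF": "Upstream connection failure",
    "UT": "Upstream request timeout",
}

def describe(flag: str) -> str:
    """Return a human-readable meaning for an Envoy response flag."""
    return RESPONSE_FLAGS.get(flag, "Unknown flag: " + flag)

# The paired flags from the metrics: the destination-side envoy reports DC
# while the source-side envoy reports UC for the same failed requests.
print(describe("DC"))  # Downstream connection termination
print(describe("UC"))  # Upstream connection termination
```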
We’re at a bit of a loss at the moment, and wondering if you’ve seen anything like this before.
If we curl the application locally on the pod (bypassing envoy), we do not see a failure.
Affected product area (please put an X in all that apply)
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
No 503s on these long-running requests.
Steps to reproduce the bug
Very difficult to give an exact reproduction, as it's inconsistent. More than happy to screen-share and live-debug.
Version (include the output of `istioctl version --remote` and `kubectl version`)
1.3.3
How was Istio installed? Helm
Environment where bug was observed (cloud vendor, OS, etc) GKE 1.14
Additionally, please consider attaching a cluster state archive by attaching the dump file to this issue.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 24 (19 by maintainers)
There was no need to re-open the issue… People from Istio are looking here. You're starting again from zero on the new issue…
There was a previous issue with `keepaliveMaxServerConnectionAge` that was triggering a listener re-creation in 1.3.1 and 1.3.2, which was supposedly resolved in 1.3.3. This is almost the same issue, but here it's only the `virtualInbound` listener that seems to change and trigger a reload. I'm not able to test right now, but to confirm this we need to get the hash of the `virtualInbound` listener sent by Pilot. The change does not happen on every `keepaliveMaxServerConnectionAge` cycle, but sometimes after 2 or 3. What's changing in Pilot every once in a while?

Hmm, we don't have `keepaliveMaxServerConnectionAge` set. I set it explicitly to 24h and still see this, so I think we're seeing the same symptom but with different root causes.
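For context, a sketch of how that setting is typically applied — assuming the Istio 1.3.x Helm chart layout, where it lives under the `pilot` values (verify against your chart version before using):

```yaml
# Hypothetical Helm values fragment (istio 1.3.x chart layout assumed).
# keepaliveMaxServerConnectionAge bounds how long a gRPC connection from
# Envoy to Pilot may live before Pilot cycles it; the discussion above is
# about listener churn triggered when these connections are recycled.
pilot:
  keepaliveMaxServerConnectionAge: 24h
```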