serving: The 503s are back

In what area(s)?

/area autoscale
/area networking
/area test-and-release

We had issues with 503s earlier and they seemed to be alleviated by upgrading the Istio version. Judging by intermittent builds like this one: https://prow.knative.dev/view/gcs/knative-prow/pr-logs/pull/knative_serving/4726/pull-knative-serving-integration-tests/1150657391163871233 it seems like we have had a regression of sorts in that regard.

Things I already checked that haven’t regressed:

  • The TCP probe of the container is fine.
  • We don’t return http.StatusServiceUnavailable without a body in either the activator or the queue-proxy. The 503 responses in the failed tests have no body, though (a minimal illustration of the difference follows this list).
  • Of course I wasn’t able to reproduce it locally 🙄.
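
For context, here is a minimal sketch (hypothetical handler names and message, not the actual activator or queue-proxy code) of the difference between a 503 we emit ourselves, which always carries a body, and the bodyless 503s the failed tests observe:

```go
package main

import (
	"fmt"
	"net/http"
)

// ourError mimics how our components surface a 503: http.Error always writes
// an explanatory body, so the client sees non-empty response text.
func ourError(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "service unavailable: upstream not ready", http.StatusServiceUnavailable)
}

// bareError is what the failing tests observe instead: a 503 status line with
// an empty body, which suggests something in front of our components
// (e.g. the ingress gateway) is generating the response.
func bareError(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusServiceUnavailable)
}

func main() {
	http.HandleFunc("/ours", ourError)
	http.HandleFunc("/bare", bareError)
	fmt.Println("listening on :8080")
	http.ListenAndServe(":8080", nil)
}
```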

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 25 (25 by maintainers)

Most upvoted comments

Investigation

I enabled Envoy verbose logging and added HTTP tracing on our HTTP client (see the tracing sketch after the list below). I noticed that I couldn’t see any Envoy log entries for the 503s seen client side. Similarly, some HTTP 200s were not visible in the Envoy logs, so I suspected the logs were somehow being dropped. Two options:

  • kail is too slow to capture all the logs
  • Envoy is not logging everything
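
For reference, a minimal sketch of the kind of client-side HTTP tracing mentioned above, using Go’s net/http/httptrace; the specific hooks and the target URL are illustrative, not the exact instrumentation added to our test client:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	start := time.Now()

	// Log a few connection-level events so a 503 can be correlated with
	// whether a fresh connection was dialed and when the first byte arrived.
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			fmt.Printf("%v got conn: reused=%t remote=%v\n",
				time.Since(start), info.Reused, info.Conn.RemoteAddr())
		},
		WroteRequest: func(info httptrace.WroteRequestInfo) {
			fmt.Printf("%v wrote request: err=%v\n", time.Since(start), info.Err)
		},
		GotFirstResponseByte: func() {
			fmt.Printf("%v got first response byte\n", time.Since(start))
		},
	}

	req, err := http.NewRequest("GET", "http://example.com/", nil) // placeholder URL
	if err != nil {
		panic(err)
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d bodyBytes=%d\n", resp.StatusCode, len(body))
}
```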

I randomly noticed that the istio-ingressgateway was overloaded: the HPA target was at 233%/80%. I increased the resource quota (https://github.com/knative/serving/pull/4734/commits/f43ee32e1019aab577c5d2fad4fc74ba82040d9d) and there were no more 503s in the last 6 runs, while it would consistently fail before: https://prow.knative.dev/pr-history/?org=knative&repo=serving&pr=4734 🎉 I started a few more runs to validate that this actually works.
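
For anyone wanting to repeat this check, a rough sketch of the kubectl commands to spot this kind of overload (the namespace, deployment name and labels assume a default Istio install):

```sh
# Current vs. target CPU utilization of the gateway's HPA (e.g. 233%/80%).
kubectl -n istio-system get hpa istio-ingressgateway

# Actual CPU/memory usage of the gateway pods (requires metrics-server).
kubectl -n istio-system top pods -l app=istio-ingressgateway

# CPU/memory requests and limits currently configured on the deployment.
kubectl -n istio-system get deploy istio-ingressgateway \
  -o jsonpath='{.spec.template.spec.containers[*].resources}'
```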

This somewhat explains why, in my PR (probing for Ingress status), I would see the 503s much more consistently: probing adds extra load on the istio-ingressgateway.

Next

  • Improve debuggability - 1: this issue (a constrained resource) should have been detected in 2 minutes, not 2 days (╯°□°)╯︵ ┻━┻, with proper metrics. I’ll work with EngProd to see if we can capture and dump metrics while running E2E tests (a rough sketch of what that could look like follows this list).
  • Improve debuggability - 2: the way we capture logs in E2E is not perfect. We can tweak it to capture more logs.
  • Root-cause why the gateway is overloaded. From my limited understanding of the tests, we never hammer the ingress gateway; it shouldn’t need that many resources.
  • Increase istio-proxy quotas and/or enable HPA: my change should allow us to lift the single-pod restriction we have today.
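
For the metrics idea above, a rough sketch of what could be wired into the E2E runs; port 15090 and the /stats/prometheus path assume a default Istio install, and kubectl top requires metrics-server:

```sh
# Forward the gateway's merged-metrics port locally and dump the Envoy stats.
kubectl -n istio-system port-forward deploy/istio-ingressgateway 15090:15090 &
PF_PID=$!
sleep 2
curl -s localhost:15090/stats/prometheus > gateway-metrics.txt
kill "$PF_PID"

# Snapshot pod-level resource usage alongside the stats dump.
kubectl -n istio-system top pods > gateway-top.txt
```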