serving: The 503s are back

In what area(s)?

/area autoscale
/area networking
/area test-and-release

We had issues with 503s earlier and they seemed to be alleviated by upgrading the Istio version. Judging by intermittent builds like this one: https://prow.knative.dev/view/gcs/knative-prow/pr-logs/pull/knative_serving/4726/pull-knative-serving-integration-tests/1150657391163871233 it seems like we have had a regression of sorts in that regard.

Things I already checked that haven’t regressed:

  • The TCP probe of the container is fine.
  • We don’t return http.StatusServiceUnavailable without a body in either the activator or the queue-proxy. The 503 responses in the failed tests have no body, though (a minimal illustration of the difference follows this list).
  • Of course I wasn’t able to reproduce it locally 🙄.
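
For context, here is a minimal sketch (hypothetical handler names and message, not the actual activator or queue-proxy code) of the difference between a 503 we emit ourselves, which always carries a body, and the bodyless 503s the failed tests observe:

```go
package main

import (
	"fmt"
	"net/http"
)

// ourError mimics how our components surface a 503: http.Error always writes
// an explanatory body, so the client sees non-empty response text.
func ourError(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "service unavailable: upstream not ready", http.StatusServiceUnavailable)
}

// bareError is what the failing tests observe instead: a 503 status line with
// an empty body, which suggests something in front of our components
// (e.g. the ingress gateway) is generating the response.
func bareError(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusServiceUnavailable)
}

func main() {
	http.HandleFunc("/ours", ourError)
	http.HandleFunc("/bare", bareError)
	fmt.Println("listening on :8080")
	http.ListenAndServe(":8080", nil)
}
```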

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 25 (25 by maintainers)

Most upvoted comments

Investigation

I enabled Envoy verbose logging and added HTTP tracing on our HTTP client (see the tracing sketch after the list below). I noticed that I couldn’t see any Envoy log entries for the 503s seen client side. Similarly, some HTTP 200s were not visible in the Envoy logs, so I suspected the logs were somehow being dropped. Two options:

  • kail is too slow to capture all the logs
  • Envoy is not logging everything
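
For reference, a minimal sketch of the kind of client-side HTTP tracing mentioned above, using Go’s net/http/httptrace; the specific hooks and the target URL are illustrative, not the exact instrumentation added to our test client:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	start := time.Now()

	// Log a few connection-level events so a 503 can be correlated with
	// whether a fresh connection was dialed and when the first byte arrived.
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			fmt.Printf("%v got conn: reused=%t remote=%v\n",
				time.Since(start), info.Reused, info.Conn.RemoteAddr())
		},
		WroteRequest: func(info httptrace.WroteRequestInfo) {
			fmt.Printf("%v wrote request: err=%v\n", time.Since(start), info.Err)
		},
		GotFirstResponseByte: func() {
			fmt.Printf("%v got first response byte\n", time.Since(start))
		},
	}

	req, err := http.NewRequest("GET", "http://example.com/", nil) // placeholder URL
	if err != nil {
		panic(err)
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d bodyBytes=%d\n", resp.StatusCode, len(body))
}
```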

I randomly noticed that the istio-ingressgateway was overloaded: the HPA target was at 233%/80%. I increased the resource quota (https://github.com/knative/serving/pull/4734/commits/f43ee32e1019aab577c5d2fad4fc74ba82040d9d) and there were no more 503s in the last 6 runs, while it would consistently fail before: https://prow.knative.dev/pr-history/?org=knative&repo=serving&pr=4734 🎉 I started a few more runs to validate that this actually works.
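
For anyone wanting to repeat this check, a rough sketch of the kubectl commands to spot this kind of overload (the namespace, deployment name and labels assume a default Istio install):

```sh
# Current vs. target CPU utilization of the gateway's HPA (e.g. 233%/80%).
kubectl -n istio-system get hpa istio-ingressgateway

# Actual CPU/memory usage of the gateway pods (requires metrics-server).
kubectl -n istio-system top pods -l app=istio-ingressgateway

# CPU/memory requests and limits currently configured on the deployment.
kubectl -n istio-system get deploy istio-ingressgateway \
  -o jsonpath='{.spec.template.spec.containers[*].resources}'
```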

This somewhat explains why, in my PR (probing for Ingress status), I would see the 503s much more consistently: probing adds extra load on the istio-ingressgateway.

Next

  • Improve debuggability - 1: this issue (a constrained resource) should have been detected in 2 minutes, not 2 days (╯°□°)╯︵ ┻━┻, with proper metrics. I’ll work with EngProd to see if we can capture and dump metrics while running E2E tests (a rough sketch of what that could look like follows this list).
  • Improve debuggability - 2: the way we capture logs in E2E is not perfect. We can tweak it to capture more logs.
  • Root-cause why the gateway is overloaded. From my limited understanding of the tests, we never hammer the ingress gateway; it shouldn’t need that many resources.
  • Increase istio-proxy quotas and/or enable HPA: my change should allow us to lift the single-pod restriction we have today.
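
For the metrics idea above, a rough sketch of what could be wired into the E2E runs; port 15090 and the /stats/prometheus path assume a default Istio install, and kubectl top requires metrics-server:

```sh
# Forward the gateway's merged-metrics port locally and dump the Envoy stats.
kubectl -n istio-system port-forward deploy/istio-ingressgateway 15090:15090 &
PF_PID=$!
sleep 2
curl -s localhost:15090/stats/prometheus > gateway-metrics.txt
kill "$PF_PID"

# Snapshot pod-level resource usage alongside the stats dump.
kubectl -n istio-system top pods > gateway-top.txt
```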