serving: The 503s are back
In what area(s)?

/area autoscale
/area networking
/area test-and-release
We had issues with 503s earlier, and they seemed to be relieved by upgrading the Istio version. Judging by intermittent failures like this one: https://prow.knative.dev/view/gcs/knative-prow/pr-logs/pull/knative_serving/4726/pull-knative-serving-integration-tests/1150657391163871233 it seems we have had a regression of sorts in that regard.
Things I already checked that haven’t regressed:
- The TCP probe of the container is fine.
- We don’t return `http.StatusServiceUnavailable` in either activator or queue-proxy without a body. The 503 responses in the failed tests have no body, though (see the sketch after this list).
- Of course I wasn’t able to reproduce it locally 🙄.
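To make the second bullet concrete, here is a minimal sketch of the pattern that was checked: whenever activator or queue-proxy return a 503 themselves, they also write a body. This is illustrative Go only, not the actual Knative code; `backendReady` and the error message are hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// backendReady is a hypothetical stand-in for the real readiness check
// in activator/queue-proxy.
func backendReady() bool { return true }

// handler illustrates the invariant checked above: whenever we return
// http.StatusServiceUnavailable ourselves, we also write an explanatory
// body. A 503 with an empty body therefore points at something in front
// of our proxies (e.g. Envoy) rather than at activator or queue-proxy.
func handler(w http.ResponseWriter, r *http.Request) {
	if !backendReady() {
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprintln(w, "no healthy backends for revision") // hypothetical message
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```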
Investigation
I enabled Envoy verbose logging and added HTTP tracing to our HTTP client (sketched below). I noticed that I couldn’t see any Envoy log entries for the 503s seen client-side. Similarly, some HTTP 200s were not visible in the Envoy logs, so I suspected the logs were somehow being dropped. Two options:
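For reference, the client-side HTTP tracing mentioned above looks roughly like this. It is a minimal sketch using Go's `net/http/httptrace`; the target URL and log messages are placeholders, not the exact instrumentation used in the tests.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httptrace"
)

func main() {
	// Placeholder URL; in the tests this would be the route/ingress under test.
	req, err := http.NewRequest(http.MethodGet, "http://example.com/", nil)
	if err != nil {
		log.Fatal(err)
	}

	// ClientTrace hooks fire at interesting points of the request lifecycle,
	// which helps correlate client-side 503s with (missing) Envoy log entries.
	trace := &httptrace.ClientTrace{
		GetConn: func(hostPort string) { log.Printf("get conn: %s", hostPort) },
		GotConn: func(info httptrace.GotConnInfo) {
			log.Printf("got conn: reused=%v wasIdle=%v", info.Reused, info.WasIdle)
		},
		WroteRequest: func(info httptrace.WroteRequestInfo) {
			log.Printf("wrote request, err=%v", info.Err)
		},
		GotFirstResponseByte: func() { log.Print("got first response byte") },
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Printf("status: %s", resp.Status)
}
```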
I randomly noticed that the istio-ingressgateway was overloaded: its HPA target was at 233%/80%. I increased the resource quota: https://github.com/knative/serving/pull/4734/commits/f43ee32e1019aab577c5d2fad4fc74ba82040d9d

And no more 503s (in the last 6 runs, while it would consistently fail before): https://prow.knative.dev/pr-history/?org=knative&repo=serving&pr=4734 🎉

I started a few more runs to validate that this actually works.
This would also explain why, in my PR (probing for Ingress status), I would see the 503s much more consistently: probing adds some load to istio-ingressgateway.
Next