contour: HTTPProxy Status updates slow when there are lots of objects
What steps did you take and what happened: HTTPProxy status takes more than 5 minutes (sometimes more than 10 minutes) to update in a multi-tenant cluster with ~1500 Namespaces, ~1500 HTTPProxy/Ingress objects, ~5000 Services and ~30000 Secrets. Of the Secrets, ~6000 are not of type kubernetes.io/dockercfg or kubernetes.io/service-account-token.
HTTPProxy objects keep the NotReconciled status with the "Waiting for controller" status message for a very long time. Interestingly, the .status.loadBalancer.ingress.hostname field is set almost immediately.
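(For reference, a rough sketch of how one could count the stuck objects while measuring this; it is not something we ran as-is, and it assumes the projectcontour.io/v1 HTTPProxy status fields currentStatus/description and a local kubeconfig.)

```go
// Hypothetical helper, not part of Contour: list HTTPProxy objects whose
// status.currentStatus is still "NotReconciled" across all namespaces.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	httpProxyGVR := schema.GroupVersionResource{
		Group: "projectcontour.io", Version: "v1", Resource: "httpproxies",
	}
	proxies, err := client.Resource(httpProxyGVR).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	stuck := 0
	for _, p := range proxies.Items {
		status, _, _ := unstructured.NestedString(p.Object, "status", "currentStatus")
		if status == "NotReconciled" {
			stuck++
			fmt.Printf("%s/%s still NotReconciled\n", p.GetNamespace(), p.GetName())
		}
	}
	fmt.Printf("%d of %d HTTPProxies waiting for the controller\n", stuck, len(proxies.Items))
}
```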
What did you expect to happen: HTTPProxy status to be updated in less than 3 minutes.
Anything else you would like to add: We were running Contour 1.21.3 in this cluster until a week ago with no issues, so we seem to have hit some sort of threshold. Initially, giving the Contour Pod more resources helped speed up the status updates but did not fix the issue. We switched to Contour 1.23.2, since newer versions generally include performance improvements, but we still see slow HTTPProxy updates.
We briefly tried switching to 1.24.0-rc1, which has several changes that looked like they might help (#4827, #4846, #4912, #4792), but we still saw slow HTTPProxy updates, and the .status.loadBalancer.ingress.hostname field was not being updated on Ingress or HTTPProxy objects. We need to spend more time investigating this, though.
Environment:
- Contour version: 1.23.2
- Kubernetes version (use kubectl version): 1.23
- Kubernetes installer & version: OpenShift 4.10 / Kubernetes 1.23
- Cloud provider or hardware configuration: on-prem, bare-metal
- OS (e.g. from /etc/os-release): RHEL CoreOS 410.84
About this issue
- State: closed
- Created a year ago
- Comments: 21 (21 by maintainers)
It is worth noting that even after fixing the annotations, it required restarting Contour to resolve the issue: there were thousands of pending PUT requests to update the Ingress status, so no updates happened at all. There should probably be some logic to ensure that updates to one object cannot monopolise the entire update queue.
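A minimal sketch of the behaviour I mean, using client-go's generic rate-limiting work queue rather than Contour's actual status updater (the tenant-a/tenant-b keys are made up for illustration): repeated adds of the same key collapse into a single pending item, so one noisy object cannot flood out everyone else's updates.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Keys are deduplicated while pending: adding the same key thousands of
	// times leaves a single entry in the queue instead of thousands of PUTs.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer queue.ShutDown()

	// Simulate a flood of status-update events for one Ingress.
	for i := 0; i < 10000; i++ {
		queue.Add("tenant-a/noisy-ingress")
	}
	queue.Add("tenant-b/quiet-ingress")

	fmt.Println("pending updates:", queue.Len()) // 2, not 10001

	for queue.Len() > 0 {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		fmt.Println("writing status for", key)
		queue.Forget(key) // success: drop any retry backoff for this key
		queue.Done(key)
	}
}
```

Deduplicating by object key like this, combined with per-key backoff, would bound how much queue space a single misbehaving object can consume.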
As the petunias said: “Oh no, not again”.
This was a repeat of https://github.com/projectcontour/contour/issues/4608, with multiple ingress controllers both continuously updating an Ingress object's status.loadBalancer field: the object has no kubernetes.io/ingress.class annotation, so the default ingress controller processes it, but it also has a projectcontour.io/ingress.class annotation, so Contour processes it.
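For anyone auditing for the same misconfiguration, a rough sketch (a hypothetical helper using client-go and a local kubeconfig, not an official Contour tool) that flags Ingress objects carrying the projectcontour.io/ingress.class annotation while setting neither kubernetes.io/ingress.class nor spec.ingressClassName:

```go
// Hypothetical audit script, not part of Contour: flag Ingress objects that
// Contour claims via projectcontour.io/ingress.class but that the default
// ingress controller may also claim because they set neither the
// kubernetes.io/ingress.class annotation nor spec.ingressClassName.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ingresses, err := client.NetworkingV1().Ingresses(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, ing := range ingresses.Items {
		_, hasContourClass := ing.Annotations["projectcontour.io/ingress.class"]
		_, hasLegacyClass := ing.Annotations["kubernetes.io/ingress.class"]
		if hasContourClass && !hasLegacyClass && ing.Spec.IngressClassName == nil {
			fmt.Printf("%s/%s may be claimed by both Contour and the default ingress controller\n",
				ing.Namespace, ing.Name)
		}
	}
}
```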
For anyone who tries to debug anything like this in future, the --kubernetes-debug=6 contour command-line option really helped (as suggested above). I expected the contour_eventhandler_operation_total metric to be really high for Ingress Update, but it does not include status updates (https://github.com/projectcontour/contour/issues/5007 will help with this, and https://github.com/projectcontour/contour/issues/5005 will help rule out DAG issues).

Thank you @sunjayBhatia and @skriss for your help in diagnosing this, and apologies for the repeat issue.
Thanks @skriss. The loadBalancer field updated just fine in my much smaller dev cluster, so it does definitely happen. In 1.23 it updated regardless of the other status update slowness, while on 1.24.0-rc1 it appeared to be slower.