contour: HTTPProxy Status updates slow when there are lots of objects

What steps did you take and what happened: HTTPProxy status takes more than 5 minutes (sometimes more than 10 minutes) to update in a multi-tenant cluster with ~1500 Namespaces, ~1500 HTTPProxy/Ingress objects, ~5000 Services and ~30000 Secrets. Of the Secrets, ~6000 are not of type kubernetes.io/dockercfg or kubernetes.io/service-account-token.

HTTPProxy objects keep the NotReconciled status with the "Waiting for controller" status message for a very long time. Interestingly, the .status.loadBalancer.ingress.hostname field is set almost immediately.
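For illustration, a stuck object looked roughly like the following (the names, namespace and hostnames are placeholders, not the real objects):

```yaml
# Illustrative sketch only; metadata and hostnames are placeholders.
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: example            # placeholder
  namespace: tenant-a      # placeholder
spec:
  virtualhost:
    fqdn: example.apps.internal     # placeholder
  routes:
    - services:
        - name: example-svc
          port: 80
status:
  currentStatus: NotReconciled          # stays like this for many minutes
  description: Waiting for controller
  loadBalancer:
    ingress:
      - hostname: router.apps.internal  # set almost immediately
```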

What did you expect to happen: HTTPProxy status to be updated in less than 3 minutes.

Anything else you would like to add: We were running Contour 1.21.3 in this cluster until a week ago with no issues, so we seem to have hit some sort of threshold. Initially, giving the Contour Pod more resources helped speed up the status updates but did not fix the issue. We switched to Contour 1.23.2, since newer versions generally include performance improvements, but we still see slow HTTPProxy updates.

We briefly tried switching to 1.24.0-rc1, which has several changes that looked like they might help (#4827, #4846, #4912, #4792), but we still saw slow HTTPProxy updates, and the .status.loadBalancer.ingress.hostname field was not being updated on Ingress or HTTPProxy objects. We need to spend more time investigating this, though.

Environment:

  • Contour version: 1.23.2
  • Kubernetes installer & version: OpenShift 4.10 / Kubernetes 1.23
  • Cloud provider or hardware configuration: on-prem, bare-metal
  • OS (e.g. from /etc/os-release): RHEL CoreOS 410.84

About this issue

  • State: closed
  • Created a year ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

It is worth noting that even after fixing the annotations, restarting Contour was required to resolve the issue: there were thousands of pending PUT requests to update the Ingress status, so no other updates happened at all. There should probably be some logic to ensure that updates to one object cannot monopolise the entire update queue.

As the petunias said: “Oh no, not again”.

This was a repeat of https://github.com/projectcontour/contour/issues/4608, with two ingress controllers continuously updating an Ingress object's status.loadBalancer field: the object has no kubernetes.io/ingress.class annotation, so the default ingress controller processes it, but it also has a projectcontour.io/ingress.class annotation, so Contour processes it as well (a sketch of the problematic annotations is below).
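To make the failure mode concrete, the offending Ingress objects looked roughly like this; the names, backend and the class value "contour" are placeholders for whatever your install uses, and the fixed version assumes Contour should be the controller that owns the object:

```yaml
# Illustrative sketch only; metadata, backend and class values are placeholders.
# Problematic: no kubernetes.io/ingress.class, so the cluster's default ingress
# controller claims the object, while projectcontour.io/ingress.class also makes
# Contour claim it; both then keep rewriting status.loadBalancer.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example
  namespace: tenant-a
  annotations:
    projectcontour.io/ingress.class: contour
spec:
  defaultBackend:
    service:
      name: example-svc
      port:
        number: 80
---
# One way to fix it (assuming Contour should own this Ingress): set the class
# under both conventions so only one controller processes the object.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example
  namespace: tenant-a
  annotations:
    kubernetes.io/ingress.class: contour
    projectcontour.io/ingress.class: contour
spec:
  defaultBackend:
    service:
      name: example-svc
      port:
        number: 80
```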

For anyone debugging anything like this in the future, the --kubernetes-debug=6 contour command-line option really helped (as suggested above). I expected the contour_eventhandler_operation_total metric to be really high for Ingress updates, but it does not include status updates (https://github.com/projectcontour/contour/issues/5007 will help with this, and https://github.com/projectcontour/contour/issues/5005 will help rule out DAG issues).
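If it helps anyone else hitting this, the flag goes on the contour serve container; a minimal sketch of where it fits in a Deployment is below (the image tag, labels and remaining args are assumptions based on the upstream example manifests and will differ per install):

```yaml
# Sketch only: enabling verbose Kubernetes API client logging for contour serve.
# Image tag, labels and the other args are assumptions from the upstream example
# deployment; adjust them to match your install.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: contour
  namespace: projectcontour
spec:
  replicas: 2
  selector:
    matchLabels:
      app: contour
  template:
    metadata:
      labels:
        app: contour
    spec:
      containers:
        - name: contour
          image: ghcr.io/projectcontour/contour:v1.23.2
          command: ["contour"]
          args:
            - serve
            - --incluster
            - --kubernetes-debug=6   # log Kubernetes API client activity, including status update requests
```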

Thank you @sunjayBhatia and @skriss for your help in diagnosing this and apologies for the repeat issue.


I would definitely expect some of the changes coming in 1.24 to help here, though it's hard to say how much. I'll take a look at the potential issue with hostname not updating; I'm definitely seeing ingress.ip being updated in my test cluster, FWIW.

Thanks @skriss. The loadBalancer field updated just fine in my much smaller dev cluster, so the update does happen. In 1.23 it updated regardless of the other status update slowness, while on 1.24.0-rc1 it appeared to be slower.