contour: HTTPProxy Status incorrect

What steps did you take and what happened:

Create an HTTPProxy, Secret, Service and Deployment - the HTTPProxy status is Valid, as expected. A few days later, delete the HTTPProxy, Secret, Service and Deployment, then recreate only the HTTPProxy, Service and Deployment - the HTTPProxy status goes to Valid again, which is NOT expected because the Secret object no longer exists. It appears that the previous Secret state is cached and not updated.
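For context, the HTTPProxy involved looks roughly like the sketch below. The burdik namespace, explorer service and explorer-tls secret names match the logs further down; the fqdn and port are placeholders:

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: explorer
  namespace: burdik
spec:
  virtualhost:
    fqdn: explorer.example.com    # placeholder hostname
    tls:
      secretName: explorer-tls    # deleted and NOT recreated in the failing scenario
  routes:
    - services:
        - name: explorer          # backing Service/Deployment, which are recreated
          port: 80                # placeholder; the Service port is named "http" in the logs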

We have also seen numerous cases of the opposite problem where an HTTPProxy shows as invalid complaining about a missing Secret when it does actually exist.

This does not always happen, but in one relatively busy cluster we see these issues consistently after a few days. Restarting Contour is the only way to resolve them.
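For reference, the restart workaround is just a rollout restart of the Contour deployment (deployment name and namespace here are the defaults and may differ per install):

$ kubectl -n projectcontour rollout restart deployment/contour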

What did you expect to happen:

HTTPProxy Status should accurately reflect the existence of the Secret object.
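In other words, once the Secret has been deleted and not recreated, a check along these lines should show the HTTPProxy as invalid, but it stays valid (names taken from the logs below):

$ kubectl -n burdik get secret explorer-tls    # NotFound - the Secret was never recreated
$ kubectl -n burdik get httpproxy explorer     # STATUS still reports valid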

Anything else you would like to add:

Contour logs show an XDS error for the Secret where the status is incorrect:

$ kubectl logs contour-74cbf6dd55-4857n | grep explorer-tls
time="2022-07-04T16:12:09Z" level=error msg="stream terminated" connection=8498 context=xds error="rpc error: code = Canceled desc = context canceled" node_id=envoy-5c7d955d96-5thq6 node_version=v1.22.2 resource_names="[burdik/explorer-tls/88eb5c978f]" response_nonce=661317 type_url=type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret version_info=661317

Not sure if it is related, but running Contour with --debug I see a continuous stream of messages from the leader:

time="2022-07-04T18:23:03Z" level=debug msg="added ServiceCluster \"burdik/explorer/http\" from DAG" context=endpointstranslator
time="2022-07-04T18:23:03Z" level=debug msg="added ServiceCluster \"burdik/shell/http\" from DAG" context=endpointstranslator
time="2022-07-04T18:23:03Z" level=debug msg="added ServiceCluster \"burdik/sleeper/http\" from DAG" context=endpointstranslator
time="2022-07-04T18:23:03Z" level=debug msg="dropping service cluster with duplicate name \"burdik/explorer/http\"" context=endpointstranslator
time="2022-07-04T18:23:03Z" level=debug msg="dropping service cluster with duplicate name \"burdik/shell/http\"" context=endpointstranslator
time="2022-07-04T18:23:03Z" level=debug msg="dropping service cluster with duplicate name \"burdik/sleeper/http\"" context=endpointstranslator
time="2022-07-04T18:23:04Z" level=debug msg="received a status update" context=StatusUpdateHandler name=shell namespace=burdik
time="2022-07-04T18:23:04Z" level=debug msg="update was a no-op" context=StatusUpdateHandler name=shell namespace=burdik
time="2022-07-04T18:23:11Z" level=debug msg="received a status update" context=StatusUpdateHandler name=explorer namespace=burdik
time="2022-07-04T18:23:11Z" level=debug msg="update was a no-op" context=StatusUpdateHandler name=explorer namespace=burdik
time="2022-07-04T18:23:12Z" level=debug msg="received a status update" context=StatusUpdateHandler name=shell namespace=burdik
time="2022-07-04T18:23:12Z" level=debug msg="update was a no-op" context=StatusUpdateHandler name=shell namespace=burdik
time="2022-07-04T18:23:19Z" level=debug msg="received a status update" context=StatusUpdateHandler name=sleeper namespace=burdik
time="2022-07-04T18:23:19Z" level=debug msg="update was a no-op" context=StatusUpdateHandler name=sleeper namespace=burdik
time="2022-07-04T18:23:27Z" level=debug msg="added ServiceCluster \"burdik/explorer/http\" from DAG" context=endpointstranslator
time="2022-07-04T18:23:27Z" level=debug msg="added ServiceCluster \"burdik/shell/http\" from DAG" context=endpointstranslator
time="2022-07-04T18:23:27Z" level=debug msg="added ServiceCluster \"burdik/sleeper/http\" from DAG" context=endpointstranslator
time="2022-07-04T18:23:27Z" level=debug msg="dropping service cluster with duplicate name \"burdik/explorer/http\"" context=endpointstranslator
time="2022-07-04T18:23:27Z" level=debug msg="dropping service cluster with duplicate name \"burdik/shell/http\"" context=endpointstranslator
time="2022-07-04T18:23:27Z" level=debug msg="dropping service cluster with duplicate name \"burdik/sleeper/http\"" context=endpointstranslator

Environment:

  • Contour version: v1.21.1
  • Envoy version: v1.22.2
  • Kubernetes version (from kubectl version): v1.21.8+ed4d8fd
  • Kubernetes installer & version: OpenShift 4.8.
  • Cloud provider or hardware configuration: on-prem bare metal
  • OS (e.g. from /etc/os-release): Red Hat Enterprise Linux CoreOS 48.84.202204202010-0

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

Fixing the annotations so that the two ingress controllers stop fighting over the Ingress status fields (which was generating a constant stream of updates on the status update channel) resolved the issue.
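For anyone hitting the same symptom, the shape of the fix is roughly the following: make sure each Ingress is claimed by exactly one controller. The class name and flag value here are illustrative, not the exact config from this cluster:

# on each Ingress, pick a single controller via the class annotation
metadata:
  annotations:
    kubernetes.io/ingress.class: contour

# and run Contour so it only reconciles that class
contour serve --ingress-class-name=contour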

In summary, no problem with Contour. Thank you to @tsaarni, @youngnick, @skriss and others for looking into this.

Thanks for merging @skriss. Initial testing using the image from main indicates the issue is resolved, but we’ll wait for several more hours of testing before declaring victory.

It seems like there’s something weird going on with Secret resources falling out of reconciliation and staying there. That’s the only thing I can think of that would explain this. It’s likely that the changes we made to more aggressively drop Secrets from the cache have something to do with it, but we’ll need to review the code pretty closely in the absence of a reliable reproduction.

That’s a long and indirect way of saying that one of us is going to have to go up to our eyeballs in that Secret handling code and see if we can figure out why some Secrets are being marked as definitely not relevant even across Kubernetes update events.