gatekeeper: gatekeeper-controller-manager is leaking memory

What steps did you take and what happened:

On all our GKE clusters, gatekeeper-controller-manager gets OOM killed roughly once every 24-30h. As the screenshot below shows, memory keeps increasing over time, which indicates to me that there is a memory leak somewhere.

[screenshot: gatekeeper-controller-manager memory increasing over time]

I just enabled pprof; I will add heap profiles as soon as I have them.
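(For reference: --enable-pprof / --pprof-port=6060 expose the standard Go pprof endpoints over HTTP, so heap profiles can be pulled from /debug/pprof/heap. The snippet below is just a minimal sketch of the usual way such an endpoint is wired up in Go, not gatekeeper's exact code.)

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Heap and goroutine profiles are then available at
        // /debug/pprof/heap and /debug/pprof/goroutine on this port.
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }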

What did you expect to happen: Memory usage to be stable over time.

Anything else you would like to add: This was already happening on the previous version we were running, openpolicyagent/gatekeeper:v3.10.0-beta.0

Environment:

  • Gatekeeper version: openpolicyagent/gatekeeper:v3.11.0
  • Kubernetes version: (use kubectl version): v1.23.14-gke.1800
  • 3 replicas of gatekeeper-controller-manager:
      --port=8443
      --health-addr=:9090
      --prometheus-port=8888
      --logtostderr
      --log-denies=true
      --emit-admission-events=true
      --log-level=DEBUG
      --exempt-namespace=gatekeeper-system
      --operation=webhook
      --enable-external-data=true
      --enable-generator-resource-expansion=false
      --log-mutations=true
      --mutation-annotations=false
      --disable-cert-rotation=false
      --max-serving-threads=-1
      --tls-min-version=1.3
      --metrics-backend=prometheus
      --enable-tls-healthcheck
      --operation=mutation-webhook
      --disable-opa-builtin={http.send}
      --exempt-namespace-prefix=kube
      --enable-pprof
      --pprof-port=6060

Most upvoted comments

  • netstat showed a few connections that were still open from localhost to localhost, and some had a non-empty receive queue. From this I knew it was gatekeeper calling itself.
  • From the heap profile, I knew it was not only a connection that was kept open, but also that the connection was using TLS.
  • I also knew that the number of goroutines was growing at a steady rate, so it was something happening regularly, every few seconds.

I quickly glanced at the flags and saw --enable-tls-healthcheck: it matched all the criteria. Moreover, the default helm install has this flag disabled; we enabled it. From there I was already convinced, even before looking at the code.
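To make that concrete, here is a minimal sketch (not gatekeeper's actual code; the URL, port, and interval are made up) of the kind of pattern these symptoms point at: a periodic TLS self-check whose response body is never drained or closed, so each tick pins a TLS connection and the transport goroutines reading from it:

    package main

    import (
        "crypto/tls"
        "log"
        "net/http"
        "time"
    )

    func main() {
        // The webhook serves a self-signed cert, so skip verification in this sketch.
        client := &http.Client{
            Transport: &http.Transport{
                TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
            },
        }

        for range time.Tick(10 * time.Second) {
            resp, err := client.Get("https://localhost:8443/")
            if err != nil {
                log.Printf("tls healthcheck failed: %v", err)
                continue
            }
            log.Printf("tls healthcheck status: %s", resp.Status)
            // Bug: resp.Body is never drained or closed, so the TLS connection
            // and the goroutines servicing it are never released.
        }
    }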

Hey @dethi,

We’re still trying to chase down this leak, but could use a little more to go on.

We noticed a lot of HTTP calls in the heap dump. One theory is that the event emitter could be leaking connections. To test this, could you try disabling admission events (set the flag --emit-admission-events=false) and see if it still leaks?

We’re also curious about the admission webhook’s usage. Is this a relatively static cluster, where resources are not changed very frequently? Or is it more like a test cluster where resources are getting created and destroyed frequently (particularly namespaces)? If you have metrics set up, it would be helpful if you could share the values/graphs for gatekeeper_validation_request_count and gatekeeper_validation_request_duration_seconds (the cumulative distribution).

A couple other quick questions:

  • Are you using mutations? If not, please try disabling the mutation operation (remove the flag --operation=mutation-webhook).
  • Are you using replicated data, i.e. setting sync values on the Config resource? If so, which GVKs are you syncing?

Thanks 😃

That was indeed the origin of the leak. Submitted a fix!

I think I found the origin of the leak using kubectl debug and netstat: the webhook TLS healthcheck. Reading the code, it does seem that we don’t ever read and close the response body.

I’m gonna validate by first disabling it (--enable-tls-healthcheck=false) and then submitting a patch.
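For reference, the usual Go fix for this kind of leak is to drain and close the response body so the transport can release (or reuse) the connection. A minimal sketch with a hypothetical function name and URL, not the actual patch:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // checkWebhookTLS is a hypothetical stand-in for the healthcheck described
    // above; the point is only the response-body handling.
    func checkWebhookTLS(client *http.Client, url string) error {
        resp, err := client.Get(url)
        if err != nil {
            return err
        }
        // Drain and close the body so the connection is returned to the
        // transport's idle pool instead of accumulating goroutines per check.
        defer resp.Body.Close()
        if _, err := io.Copy(io.Discard, resp.Body); err != nil {
            return err
        }
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("unexpected status: %s", resp.Status)
        }
        return nil
    }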

openpolicyagent/gatekeeper:v3.10.0-beta.0 was the first version that we deployed.