gatekeeper: gatekeeper-controller-manager is leaking memory

What steps did you take and what happened:

On all our GKE clusters, gatekeeper-controller-manager gets OOM killed roughly once every 24-30h. As the screenshot below shows, memory keeps increasing over time, which indicates to me that there is a memory leak somewhere.

[screenshot: gatekeeper-controller-manager memory increasing over time]

I just enabled pprof; I will add heap profiles as soon as I have them.
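(For reference: --enable-pprof / --pprof-port=6060 expose the standard Go pprof endpoints over HTTP, so heap profiles can be pulled from /debug/pprof/heap. The snippet below is just a minimal sketch of the usual way such an endpoint is wired up in Go, not gatekeeper's exact code.)

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Heap and goroutine profiles are then available at
        // /debug/pprof/heap and /debug/pprof/goroutine on this port.
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }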

What did you expect to happen: Memory usage to be stable over time.

Anything else you would like to add: This was already happening on the previous version we were running, openpolicyagent/gatekeeper:v3.10.0-beta.0

Environment:

  • Gatekeeper version: openpolicyagent/gatekeeper:v3.11.0
  • Kubernetes version: (use kubectl version): v1.23.14-gke.1800
  • 3 replicas of gatekeeper-controller-manager:
      --port=8443
      --health-addr=:9090
      --prometheus-port=8888
      --logtostderr
      --log-denies=true
      --emit-admission-events=true
      --log-level=DEBUG
      --exempt-namespace=gatekeeper-system
      --operation=webhook
      --enable-external-data=true
      --enable-generator-resource-expansion=false
      --log-mutations=true
      --mutation-annotations=false
      --disable-cert-rotation=false
      --max-serving-threads=-1
      --tls-min-version=1.3
      --metrics-backend=prometheus
      --enable-tls-healthcheck
      --operation=mutation-webhook
      --disable-opa-builtin={http.send}
      --exempt-namespace-prefix=kube
      --enable-pprof
      --pprof-port=6060

Most upvoted comments

  • netstat showed a few connections that were still open from localhost to localhost, and some had a non-empty receive queue. From this I knew it was gatekeeper calling itself.
  • From the heap profile, I knew it was not only a connection that was kept open, but also that the connection was using TLS.
  • I also knew that the number of goroutines was growing at a steady rate, so it was something happening regularly, every few seconds.

I quickly glanced at the flags and saw --enable-tls-healthcheck: it matched all the criteria. Moreover, the default helm install has this flag disabled; we enabled it. From there I was already convinced, even before looking at the code.
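To make that concrete, here is a minimal sketch (not gatekeeper's actual code; the URL, port, and interval are made up) of the kind of pattern these symptoms point at: a periodic TLS self-check whose response body is never drained or closed, so each tick pins a TLS connection and the transport goroutines reading from it:

    package main

    import (
        "crypto/tls"
        "log"
        "net/http"
        "time"
    )

    func main() {
        // The webhook serves a self-signed cert, so skip verification in this sketch.
        client := &http.Client{
            Transport: &http.Transport{
                TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
            },
        }

        for range time.Tick(10 * time.Second) {
            resp, err := client.Get("https://localhost:8443/")
            if err != nil {
                log.Printf("tls healthcheck failed: %v", err)
                continue
            }
            log.Printf("tls healthcheck status: %s", resp.Status)
            // Bug: resp.Body is never drained or closed, so the TLS connection
            // and the goroutines servicing it are never released.
        }
    }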

Hey @dethi,

We’re still trying to chase down this leak, but could use a little more to go on.

We noticed a lot of HTTP calls in the heap dump. One theory is that the event emitter could be leaking connections. To test this, could you try disabling admission events (set the flag --emit-admission-events=false) and see if it still leaks?

We’re also curious about the admission webhook’s usage. Is this a relatively static cluster, where resources are not changed very frequently? Or is it more like a test cluster where resources are getting created and destroyed frequently (particularly namespaces)? If you have metrics set up, it would be helpful if you could share the values/graphs for gatekeeper_validation_request_count and gatekeeper_validation_request_duration_seconds (the cumulative distribution).

A couple other quick questions:

  • Are you using mutations? If not, please try disabling the mutation operation (remove the flag --operation=mutation-webhook).
  • Are you using replicated data, i.e. setting sync values on the Config resource? If so, which GVKs are you syncing?

Thanks 😃

That was indeed the origin of the leak. Submitted a fix!

I think I found the origin of the leak using kubectl debug and netstat: the webhook TLS healthcheck. Reading the code, it does seem that we don’t ever read and close the response body.

I’m gonna validate by first disabling it (--enable-tls-healthcheck=false) and then submitting a patch.
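For reference, the usual Go fix for this kind of leak is to drain and close the response body so the transport can release (or reuse) the connection. A minimal sketch with a hypothetical function name and URL, not the actual patch:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // checkWebhookTLS is a hypothetical stand-in for the healthcheck described
    // above; the point is only the response-body handling.
    func checkWebhookTLS(client *http.Client, url string) error {
        resp, err := client.Get(url)
        if err != nil {
            return err
        }
        // Drain and close the body so the connection is returned to the
        // transport's idle pool instead of accumulating goroutines per check.
        defer resp.Body.Close()
        if _, err := io.Copy(io.Discard, resp.Body); err != nil {
            return err
        }
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("unexpected status: %s", resp.Status)
        }
        return nil
    }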

openpolicyagent/gatekeeper:v3.10.0-beta.0 was the first version that we deployed.