gatekeeper: gatekeeper-controller-manager is leaking memory
What steps did you take and what happened:
On all our GKE clusters, gatekeeper-controller-manager gets OOM-killed once every ~24-30h. You can see in the screenshot below that memory keeps increasing over time, which indicates to me that there is a memory leak somewhere.
I just enabled pprof; I will add heap profiles as soon as I have them.
What did you expect to happen: Memory usage to be stable over time.
Anything else you would like to add:
This was already happening on the previous version we were running, openpolicyagent/gatekeeper:v3.10.0-beta.0
Environment:
- Gatekeeper version:
openpolicyagent/gatekeeper:v3.11.0 - Kubernetes version: (use
kubectl version):v1.23.14-gke.1800 - 3 replicas of
gatekeeper-controller-manager:
--port=8443
--health-addr=:9090
--prometheus-port=8888
--logtostderr
--log-denies=true
--emit-admission-events=true
--log-level=DEBUG
--exempt-namespace=gatekeeper-system
--operation=webhook
--enable-external-data=true
--enable-generator-resource-expansion=false
--log-mutations=true
--mutation-annotations=false
--disable-cert-rotation=false
--max-serving-threads=-1
--tls-min-version=1.3
--metrics-backend=prometheus
--enable-tls-healthcheck
--operation=mutation-webhook
--disable-opa-builtin={http.send}
--exempt-namespace-prefix=kube
--enable-pprof
--pprof-port=6060
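With --enable-pprof and --pprof-port=6060 set as above, a heap profile can be pulled from the standard net/http/pprof endpoint once the port is reachable (for example via kubectl port-forward to the pod). A minimal sketch; the output file name is an arbitrary choice:

```go
package main

import (
	"io"
	"net/http"
	"os"
)

// Pull a heap profile from the standard net/http/pprof endpoint exposed by
// --enable-pprof / --pprof-port=6060, assuming the port has been forwarded
// locally (e.g. `kubectl port-forward <pod> 6060:6060`).
func main() {
	resp, err := http.Get("http://localhost:6060/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Save the profile to disk for later inspection.
	out, err := os.Create("heap.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```

The saved profile can then be inspected with `go tool pprof heap.pprof`.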
About this issue
- State: closed
- Created a year ago
- Comments: 33 (33 by maintainers)
Commits related to this issue
- webhook: fix memory leak in the TLS healthcheck - The resp.Body was never closed, thus causing one connection to be leaked for each execution. - Creating a new transport based on the default tran... — committed to dethi/gatekeeper by dethi a year ago
- fix: memory leak in the webhook TLS healthcheck - The resp.Body was never closed, thus causing one connection to be leaked for each execution. - Creating a new transport based on the default tran... — committed to dethi/gatekeeper by dethi a year ago
- fix: memory leak in the webhook TLS healthcheck - The resp.Body was never closed, thus causing one connection to be leaked for each execution. - Creating a new transport based on the default tran... — committed to sozercan/gatekeeper by dethi a year ago
I quickly glanced at the flags and saw --enable-tls-healthcheck: it matches all the criteria. Moreover, the default Helm install has this flag disabled; we enabled it. From there I was already convinced, even before looking at the code.
Hey @dethi,
We’re still trying to chase down this leak, but could use a little more to go on.
We noticed a lot of HTTP calls in the heap dump. One theory is that the event emitter could be leaking connections. To this end, could you try disabling admission events (set the flag --emit-admission-events=false) and see if it still leaks?
We’re also curious about the admission webhook’s usage. Is this a relatively static cluster, where resources are not changed very frequently? Or is it more like a test cluster where resources are getting created and destroyed frequently (particularly namespaces)? If you have metrics set up, it would be helpful if you could share the values/graphs for gatekeeper_validation_request_count, and also gatekeeper_validation_request_duration_seconds (the cumulative distribution).
A couple other quick questions: are you using mutation (you have --operation=mutation-webhook set)? Do you have any sync values on the Config resource? If so, which GVKs are you syncing?
Thanks 😃
That was indeed the origin of the leak. Submitted a fix!
I think I found the origin of the leak using kubectl debug and netstat: the webhook TLS healthcheck. Reading the code, it does seem that we don’t ever read and close the body.
I’m gonna validate by first disabling it (--enable-tls-healthcheck=false), and then submitting a patch.
openpolicyagent/gatekeeper:v3.10.0-beta.0 was the first version that we deployed.
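For illustration, a minimal Go sketch of the leak pattern described above and a fix along the lines of the commits listed earlier. This is a hedged reconstruction, not gatekeeper’s actual code; the URL and the TLS config are placeholders:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

// checkTLS sketches the healthcheck pattern at issue. The bug: if the
// response body is never drained and closed, its TCP connection stays
// pinned, so a periodic check leaks one connection per execution
// (visible with netstat, as described above).
func checkTLS(url string) error {
	// Per the commit messages, derive the transport from the default one
	// rather than building a bare Transport.
	tr := http.DefaultTransport.(*http.Transport).Clone()
	// Placeholder TLS config for the sketch; a real healthcheck would
	// verify the webhook's serving certificate instead.
	tr.TLSClientConfig = &tls.Config{InsecureSkipVerify: true}
	client := &http.Client{Transport: tr}

	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	// The fix: always drain and close the body so the connection can be
	// reused or torn down instead of leaking.
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return err
	}
	fmt.Println("healthcheck status:", resp.Status)
	return nil
}

func main() {
	// Hypothetical local webhook address matching --port=8443.
	if err := checkTLS("https://localhost:8443/"); err != nil {
		fmt.Println("healthcheck failed:", err)
	}
}
```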