autoscaler: Memory leak?

We are running Cluster Autoscaler 1.15.5 with Kubernetes 1.15.7 and are seeing memory usage grow gradually over time. The memory limit is set to 1Gi; the autoscaler reaches it in about a day and then gets OOM-killed.

[Screenshot: cluster-autoscaler memory usage climbing over time, 2020-04-13]

Here is our config in the deployment:

      - command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --balance-similar-node-groups
        - --expander=random
        - --nodes=5:20:nodes-us-west-2b.cluster.foo.com
        - --nodes=5:20:nodes-us-west-2c.cluster.foo.com
        - --nodes=5:20:nodes-us-west-2d.cluster.foo.com
        - --nodes=1:18:pgpool-nodes.cluster.foo.com
        - --nodes=2:16:postgres-nodes.cluster.foo.com
        - --nodes=1:4:api-nodes-us-west-2b.cluster.foo.com
        - --nodes=1:4:api-nodes-us-west-2c.cluster.foo.com
        - --nodes=1:4:api-nodes-us-west-2d.cluster.foo.com
        - --nodes=0:5:cicd-nodes-us-west-2b.cluster.foo.com
        - --nodes=0:5:cicd-nodes-us-west-2c.cluster.foo.com
        - --nodes=0:5:cicd-nodes-us-west-2d.cluster.foo.com
        - --nodes=0:5:haproxy-nodes-us-west-2b.cluster.foo.com
        - --nodes=0:5:haproxy-nodes-us-west-2c.cluster.foo.com
        - --nodes=0:5:haproxy-nodes-us-west-2d.cluster.foo.com
        env:
        - name: AWS_REGION
          value: us-west-2
        image: k8s.gcr.io/cluster-autoscaler:v1.15.5
        imagePullPolicy: Always
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 500Mi

Any thoughts?

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 17
  • Comments: 36 (16 by maintainers)

Most upvoted comments

We just upgraded from 1.14.x to 1.15.6, and CA was OOMing on startup with the requests and limits we had previously set; we had to increase them significantly to get CA to start up.

Has anything changed to greatly increase the memory footprint?
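
For anyone hitting the same startup OOM, here is a minimal sketch of that kind of resource bump against the container spec from the original report. The 2Gi/1Gi values are illustrative guesses, not recommendations; size them to your own cluster.

        resources:
          limits:
            cpu: 100m        # unchanged from the original report
            memory: 2Gi      # illustrative bump from the original 1Gi limit
          requests:
            cpu: 100m
            memory: 1Gi      # illustrative bump from the original 500Mi request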

I think there is an issue describing the same problem here:

https://github.com/kubernetes/autoscaler/issues/3506

It looks like that is fixed in a newer version of Cluster Autoscaler.

A note for other users: I’ve deployed v1.22.1 into a cluster which was previously seeing OOM kills with a memory limit of 300Mi. It’s fixed the problem for us.
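
For reference, the upgrade amounts to bumping the image tag in the deployment; a sketch against the container spec shown at the top of this issue. Newer CA releases are published under the autoscaling/ image path, so verify the exact registry path and pick the tag that matches your Kubernetes minor version.

        # Replaces image: k8s.gcr.io/cluster-autoscaler:v1.15.5 from the original report
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.22.1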

Same here, using AWS. There is no way this could justify 1G of consumption…

There is a leak, and it’s related to the AWS provider. It can be reproduced with multiple ASGs, multiple nodes in each, and adding/deleting ASGs: nodes coming in and going out contribute to the effect.

Unfortunately, to reproduce and eventually fix this… you have to spend a fair amount of money in AWS.

Well, now after reverting, it starts at 500MB and within 15 seconds climbs to 950MB before Kubernetes kills it for exceeding its memory limit. HELP!


I’m seeing the same issue on a 6-node cluster: Amazon EKS v1.15, k8s.gcr.io/cluster-autoscaler:v1.15.5. CA runs into its memory limit, gets OOM-killed, and is restarted. This appears to happen on a roughly 7-day cycle.