autoscaler: Memory leak?
We are running CA 1.15.5 with Kubernetes 1.15.7, and we are seeing memory grow gradually over time. We have the limit set to 1Gi, but the autoscaler reaches it in about a day and then gets OOM-killed.
Here is our config from the deployment:
```yaml
- command:
  - ./cluster-autoscaler
  - --v=4
  - --stderrthreshold=info
  - --cloud-provider=aws
  - --skip-nodes-with-local-storage=false
  - --balance-similar-node-groups
  - --expander=random
  - --nodes=5:20:nodes-us-west-2b.cluster.foo.com
  - --nodes=5:20:nodes-us-west-2c.cluster.foo.com
  - --nodes=5:20:nodes-us-west-2d.cluster.foo.com
  - --nodes=1:18:pgpool-nodes.cluster.foo.com
  - --nodes=2:16:postgres-nodes.cluster.foo.com
  - --nodes=1:4:api-nodes-us-west-2b.cluster.foo.com
  - --nodes=1:4:api-nodes-us-west-2c.cluster.foo.com
  - --nodes=1:4:api-nodes-us-west-2d.cluster.foo.com
  - --nodes=0:5:cicd-nodes-us-west-2b.cluster.foo.com
  - --nodes=0:5:cicd-nodes-us-west-2c.cluster.foo.com
  - --nodes=0:5:cicd-nodes-us-west-2d.cluster.foo.com
  - --nodes=0:5:haproxy-nodes-us-west-2b.cluster.foo.com
  - --nodes=0:5:haproxy-nodes-us-west-2c.cluster.foo.com
  - --nodes=0:5:haproxy-nodes-us-west-2d.cluster.foo.com
  env:
  - name: AWS_REGION
    value: us-west-2
  image: k8s.gcr.io/cluster-autoscaler:v1.15.5
  imagePullPolicy: Always
  name: cluster-autoscaler
  resources:
    limits:
      cpu: 100m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 500Mi
```
Any thoughts?
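For anyone who wants to watch the growth themselves, this is roughly how we have been tracking it (assumes metrics-server is installed; the label selector is a guess, so adjust it to whatever labels your cluster-autoscaler deployment actually uses):

```sh
# Sample the autoscaler container's memory usage once a minute.
watch -n 60 kubectl -n kube-system top pod -l app=cluster-autoscaler --containers
```

The reported working set climbs steadily until it hits the 1Gi limit above.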
We just upgraded from 1.14.x to 1.15.6, and CA was OOMing on startup with the requests and limits we had previously set; we had to increase them significantly to get CA to start at all.
Has anything changed that would greatly increase the memory footprint?
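For reference, the bump itself was a one-liner (the deployment name, namespace, and sizes here are our values, not a recommendation):

```sh
# Raise the container's memory request/limit in place; adjust names and
# sizes to your own deployment.
kubectl -n kube-system set resources deployment cluster-autoscaler \
  -c cluster-autoscaler --requests=memory=1Gi --limits=memory=2Gi
```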
I think this is the same problem described here:
https://github.com/kubernetes/autoscaler/issues/3506
It looks like it is fixed in a newer version of Cluster Autoscaler.
For other users hitting this: I’ve deployed v1.22.1 into a cluster that was previously seeing OOM kills with a memory limit of 300Mi. It’s fixed the problem for us.
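If it helps anyone else, the upgrade is just an image bump (the deployment name and namespace below are assumptions; newer releases are published under the autoscaling/ path, and the CA minor version should match your Kubernetes minor version):

```sh
# Point the existing deployment at a newer Cluster Autoscaler release.
kubectl -n kube-system set image deployment/cluster-autoscaler \
  cluster-autoscaler=k8s.gcr.io/autoscaling/cluster-autoscaler:v1.22.1
```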
Same here, using AWS. There is no way this workload could justify 1Gi of consumption…
There is a leak, and it’s related to the AWS provider. It can be reproduced with multiple ASGs, multiple nodes in each, and by adding and deleting ASGs: nodes joining and leaving contribute to the effect.
Unfortunately, to reproduce and eventually fix this, you have to spend a reasonable amount of money in AWS.
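A rough sketch of what I mean by churning nodes, using one of the ASG names from the original post (names, sizes, and timings are placeholders, and running this costs real money):

```sh
# Repeatedly scale an ASG that CA watches up and down so nodes keep joining
# and leaving the cluster; the leak shows up as CA's memory creeping upward.
# Creating and deleting whole ASGs reportedly makes it worse.
for i in $(seq 1 50); do
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name cicd-nodes-us-west-2b.cluster.foo.com \
    --desired-capacity 5
  sleep 600
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name cicd-nodes-us-west-2b.cluster.foo.com \
    --desired-capacity 0
  sleep 600
done
```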
Well, now after reverting it starts at 500 MB and within 15 seconds climbs to 950 MB before Kubernetes kills it for exceeding its memory limit. HELP!
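If someone who can reproduce this wants to see where the memory is actually going, a heap profile would help. I believe newer CA builds can expose Go’s pprof endpoints on the same port as /metrics (check `cluster-autoscaler --help` for a profiling option in your version); if yours does, something like this works:

```sh
# Forward the autoscaler's HTTP port (--address defaults to :8085) and pull
# a heap profile with Go's pprof tooling.
kubectl -n kube-system port-forward deploy/cluster-autoscaler 8085:8085 &
go tool pprof -top http://localhost:8085/debug/pprof/heap
```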
I’m seeing the same issue on a 6-node cluster, Amazon EKS v1.15, k8s.gcr.io/cluster-autoscaler:v1.15.5. CA hits the memory limit, gets OOM-killed, and is then restarted. This seems to happen on a roughly 7-day cycle.
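For what it’s worth, this is how I confirmed the restarts are OOM kills and checked the cadence (the label selector is a guess for my setup):

```sh
# Show restart count and the last termination reason for the CA pod;
# the reason reads OOMKilled when the memory limit is being hit.
kubectl -n kube-system get pod -l app=cluster-autoscaler -o \
  jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```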