autoscaler: Could not get a CSINode object for the node
Which component are you using?: cluster-autoscaler (CA)
What version of the component are you using?: 1.20.1
Component version: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.1
What k8s version are you using (kubectl version)?: 1.23.5
What environment is this in?: AWS
Could someone please tell me what this error is about? I’ve found that it sometimes takes ages for the cluster to scale up, and I’m wondering whether this is related somehow:
I0412 08:06:16.062769 1 scheduler_binder.go:775] Could not get a CSINode object for the node "template-node-for-nodes-a.domain.net-7982597919630627426-0": csinode.storage.k8s.io "template-node-for-nodes-a.domain.net-7982597919630627426-0" not found
I0412 08:06:16.062801 1 scheduler_binder.go:801] All bound volumes for Pod "namespace/pod-75b64dff96-99vxn" match with Node "template-node-for-nodes-a.domain.net-7982597919630627426-0"
I0412 08:06:16.062828 1 filter_out_schedulable.go:157] Pod namespace.pod-75b64dff96-99vxn marked as unschedulable can be scheduled on node template-node-for-nodes-a.domain.net-7982597919630627426-0. Ignoring in scale up.
I0412 08:06:16.063127 1 scheduler_binder.go:775] Could not get a CSINode object for the node "template-node-for-nodes-c.domain.net-4246696157256546175-0": csinode.storage.k8s.io "template-node-for-nodes-c.domain.net-4246696157256546175-0" not found
I0412 08:06:16.063143 1 scheduler_binder.go:801] All bound volumes for Pod "namespace/pod-64755c698f-ghcdt" match with Node "template-node-for-nodes-c.domain.net-4246696157256546175-0"
I0412 08:06:16.063166 1 filter_out_schedulable.go:157] Pod namespace.pod-64755c698f-ghcdt marked as unschedulable can be scheduled on node template-node-for-nodes-c.domain.net-4246696157256546175-0. Ignoring in scale up.
The thing is that each node group still has room for new nodes, at least 5 in each.
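A note on what the message itself means: the "Could not get a CSINode object" lines above are logged by the scheduler code the autoscaler runs in simulation, and the node they mention is the synthetic "template-node-for-…" node, which never has a CSINode object, so on its own the message is informational. If you want to check whether your real nodes each have a CSINode object (the kubelet normally creates one once a CSI driver such as aws-ebs-csi-driver registers), here is a minimal sketch using the official kubernetes Python client; it assumes a working kubeconfig and is not taken from this thread.

```python
# Minimal sketch (not from this thread): list real nodes that have no
# matching CSINode object. Assumes the official `kubernetes` Python client
# and a working kubeconfig. The "template-node-for-..." nodes in the log
# only exist inside the autoscaler's simulation and will never show up here.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

nodes = {n.metadata.name for n in client.CoreV1Api().list_node().items}
csinodes = {c.metadata.name for c in client.StorageV1Api().list_csi_node().items}

missing = nodes - csinodes
print("Nodes without a CSINode object:", sorted(missing) or "none")
```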
About this issue
- State: closed
- Created 2 years ago
- Reactions: 26
- Comments: 17 (3 by maintainers)
I had a brand new AWS ASG scaled to 0 and had the same issue at deploy time. It was solved by manually scaling up. Afterwards, the CAS started working as expected.
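If anyone wants to script that one-off nudge instead of doing it in the console, a minimal boto3 sketch might look like this; the ASG name and region are placeholders, not values from this issue.

```python
# Sketch of the "manually scale up once" workaround described above.
# The ASG name and region below are placeholders.
import boto3

asg = boto3.client("autoscaling", region_name="eu-west-1")

# Bring one node up so the group has a real node (with labels and a CSINode
# object) for the autoscaler to build its template from.
asg.set_desired_capacity(
    AutoScalingGroupName="nodes-a.domain.net",
    DesiredCapacity=1,
    HonorCooldown=False,
)
```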
Which version of CA has the fix? I’m still seeing: Could not get a CSINode object for the node "ip-10.xxx.x.xx…ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-xxx.xx-ap-south-1.compute.internal" not found
That way of segregating node pools by zone is far older than the aws-ebs-csi-driver. For as long as I can remember, at least 4 years, I’ve always done it that way, because scaling never worked 100% for multi-zone pools.
@bcouetil what you do in your example is create the node group in only one availability zone. This is the same thing @afirth pointed out in the comment above.
We’re still observing this issue on AWS EKS 1.24 with Cluster Autoscaler 1.26.1.
I think this happens when the pod requests a PVC on AWS (or other providers) that is not available in the AZ of the node. The real scheduler sees that this won’t work, but the CAS “fake scheduler run” doesn’t. After a while CAS marks the node as underutilized, kills it, and scales up again; eventually the scale-up node lands in the right AZ and the pod is scheduled. On providers that support multi-zone storage, this is not a problem.

Solution: make a separate node group for each AZ.

Caveat: scale to/from 0 is broken in default EKS. Workarounds and the tracking issue are at https://github.com/aws/containers-roadmap/issues/608
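On the scale-to/from-0 caveat: one commonly cited workaround (discussed in the linked roadmap issue and in the cluster-autoscaler AWS docs) is to tag each per-AZ ASG with cluster-autoscaler node-template labels, so the synthetic template node carries the zone topology labels the simulated scheduler needs even when the group is at 0. A hedged boto3 sketch follows; the ASG names, zones, and region are placeholders, and whether this applies to a given setup (e.g. EKS managed node groups) is an assumption.

```python
# Sketch (placeholders, not from this thread): tag each per-AZ ASG with
# cluster-autoscaler "node-template" labels so a group scaled to 0 still
# advertises its zone to the autoscaler's simulated scheduler.
import boto3

asg = boto3.client("autoscaling", region_name="eu-west-1")

per_az_groups = {
    "nodes-a.domain.net": "eu-west-1a",  # placeholder ASG -> AZ mapping
    "nodes-c.domain.net": "eu-west-1c",
}

tags = []
for group, zone in per_az_groups.items():
    for key in (
        "k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone",
        "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone",
    ):
        tags.append({
            "ResourceId": group,
            "ResourceType": "auto-scaling-group",
            "Key": key,
            "Value": zone,
            # The autoscaler reads these tags from the ASG itself, so they
            # do not need to propagate to the EC2 instances.
            "PropagateAtLaunch": False,
        })

asg.create_or_update_tags(Tags=tags)
```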
If a CAS update really did fix it, I’m very interested in how. If it’s caused by something else, feel free to chime in here. And feel free to chat with your AWS AM about this: https://github.com/aws/containers-roadmap/issues/608 and 724 have some of the most 👍 reactions of anything in the roadmap and aren’t particularly hard to fix.