karpenter-provider-aws: Karpenter does not respect volume-attach-limit set for EBS volumes.
Version
Karpenter Version: v0.27.0
Kubernetes Version: v1.24.10
Expected Behavior
Hello. We run Karpenter on EKS and limit the number of attachable EBS volumes by setting --volume-attach-limit on the ebs-csi-node. I would expect Karpenter to create a new node once the existing nodes have hit the limit of already-attached EBS volumes.
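For context, the limit is configured roughly as follows; this is a minimal sketch assuming the driver is installed with the aws-ebs-csi-driver Helm chart (the value name is an assumption and may differ between chart versions) and ends up as --volume-attach-limit on the node plugin:

# values.yaml for the aws-ebs-csi-driver Helm chart (assumed value name, check your chart version)
node:
  volumeAttachLimit: 5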
Actual Behavior
After the attachable-EBS-volume limit is hit on all nodes, the Pods are stuck in the Pending state. No new nodes are created. All Pods show the same event:
0/5 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 3 node(s) exceed max volume count. preemption: 0/5 nodes are available: 2 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.
Steps to Reproduce the Problem
On a test cluster, set --volume-attach-limit=5 on the ebs-csi-node. With 3 nodes available, run the following deployment with 20 replicas. I expect to get at least 1 additional node to run the entire workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
  namespace: default
spec:
  replicas: 20
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
        - image: ealen/echo-server:latest
          imagePullPolicy: IfNotPresent
          name: echoserver
          ports:
            - name: http
              containerPort: 8080
          env:
            - name: PORT
              value: '8080'
          resources:
            requests:
              memory: 64Mi
              cpu: 10m
            limits:
              memory: 128Mi
              cpu: 40m
          securityContext:
            runAsNonRoot: true
            runAsUser: 101
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: false
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
          volumeMounts:
            - mountPath: "/scratch"
              name: scratch-volume
      securityContext:
        fsGroup: 101
      volumes:
        - name: scratch-volume
          ephemeral:
            volumeClaimTemplate:
              metadata:
                labels:
                  type: my-test-volume
              spec:
                accessModes: [ "ReadWriteOnce" ]
                resources:
                  requests:
                    storage: 3Gi
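To reproduce, apply the manifest and check for replicas stuck in Pending once the three existing nodes are full (the filename here is assumed):

# apply the deployment above and list replicas stuck in Pending
kubectl apply -f echoserver-deployment.yaml
kubectl get pods -n default -l app=echoserver --field-selector=status.phase=Pending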
Here you can see that the limit enforced by the EBS driver does work:
kubectl get nodes -o json | jq '.items[] | {"nodeName": .metadata.name, "volumesInUse": .status.volumesInUse | length, "volumesAttached": .status.volumesAttached | length }'
{
  "nodeName": "ip-xx-xx-xx-xx.eu-central-1.compute.internal",
  "volumesInUse": 5,
  "volumesAttached": 5
}
{
  "nodeName": "ip-xx-xx-xx-xx.eu-central-1.compute.internal",
  "volumesInUse": 5,
  "volumesAttached": 5
}
{
  "nodeName": "ip-xx-xx-xx-xx.eu-central-1.compute.internal",
  "volumesInUse": 5,
  "volumesAttached": 5
}
The limit is also reflected on the CSINode object:
kubectl get csinode ip-xx-xx-xx-xx.eu-central-1.compute.internal -o yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  ...
spec:
  drivers:
    - allocatable:
        count: 5
      name: ebs.csi.aws.com
      nodeID: i-xxxxxxxxxxxx
      topologyKeys:
        - topology.ebs.csi.aws.com/zone
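A quick way to check the advertised allocatable count on every node at once, as a sketch using jq (driver name as above):

kubectl get csinode -o json \
  | jq '.items[] | {node: .metadata.name, allocatable: (.spec.drivers[] | select(.name == "ebs.csi.aws.com") | .allocatable.count)}'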
Resource Specs and Logs
There is literally nothing in the Karpenter controller logs after the deployment was applied.
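For reference, the logs can be checked with something like the following; this assumes a default Helm installation in the karpenter namespace, where the controller container name may differ by version:

kubectl -n karpenter logs deployment/karpenter -c controller --since=1h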
Provisioner spec:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true
  kubeletConfiguration:
    clusterDNS:
      - xxxxxx
    maxPods: 110
  limits:
    resources:
      cpu: 1k
  providerRef:
    name: private-node
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - spot
        - on-demand
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
    - key: karpenter.k8s.aws/instance-hypervisor
      operator: In
      values:
        - nitro
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values:
        - '3'
    - key: karpenter.k8s.aws/instance-cpu
      operator: Lt
      values:
        - '129'
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values:
        - c
        - m
        - r
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values:
        - '2'
  startupTaints:
    - effect: NoExecute
      key: node.cilium.io/agent-not-ready
      value: 'true'
  ttlSecondsUntilExpired: 604800
  weight: 95
Community Note
- Please vote on this issue by adding a đź‘Ť reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave “+1” or “me too” comments; they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
If there are no logs, have you tried restarting the Karpenter pods? Also, could you enable debug-level logging if you haven't already? It's difficult to debug this issue or figure out what's going on without the logs.
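A sketch of both suggestions, assuming a default Helm installation in the karpenter namespace (the config-logging ConfigMap and its zap-logger-config key are how Karpenter charts of this era expose the log level; names may differ in other versions):

# restart the controller pods
kubectl -n karpenter rollout restart deployment karpenter
# raise the log level: set the "level" field inside data.zap-logger-config to "debug"
kubectl -n karpenter edit configmap config-logging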
I'm not able to repro the issue when I deploy a similar configuration on my cluster running the same version of Karpenter. I did a scale-up with 100 StatefulSets of 1 replica each, generating 100 PVCs with a volume limit of 5, and Karpenter scaled me up to 20 nodes.
@jonathan-innis Cool. So we finally nailed down the problem. It did indeed help to set the storageClassName. Looking forward to having the fix generally available. Many thanks. 👍🏻
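For reference, a minimal sketch of that workaround: set storageClassName explicitly on the ephemeral volumeClaimTemplate instead of relying on the default StorageClass (gp3 is an assumed class name; use whatever StorageClass exists in the cluster):

volumes:
  - name: scratch-volume
    ephemeral:
      volumeClaimTemplate:
        spec:
          # an explicit name sidesteps the default-StorageClass discovery gap
          storageClassName: gp3
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 3Gi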
@sdomme Yep, you're right. This problem is specific to volumeClaimTemplates that appear in the ephemeral volume section of the pod spec. It looks like we aren't discovering the default StorageClass name when the default StorageClass is used. Can you try to specify the storageClassName as a workaround for now and see if that fixes your issue?

@jonathan-innis Please find the attachments requested.
final-output.zip
Can you share the full -o yaml output from all of the above? Also, getting the full -o yaml output from the nodes that Karpenter scheduled for these PVs would be good as well. You can attach them as files in your response if they get too long.

Can you share the events from the pods that aren't scheduling? Ideally, I want to see whether Karpenter thinks that the pods should schedule on your current set of nodes.
This looks to be a potential duplicate of https://github.com/aws/karpenter-core/issues/260. Can you confirm whether you might be hitting the edge case described there?