karpenter-provider-aws: Karpenter is not aware of the Custom Networking VPC CNI pod limit per node
Version
Karpenter: public.ecr.aws/karpenter/controller:v0.13.2@sha256:af463b2ab0a9b7b1fdf0991ee733dd8bcf5eabf80907f69ceddda28556aead31
Kubernetes: Server Version: v1.21.14-eks-18ef993
Expected Behavior
Karpenter is expected not to exceed the maximum number of pods that can be scheduled on a node.
Actual Behavior
More pods get scheduled onto a node than are supported per the VPC resource controller limits: https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/aws/vpc/limits.go#L509
It's possible that this is because we run two CIDRs for EKS: a separate CIDR for nodes and another one for workloads/pods.
Steps to Reproduce the Problem
- Set up a dual CIDR range in your EKS cluster (a rough sketch of the custom networking pieces follows below): https://aws.amazon.com/premiumsupport/knowledge-center/eks-multiple-cidr-ranges/
- Deploy Karpenter.
- Deploy and scale a sample service; this should bring up a node with more pods scheduled onto it than it can actually support.
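For reference, the custom networking setup boils down to the following. This is a minimal sketch with placeholder subnet and security group IDs, not the exact configuration used in this cluster:

```bash
# Point the VPC CNI at per-AZ ENIConfigs instead of the node's subnet
kubectl set env daemonset aws-node -n kube-system \
  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true \
  ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone

# One ENIConfig per availability zone, pointing pod ENIs at the secondary-CIDR subnet
cat <<'EOF' | kubectl apply -f -
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1c
spec:
  subnet: subnet-0123456789abcdef0   # placeholder: secondary-CIDR subnet in us-east-1c
  securityGroups:
    - sg-0123456789abcdef0           # placeholder: pod security group
EOF
```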
Resource Specs and Logs
No relevant logs are observed in Karpenter.
```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v1
  labels:
    app: reviews
    version: v1
spec:
  replicas: 100
  selector:
    matchLabels:
      app: reviews
      version: v1
  template:
    metadata:
      labels:
        app: reviews
        version: v1
    spec:
      serviceAccountName: bookinfo-reviews
      containers:
        - name: reviews
          image: docker.io/istio/examples-bookinfo-reviews-v1:1.16.4
          imagePullPolicy: IfNotPresent
          env:
            - name: LOG_DIR
              value: "/tmp/logs"
          ports:
            - containerPort: 9080
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: wlp-output
              mountPath: /opt/ibm/wlp/output
          securityContext:
            runAsUser: 1000
      volumes:
        - name: wlp-output
          emptyDir: {}
        - name: tmp
          emptyDir: {}
```
```
Warning FailedCreatePodSandBox 3m53s (x1089 over 23m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "a2a5a5df3f2010b2e03c0f45ae2129fcfa5b3965d2f5107558a4c7dcefb450b9" network for pod "reviews-v3-7886dd86b9-kk46x": networkPlugin cni failed to set up pod "reviews-v3-7886dd86b9-kk46x_default" network: add cmd: failed to assign an IP address to container
```
Provisioner:
```yaml
spec:
  kubeletConfiguration:
    containerRuntime: dockerd
  labels:
    billing: blah
  limits: {}
  providerRef:
    name: al2
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - on-demand
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 2592000
status:
  resources:
    attachable-volumes-aws-ebs: "64"
    cpu: "10"
    ephemeral-storage: 484417496Ki
    memory: 32820628Ki
    pods: "126"
```
Node Information:
```yaml
labels:
  beta.kubernetes.io/arch: amd64
  beta.kubernetes.io/instance-type: t3a.small
  beta.kubernetes.io/os: linux
  failure-domain.beta.kubernetes.io/region: us-east-1
  failure-domain.beta.kubernetes.io/zone: us-east-1c
  karpenter.k8s.aws/instance-cpu: "2"
  karpenter.k8s.aws/instance-family: t3a
  karpenter.k8s.aws/instance-hypervisor: nitro
  karpenter.k8s.aws/instance-memory: "2048"
  karpenter.k8s.aws/instance-pods: "8"
```
About this issue
- State: closed
- Created 2 years ago
- Reactions: 12
- Comments: 50 (18 by maintainers)
@bwagner5 The issue we see (which appears to be the original report on this issue) is simply that Karpenter miscomputes eniLimitedPods because it does not account for the fact that one of the ENIs is unavailable for pods, since it is attached to the worker subnet. This causes it to overestimate how many pods a node can support based on ENI limits, and we end up in a situation where Karpenter keeps scheduling pods onto a node that has exhausted its ENIs while the CNI fails to attach new ENIs to support them. For instance, on an m5.large Karpenter computes the pod count assuming a maximum of 3 usable ENIs, when in fact only 2 are available because the first one is used for the worker subnet rather than the pod subnet.
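For context, this is just the standard ENI-limited pod formula with the primary ENI excluded; the following is a worked example for m5.large (3 ENIs, 10 IPv4 addresses per ENI), not output from Karpenter itself:

```bash
# ENI-limited pods = ENIs * (IPv4 addresses per ENI - 1) + 2
echo $(( 3 * (10 - 1) + 2 ))   # 29 -> what Karpenter assumes for m5.large
echo $(( 2 * (10 - 1) + 2 ))   # 20 -> actually usable when the primary ENI stays on the worker subnet
```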
With what I had proposed above, we'd allow setting how many of these ENIs are reserved for non-pod subnets (defaulting to 0 – maybe as an option here) and incorporate that into the eniLimitedPods calculation. Maybe there's more to it than this, but that was my initial read. I am not sure how the kubelet config would matter in this scenario since we are not setting max-pods there. If this seems viable, my team would be happy to provide a PR with the change.
Looking at the in-flight work that's intended to address this, it seems like the solution being proposed by Karpenter is to explicitly define every instance type via a CR and specify max-pods on it. This seems undesirable to us since we'd have to explicitly configure every single instance type we'd want to use.
IP prefix allocation is something we’re investigating (we have some internal blockers on this we are working through) but it struck us as odd and inconvenient that the AWS EKS best practice with regard to custom networking wasn’t supported by default here and that we’d have to use such a workaround.
My team is hitting this issue as well. CNI custom networking is fairly common and a recommended practice in EKS (https://aws.github.io/aws-eks-best-practices/reliability/docs/networkmanagement/#run-worker-nodes-and-pods-in-different-subnets). This is blocking Karpenter adoption for us.
We've also looked into using IP prefixes but have some substantial obstacles to implementing this.
Would it be viable to do something simple here, such as adding a toggle or setting that lets you specify the number of ENIs reserved for the node subnet (i.e. something like https://github.com/aws/karpenter/issues/2273#issuecomment-1211292190)? The value could default to zero and be overridden by those who need it, which would (significantly) unblock Karpenter adoption. We'd be happy to submit an MR for this. This seems preferable to having to provide complete custom mappings of max-pods by instance type, if that's what's on the roadmap.
https://github.com/aws/karpenter/pull/3516 should address this and will be released in v0.28.0.
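If that lands as a reserved-ENIs count in Karpenter's global settings (the exact key name should be confirmed against the v0.28.0 release notes; treat the value below as an assumption), configuring it would look roughly like this:

```bash
# Assumed setting name: reserve one ENI per node for the worker subnet so it is
# excluded from the pod-capacity calculation; verify against the v0.28.0 docs
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --set settings.aws.reservedENIs=1
```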
Have you found a workaround for this issue yet?
FYI, this might not be a simple ask. If your cluster consumes most of the IPs in a subnet, you can run into issues enabling prefix delegation, because AWS must have a contiguous (defragmented) block of IP space to allocate a prefix. You'll then get failures launching nodes unless you completely drain out subnets. There is no built-in way to see this fragmentation, it is poorly documented by AWS, and it was a hard blocker for us enabling this in our existing large environments.
I do this dynamically for all instance_types in all provisioners:
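This isn't necessarily what the commenter's script does, but one way to derive the value dynamically per instance type is to pull the ENI limits from the EC2 API and apply the standard formula with the primary ENI excluded (assumes the AWS CLI is configured):

```bash
#!/bin/bash
# Custom-networking max-pods for an instance type: (ENIs - 1) * (IPs per ENI - 1) + 2
INSTANCE_TYPE="${1:-m5.large}"
read -r ENIS IPS <<< "$(aws ec2 describe-instance-types \
  --instance-types "$INSTANCE_TYPE" \
  --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]' \
  --output text)"
echo $(( (ENIS - 1) * (IPS - 1) + 2 ))
```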
Ah, gotcha! We're working on a way to override max-pods per instance type where you can set your own strategy for formulating max-pods. I don't think there's a clean way to detect this so that Karpenter could do the proper calculation for a CIDR split-up.
@tuananh This got so bad that we ended up disabling Prefix Delegation. That solved our problem for now - we aren’t seeing pods that fail to start any more, unless we scale way up and get close to running out of IPs in our subnet.
Right now, we're recommending that you move to VPC CNI Prefix Delegation mode (or the equivalent for whichever CNI you are using), which allows your nodes to have significantly more IPs than without it. This means that if you are using resource requests, max-pods won't be the limiting factor. You can then set your max-pods to some static number like 110, or set up provisioners with different steps of max-pods (50, 110, 200, …).
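For anyone going this route, prefix delegation is toggled on the VPC CNI itself; a minimal sketch follows (max-pods still has to be raised separately, e.g. in the provisioner's kubelet configuration or in user-data):

```bash
# Hand out /28 prefixes per ENI slot instead of individual secondary IPs
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

# Optionally keep one spare prefix attached so pod startup doesn't wait on allocation
kubectl set env daemonset aws-node -n kube-system WARM_PREFIX_TARGET=1
```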
If you're unable to use prefix delegation, the workaround that @jasonaliyetti is doing will help. This is where you calculate the correct max-pods value on instance launch and set the kubelet to use that value. Karpenter will be a little confused during initial provisioning, but it will recover gracefully in the next scheduling cycle. You can calculate the correct max-pods value by using the `/etc/eks/max-pods-calculator.sh` script, which is built into the AL2 EKS Optimized AMI, or you can modify the `eni-max-pods.txt` file, which is also built into the AMI and contains a mapping of instance type to max-pods. Or you can just update the kubelet configuration manually within user-data.
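As a concrete illustration of that user-data approach, something along these lines should work (the cluster name and CNI version below are placeholders to adjust):

```bash
#!/bin/bash
# Compute max-pods with custom networking taken into account, then pass it to the kubelet
MAX_PODS=$(/etc/eks/max-pods-calculator.sh \
  --instance-type-from-imds \
  --cni-version 1.11.4 \
  --cni-custom-networking-enabled)

/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args "--max-pods=${MAX_PODS}"
```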
Absolutely! Having `ENIConfig`s in the cluster is a requirement for the secondary CIDR thing to work in the first place with the VPC CNI. In theory those resources could be applied and not be in use, since the VPC CNI controller also needs to have the `AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG` environment variable set to `true`, but in practice that's probably something that doesn't happen too often. I guess if Karpenter also had a "please calculate max-pods based upon ENIConfig" flag, then that would be ideal. 🙏

@blakepettersson that may be possible. It seems there are some things that Karpenter would need to do in order to support ENIConfig, such as node annotation of the ENIConfig name for the multiple-subnets case too.
For future reference: https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html