karpenter-provider-aws: Karpenter is not aware of the Custom Networking VPC CNI pod limit per node

Version

Karpenter: public.ecr.aws/karpenter/controller:v0.13.2@sha256:af463b2ab0a9b7b1fdf0991ee733dd8bcf5eabf80907f69ceddda28556aead31

Kubernetes: Server Version: v1.21.14-eks-18ef993

Expected Behavior

Karpenter is expected not to exceed the maximum number of pods that can be scheduled on a node.

Actual Behavior

More pods get scheduled onto a node than are supported according to https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/aws/vpc/limits.go#L509. (Screenshot: Screen Shot 2022-08-09 at 3 45 44 PM)

It’s possible that this is because we run two CIDRs for EKS: one CIDR for nodes and a separate one for workloads/pods (VPC CNI custom networking).
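For context, custom networking assigns pod IPs from subnets defined by per-AZ ENIConfig resources (together with AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true on the aws-node DaemonSet). A minimal sketch of such a resource, with placeholder subnet and security group IDs:

```yaml
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1c                       # conventionally named after the availability zone
spec:
  subnet: subnet-0123456789abcdef0       # pod subnet from the secondary CIDR (placeholder)
  securityGroups:
    - sg-0123456789abcdef0               # security group for pod ENIs (placeholder)
```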

Steps to Reproduce the Problem

Resource Specs and Logs

No relevant logs are observed in Karpenter.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v1
  labels:
    app: reviews
    version: v1
spec:
  replicas: 100
  selector:
    matchLabels:
      app: reviews
      version: v1
  template:
    metadata:
      labels:
        app: reviews
        version: v1
    spec:
      serviceAccountName: bookinfo-reviews
      containers:
      - name: reviews
        image: docker.io/istio/examples-bookinfo-reviews-v1:1.16.4
        imagePullPolicy: IfNotPresent
        env:
        - name: LOG_DIR
          value: "/tmp/logs"
        ports:
        - containerPort: 9080
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: wlp-output
          mountPath: /opt/ibm/wlp/output
        securityContext:
          runAsUser: 1000
      volumes:
      - name: wlp-output
        emptyDir: {}
      - name: tmp
        emptyDir: {}

Warning FailedCreatePodSandBox 3m53s (x1089 over 23m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "a2a5a5df3f2010b2e03c0f45ae2129fcfa5b3965d2f5107558a4c7dcefb450b9" network for pod "reviews-v3-7886dd86b9-kk46x": networkPlugin cni failed to set up pod "reviews-v3-7886dd86b9-kk46x_default" network: add cmd: failed to assign an IP address to container


Provisioner:

spec:
  kubeletConfiguration:
    containerRuntime: dockerd
  labels:
    billing: blah
  limits: {}
  providerRef:
    name: al2
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 2592000
status:
  resources:
    attachable-volumes-aws-ebs: "64"
    cpu: "10"
    ephemeral-storage: 484417496Ki
    memory: 32820628Ki
    pods: "126"

Node Information:

labels:
  beta.kubernetes.io/arch: amd64
  beta.kubernetes.io/instance-type: t3a.small
  beta.kubernetes.io/os: linux
  failure-domain.beta.kubernetes.io/region: us-east-1
  failure-domain.beta.kubernetes.io/zone: us-east-1c
  karpenter.k8s.aws/instance-cpu: "2"
  karpenter.k8s.aws/instance-family: t3a
  karpenter.k8s.aws/instance-hypervisor: nitro
  karpenter.k8s.aws/instance-memory: "2048"
  karpenter.k8s.aws/instance-pods: "8"
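For reference, assuming t3a.small's EC2 limits of 2 ENIs with 4 IPv4 addresses per ENI, the two calculations work out to:

```
default ENI-limited pods:                      2 x (4 - 1) + 2 = 8   (the karpenter.k8s.aws/instance-pods value above)
custom networking (one ENI on the node subnet): (2 - 1) x (4 - 1) + 2 = 5
```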

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 12
  • Comments: 50 (18 by maintainers)

Most upvoted comments

@bwagner5 the issue we see (which seems to be the original report on this issue) is simply that Karpenter miscomputes eniLimitedPods because it does not account for one of the ENIs being unavailable for pods, since that ENI is attached to the worker subnet. This causes it to overestimate how many pods a node can support based on ENI limits, and we get into a situation where Karpenter keeps scheduling pods onto a node that has exhausted its ENIs, so the CNI fails to attach new ENIs to support those pods. For instance, on an m5.large it computes the pod count assuming a maximum of 3 ENIs are available for pods, when in fact only 2 are, because the first one is used for the worker subnet rather than the pod subnet.

With what I proposed above, we'd allow setting how many of these ENIs are reserved for non-pod subnets (defaulting to 0 – maybe as an option here) and incorporate this into the eniLimitedPods calculation. Maybe there's more to it than this, but that was my initial read. I'm not sure how the kubelet config would matter in this scenario, since we are not setting max pods there. If this seems viable, my team would be happy to provide a PR with the change.
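To make the m5.large example concrete (assuming its EC2 limits of 3 ENIs with 10 IPv4 addresses per ENI):

```
default ENI-limited pods:                      3 x (10 - 1) + 2 = 29
with one ENI reserved for the worker subnet:   2 x (10 - 1) + 2 = 20
```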

Looking at the inflight work that’s intended to possibly address this, it seems like the solution being proposed by Karpenter is to explicitly define every instance type via a CR and specify max pods on this. This seems undesirable to us since we’d have to configure every single instance type we’d want to use explicitly.

IP prefix allocation is something we’re investigating (we have some internal blockers on this we are working through) but it struck us as odd and inconvenient that the AWS EKS best practice with regard to custom networking wasn’t supported by default here and that we’d have to use such a workaround.

My team is hitting this issue as well. CNI custom networking is fairly common and a recommended practice in EKS (https://aws.github.io/aws-eks-best-practices/reliability/docs/networkmanagement/#run-worker-nodes-and-pods-in-different-subnets). This is blocking Karpenter adoption for us.

We've also looked into using IP prefixes but have some substantial obstacles to implementing them.

Would it be viable to do something simple here, such as adding a toggle or setting that lets you specify the number of ENIs reserved for the node subnet (i.e. something like https://github.com/aws/karpenter/issues/2273#issuecomment-1211292190)? The value could default to zero and be overridden by those who need it, to (significantly) unblock Karpenter adoption. We'd be happy to submit an MR for this. This seems preferable to having to provide complete custom mappings for max-pods by instance type, if that's what's on the roadmap.

https://github.com/aws/karpenter/pull/3516 should address this and will be released in v0.28.0

We recently rolled out Karpenter, then enabled Prefix Delegation mode on two of our clusters, and started to see frequent IP allocation errors. Nodes spin up and a few pods become healthy (at least aws-node and kube-proxy), but then other pods get stuck with the error "failed to assign an IP address to container". It seemed strange because our subnets have plenty of free IPs.

I am guessing we’re running into @sidewinder12s’s issue, where the prefix delegation requires defragmented IP space. @sidewinder12s any advice on how to determine if that’s the case for us?

Have you found a workaround for this issue yet?

Hello, is there any news on this issue?

Right now, we're recommending that you move to VPC CNI Prefix Delegation mode (or the equivalent feature of whichever CNI you are using), which allows your nodes to have significantly more IPs than without it. This means that if you are using resource requests, max-pods won't be the limiting factor. You can then set your max-pods to some static number like 110, or set up provisioners with different steps of max-pods (50, 110, 200, …).
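For reference, prefix delegation is toggled via an environment variable on the aws-node DaemonSet (a sketch of the relevant fragment; it requires VPC CNI v1.9+ on Nitro-based instances):

```yaml
# aws-node DaemonSet pod template fragment (other fields omitted)
spec:
  template:
    spec:
      containers:
        - name: aws-node
          env:
            - name: ENABLE_PREFIX_DELEGATION   # assign /28 IPv4 prefixes instead of individual IPs
              value: "true"
```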

If you're unable to use prefix delegation, the workaround that @jasonaliyetti is doing will help. This is where you calculate the correct max-pods value on instance launch and set the kubelet to use that value. Karpenter will be a little confused during initial provisioning, but it will recover gracefully in the next scheduling cycle. You can calculate the correct max-pods value with the /etc/eks/max-pods-calculator.sh script built into the AL2 EKS Optimized AMI, modify eni-max-pods.txt (also built into the AMI), which maps instance types to max-pods, or simply update the kubelet configuration manually within user data.
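A minimal sketch of the static override on a v1alpha5 Provisioner (110 is just the example value from above; the AWSNodeTemplate name is a placeholder):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  kubeletConfiguration:
    containerRuntime: containerd
    maxPods: 110          # only safe when prefix delegation supplies enough IPs per node
  providerRef:
    name: al2             # placeholder AWSNodeTemplate
  ttlSecondsAfterEmpty: 30
```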

FYI, this might not be a simple ask. If your cluster consumes most of the IPs in a subnet, you can run into issues enabling prefix delegation, because AWS needs contiguous (defragmented) IP space to allocate a prefix. You'll then get failures launching nodes unless you drain the subnets almost entirely. There is no built-in way to see this fragmentation; it is poorly documented by AWS and was a hard blocker for us enabling this in our existing large environments.

I do this dynamically for all instance_types in all provisioners:

data "aws_ec2_instance_type" "cluster" {
  for_each = toset(local.provisioner_instance_types)
  instance_type = each.value
}

locals {
  provisioner_instance_types = distinct(compact(flatten(
    [for provisioner in var.karpenter_provisioners : provisioner.instance_types]
  )))

  provisioner_instance_types_features = {
    for name, instance_type in data.aws_ec2_instance_type.cluster :
    name => {
      # maximum number of pods for AWS CNI custom networking, assuming only nitro instance types
      # based on: https://github.com/awslabs/amazon-eks-ami/blob/master/files/max-pods-calculator.sh
      max_pods = ((instance_type.maximum_network_interfaces - 1) * (instance_type.maximum_ipv4_addresses_per_interface - 1)) + 2
    }
  }
}

resource "kubectl_manifest" "karpenter_provisioners" {
  for_each = { for provisioner in var.karpenter_provisioners : provisioner.name => provisioner }

  yaml_body = <<-YAML
  apiVersion: karpenter.sh/v1alpha5
  kind: Provisioner
  ....
  spec:
    kubeletConfiguration:
      containerRuntime: containerd
      maxPods: ${min([for instance_type in each.value.instance_types : local.provisioner_instance_types_features[instance_type]["max_pods"]]...)}
  YAML
}

Ah, gotcha! We're working on a way to override max-pods per instance type, where you can set your own strategy for computing max-pods. I don't think there's a clean way to detect this so that Karpenter could do the proper calculation for a CIDR split-up.

@tuananh This got so bad that we ended up disabling Prefix Delegation. That solved our problem for now - we aren’t seeing pods that fail to start any more, unless we scale way up and get close to running out of IPs in our subnet.


If Karpenter discovered ENIConfigs in the cluster and switched our calculation, would that solve your use case? I'm not super familiar with ENIConfig, so I may be missing something with regard to how common these resources are.

Absolutely! Having ENIConfigs in the cluster is a requirement for secondary CIDRs to work with the VPC CNI in the first place. In theory those resources could be applied and not be in use, since the VPC CNI controller also needs the AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG environment variable set to true, but in practice that probably doesn't happen too often. I guess if Karpenter also had a "please calculate max-pods based upon ENIConfig" flag, that would be ideal. 🙏

@blakepettersson that may be possible. It seems there are some things Karpenter would need to do in order to support ENIConfig, such as annotating nodes with the ENIConfig name for the multiple-subnets case too.
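For reference, this is the per-node annotation the VPC CNI reads (by default) to select an ENIConfig; the node and ENIConfig names here are placeholders:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal             # placeholder node name
  annotations:
    k8s.amazonaws.com/eniConfig: us-east-1c   # ENIConfig to use for this node's pod ENIs
```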

For future reference: https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html