amazon-vpc-cni-k8s: aws-eks-nodeagent in CrashLoopBackOff on addon update

Hi all,

We tried to update the vpc-cni addon from v1.13.4-eksbuild.1 to v1.14.0-eksbuild.3, and the update fails with the aws-eks-nodeagent container in CrashLoopBackOff.

We are running on EKS 1.27 and not setting any config for aws-eks-nodeagent.

Running kubectl logs aws-node-98rkh aws-eks-nodeagent -n kube-system prints only:

{"level":"info","ts":"2023-09-04T02:21:39Z","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}

So we aren’t sure what causes the container to crash. We pass the following configuration to the vpc-cni addon:

  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: "true"
  ENI_CONFIG_LABEL_DEF: topology.kubernetes.io/zone
  ENABLE_PREFIX_DELEGATION: "true"
  WARM_ENI_TARGET: "2"
  WARM_PREFIX_TARGET: "4"
  AWS_VPC_K8S_CNI_EXTERNALSNAT: "true"

Otherwise it's all defaults.
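For context, this kind of configuration is roughly what gets passed to the managed addon. The following is only a sketch, assuming the vpc-cni addon's configuration-values schema exposes a top-level env map; the cluster name is a placeholder:

  aws eks update-addon \
    --cluster-name my-cluster \
    --addon-name vpc-cni \
    --addon-version v1.14.0-eksbuild.3 \
    --configuration-values '{
      "env": {
        "AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG": "true",
        "ENI_CONFIG_LABEL_DEF": "topology.kubernetes.io/zone",
        "ENABLE_PREFIX_DELEGATION": "true",
        "WARM_ENI_TARGET": "2",
        "WARM_PREFIX_TARGET": "4",
        "AWS_VPC_K8S_CNI_EXTERNALSNAT": "true"
      }
    }'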

About this issue

  • State: closed
  • Created 10 months ago
  • Reactions: 22
  • Comments: 29 (11 by maintainers)

Most upvoted comments

Thanks for providing this debugging information. Currently, the way to change the port that the node agent binds to for metrics is to pass the metrics-bind-addr command line argument to the node agent container.

We are working on making this configurable through the managed addon utility, and are deciding whether we should pass this flag with a different default than 8080.

We also ran into this issue. It appears to be caused by a port conflict between the node-local-dns cache and the network policy agent. NodeLocal DNS cache binds to port 8080 on the host (https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml#L165). The network policy agent also attempts to bind to that port:

  {"level":"info","timestamp":"2023-09-05T17:39:15.026Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
  {"level":"error","timestamp":"2023-09-05T17:39:15.026Z","logger":"controller-runtime.metrics","msg":"metrics server failed to listen. You may want to disable the metrics server or use another port if it is due to conflicts","error":"error listening on :8080: listen tcp :8080: bind: address already in use","stacktrace":"sigs.k8s.io/controller-runtime/pkg/metrics.NewListener\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/metrics/listener.go:48\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/manager/manager.go:455\nmain.main\n\t/workspace/main.go:86\nruntime.main\n\t/root/sdk/go1.20.4/src/runtime/proc.go:250"}
  {"level":"error","timestamp":"2023-09-05T17:39:15.026Z","logger":"setup","msg":"unable to create controller manager","error":"error listening on :8080: listen tcp :8080: bind: address already in use","stacktrace":"main.main\n\t/workspace/main.go:88\nruntime.main\n\t/root/sdk/go1.20.4/src/runtime/proc.go:250"}

To reproduce: set up NodeLocal DNS on a cluster with VPC CNI 1.13, then attempt to upgrade to 1.14.
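A quick way to confirm the conflict on an affected node (assuming shell or SSM access to the worker node) is to check what is already bound to 8080:

  # run on the worker node; shows the process currently holding port 8080
  sudo ss -tlnp | grep ':8080'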

The node-local-dns cache is an essential component in production Kubernetes clusters. Is there a way to modify what port the network policy agent binds to in the CNI?

Yes, this will be done as part of 1.14.1 release.

Here, we solved it by passing metrics-bind-addr in the args:

example: --metrics-bind-addr=:9985

But this is painful, because we apply the add-ons with the EKS Terraform module, which does not support passing args.
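For anyone applying this by hand, here is a minimal sketch of where that flag lands in the aws-node DaemonSet (container name as seen earlier in this thread; other args omitted, and note that a managed addon update may revert manual edits):

  # kubectl -n kube-system edit daemonset aws-node
  containers:
    - name: aws-eks-nodeagent
      args:
        - --metrics-bind-addr=:9985   # any free port works; 9985 taken from the example above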

@yurrriq As called out above, the issue is due to a port conflict with another application on the cluster. While 8080 is the default, it is configurable via a flag, and you will be able to modify the port during the addon upgrade. No matter what port we pick as the default, it can potentially conflict with some application on the user's end. We do understand that 8080 is a popular default for quite a few applications out there, so we're moving to a random default port to (hopefully) avoid these conflicts.

@yurrriq AWS does not maintain Terraform, and while we strive never to introduce new versions that could lead to any breakage, it can happen. The issue here is a port conflict with other applications, and we cannot control what ports other applications listen on. To decrease the likelihood of conflicts, we are changing the default port that the node agent uses to one that is less likely to conflict. Similarly, one can change the port that these other applications, such as NodeLocal DNS, use.
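A sketch of that route, assuming the upstream nodelocaldns.yaml manifest where the CoreDNS health plugin is what binds :8080 on the node: move the health endpoint to a free port in the node-local-dns Corefile, and keep the DaemonSet's livenessProbe port in sync.

  # node-local-dns Corefile excerpt (cluster.local block); the upstream default port is 8080
  health __PILLAR__LOCAL__DNS__:8081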

The release note calls this out as a breaking change since there is a potential for port conflicts with other Kubernetes applications, but there is no breaking change from a Kubernetes/EKS API standpoint or from previous VPC CNI versions. That is why this is a minor change.

Setting VpcCni.enableNetworkPolicy=false does not prevent the node agent from starting and binding to that port for metrics. That flag is consumed by the controller. To modify the metrics port used by the node agent, you have to pass metrics-bind-addr, as mentioned above. We plan to have the default changed and made configurable through the managed addon (MAO) in the next week.

Can also confirm we conflicted with node-local-dns.

Also, please share how you upgraded the addon and the output of the amazon-vpc-cni ConfigMap in the kube-system namespace.
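For example:

  kubectl get configmap amazon-vpc-cni -n kube-system -o yaml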

👋 We are running the vpc-cni addon on EKS 1.27 with EKS-managed Bottlerocket node groups on AMI release version 1.14.3-764e37e4.

Our addon update usually goes through eksctl, but I also tried manually via:

aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni --addon-version v1.14.0-eksbuild.3

YAML version of the ConfigMap:

apiVersion: v1
data:
  enable-network-policy-controller: "false"
  enable-windows-ipam: "false"
kind: ConfigMap
metadata:
  creationTimestamp: "2023-09-04T00:58:57Z"
  labels:
    app.kubernetes.io/instance: aws-vpc-cni
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-node
    app.kubernetes.io/version: v1.14.0
    helm.sh/chart: aws-vpc-cni-1.14.0
    k8s-app: aws-node
  name: amazon-vpc-cni
  namespace: kube-system
  resourceVersion: "46392531"
  uid: 74f00062-cc21-43d6-8641-1eacc1c495aa

We have not installed any CRDs, assuming this is managed by the addon. I would not expect to need any CRDs as long as we do not enable the network policy controller.
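As a side note, whether the network policy CRD is installed can be checked with something like the following (the PolicyEndpoint CRD name is assumed from the aws-network-policy-agent project, not stated in this thread):

  kubectl get crd policyendpoints.networking.k8s.aws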

This is happening during an EKS cluster upgrade from 1.26 -> 1.27. The worker node AMI is still from 1.26. I'm going to update it to the latest 1.27 AMI and check whether that helps.

Tried with the latest 1.27.4-20230825 node AMI; the issue is still there. Nothing in the logs of the aws-network-policy-agent, which is in CrashLoopBackOff:

│ {"level":"info","ts":"2023-09-04T12:42:45Z","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}

Had to downgrade the addon back to v1.13.4-eksbuild.1.
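For anyone else debugging this, the previous container instance's logs and the pod events usually show the actual crash reason when the current log only prints the version line (pod name reused from earlier in this thread):

  kubectl logs aws-node-98rkh -c aws-eks-nodeagent -n kube-system --previous
  kubectl describe pod aws-node-98rkh -n kube-system   # check Last State / exit code and events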

@kramuenke-catch How did you upgrade the addon? Managed addon, via Helm, or via kubectl commands? Did you enable network policy support? EKS AMI or custom AMI?

If you enabled network policy support, have you looked at the prerequisites and/or tried following the installation steps documented here: https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html ?

For issues with the EKS Node Agent, you can open a ticket in the network policy agent repo instead: https://github.com/aws/aws-network-policy-agent