amazon-vpc-cni-k8s: aws-eks-nodeagent in CrashLoopBackOff on addon update
Hi all,
We tried to update the vpc-cni addon from v1.13.4-eksbuild.1 to v1.14.0-eksbuild.3, and the update fails with the aws-eks-nodeagent container in CrashLoopBackOff.
We are running EKS 1.27 and are not setting any configuration for aws-eks-nodeagent.
Running `kubectl logs aws-node-98rkh aws-eks-nodeagent -n kube-system` prints:
{"level":"info","ts":"2023-09-04T02:21:39Z","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
So we aren’t sure what causes the container to crash. We pass the following configuration to the vpc-cni addon:
AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: "true"
ENI_CONFIG_LABEL_DEF: topology.kubernetes.io/zone
ENABLE_PREFIX_DELEGATION: "true"
WARM_ENI_TARGET: "2"
WARM_PREFIX_TARGET: "4"
AWS_VPC_K8S_CNI_EXTERNALSNAT: "true"
Otherwise it's all defaults.
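For completeness, this is roughly the shape in which we pass those settings to the managed addon as configuration values. This is only a sketch assuming the addon schema exposes the CNI environment variables under an `env` block; the schema for a given addon version can be checked with `aws eks describe-addon-configuration`.

```yaml
# Sketch only: configuration values for the vpc-cni managed addon,
# assuming the schema exposes the CNI settings under an "env" block.
# Verify the schema for your version with:
#   aws eks describe-addon-configuration --addon-name vpc-cni --addon-version v1.14.0-eksbuild.3
env:
  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: "true"
  ENI_CONFIG_LABEL_DEF: topology.kubernetes.io/zone
  ENABLE_PREFIX_DELEGATION: "true"
  WARM_ENI_TARGET: "2"
  WARM_PREFIX_TARGET: "4"
  AWS_VPC_K8S_CNI_EXTERNALSNAT: "true"
```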
About this issue
- State: closed
- Created 10 months ago
- Reactions: 22
- Comments: 29 (11 by maintainers)
Thanks for providing this debugging information. Currently, the way to change the port that the node agent binds to for metrics is to pass the `metrics-bind-addr` command line argument to the node agent container. We are working on making this configurable through the managed addon utility, and are deciding whether we should pass this flag with a different default than 8080.
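For illustration, a minimal sketch of what the node agent container entry in the aws-node DaemonSet could look like with that flag set. The container name is the one shown in this thread; the port is an arbitrary free host port, and any args and image your installed manifest already uses must be kept as they are:

```yaml
# Sketch: excerpt of the aws-node DaemonSet pod spec. The node agent
# container is given an explicit, non-default metrics port so it no
# longer collides with whatever is already bound to :8080 on the host.
containers:
  - name: aws-eks-nodeagent
    image: <your existing aws-eks-nodeagent image>  # placeholder; keep your current image
    args:
      - --metrics-bind-addr=:9985   # any free port on the node
```

Because aws-node runs with host networking, the port chosen here has to be free on every node, not just inside the pod.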
We also ran into this issue. It appears to be caused by a port conflict between the node-local-dns cache and the network policy agent. The node-local-dns cache binds to port 8080 on the host (https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml#L165), and the network policy agent also attempts to bind to that port:
{"level":"info","timestamp":"2023-09-05T17:39:15.026Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"} {"level":"error","timestamp":"2023-09-05T17:39:15.026Z","logger":"controller-runtime.metrics","msg":"metrics server failed to listen. You may want to disable the metrics server or use another port if it is due to conflicts","error":"error listening on :8080: listen tcp :8080: bind: address already in use","stacktrace":"sigs.k8s.io/controller-runtime/pkg/metrics.NewListener\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/metrics/listener.go:48\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/manager/manager.go:455\nmain.main\n\t/workspace/main.go:86\nruntime.main\n\t/root/sdk/go1.20.4/src/runtime/proc.go:250"} {"level":"error","timestamp":"2023-09-05T17:39:15.026Z","logger":"setup","msg":"unable to create controller manager","error":"error listening on :8080: listen tcp :8080: bind: address already in use","stacktrace":"main.main\n\t/workspace/main.go:88\nruntime.main\n\t/root/sdk/go1.20.4/src/runtime/proc.go:250"}
To reproduce: set up node-local-dns on a cluster with VPC CNI 1.13, then attempt to upgrade to 1.14. The node-local-dns cache is an essential component in production Kubernetes clusters. Is there a way to modify what port the network policy agent binds to in the CNI?
Yes, this will be done as part of the 1.14.1 release.
Here, we solved it by passing `metrics-bind-addr` in the args, for example: `--metrics-bind-addr=:9985`. But this is painful, because we are applying the add-ons with the EKS Terraform module, which does not support passing `args`.
@yurrriq As called out above, the issue is due to a port conflict with another application on the cluster. While 8080 is the default, it is configurable via a flag, and you will be able to modify the port during the addon upgrade. No matter what port we pick as the default, it can potentially conflict with some application on the user end. We do understand that 8080 is a popular default option for quite a few applications out there, so we're moving to a random default port to (hopefully) avoid these conflicts.

@yurrriq AWS does not maintain Terraform, and while we strive to never introduce new versions that could lead to any breakage, it can happen. The issue here is a port conflict with other applications, and we cannot control what ports other applications listen on. To decrease the likelihood of conflicts, we are changing the default port that the node agent uses to one that is less likely to conflict. Similarly, one can change the port that these other applications, such as NodeLocal DNS, use.
The release note calls this out as a breaking change since there is a potential for port conflicts with other Kubernetes applications, but there is no breaking change from a Kubernetes/EKS API standpoint or from previous VPC CNI versions. That is why this is a minor change.
Setting `VpcCni.enableNetworkPolicy=false` does not prevent the node agent from starting and binding to that port for metrics; that flag is consumed by the controller. To modify the metrics port used by the node agent, you have to pass the `metrics-bind-addr` flag, as mentioned above (a small sketch follows below). We plan to have the default changed and configurable through MAO in the next week.

Also can confirm we also conflicted with node-local-dns.
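To illustrate that distinction, a small sketch of addon configuration values, assuming an enableNetworkPolicy key shaped like the `VpcCni.enableNetworkPolicy` value referenced above (the key name and placement are assumptions): the toggle governs the controller's enforcement, not where the node agent's metrics listener binds.

```yaml
# Sketch only (key name assumed from the VpcCni.enableNetworkPolicy value above):
# turning enforcement off is consumed by the network policy controller and does
# NOT stop the node agent from binding its metrics port; only the node agent's
# --metrics-bind-addr flag moves that listener.
enableNetworkPolicy: "false"
```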
👋 We are running the vpc-cni addon on an EKS 1.27 cluster with EKS managed Bottlerocket node groups on AMI release version 1.14.3-764e37e4.
Our addon update usually goes through eksctl, but I also tried manually via:
yaml version of the config map:
We have not installed any CRDs, assuming this is managed by the addon. I would not expect to require any CRDs as long as we do not enable the network policy controller.
Tried with the latest 1.27.4-20230825 node AMI; the issue is still there. There is nothing in the logs of the aws-network-policy-agent, which is in CrashLoopBackOff, beyond:
{"level":"info","ts":"2023-09-04T12:42:45Z","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
Had to downgrade the addon back to v1.13.4-eksbuild.1.
@kramuenke-catch How did you upgrade the addon? Managed addon, via Helm, or via kubectl commands? Did you enable network policy support? EKS AMI or custom AMI?
If you enabled Network Policy support, have you looked at the pre-reqs and/or tried following the installation steps documented here - https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html ?
For issues with the EKS node agent, you can open a ticket in the network policy agent repo instead: https://github.com/aws/aws-network-policy-agent