amazon-vpc-cni-k8s: vpc-cni v1.12.6 and v1.12.5 not working after update to EKS 1.27 if cluster has prometheus-adapter installed

What happened: After updating EKS cluster to 1.27 vpc-cni (installed via add-ons) stopped to work. There are no informative logs [debug severity] in /var/log/aws-routed-eni/ipamd.log on the node, and keeps restarting.

Tried several versions (installed manually from the release page) and found that: 1.12.6 - not working 1.12.5 - not working 1.12.2 - working 1.11.5 - working

No IPv6 support enabled for cluster Additional settings of VPC-CNI:

ENABLE_PREFIX_DELEGATION = "true"
WARM_PREFIX_TARGET       = "1"

Security groups is not the issue as all the changes performed on the same node, just different vpc-cni versions.

Here is logs when not working:

{"level":"info","ts":"2023-05-25T23:55:21.458Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2023-05-25T23:55:21.459Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"info","ts":"2023-05-25T23:55:21.481Z","caller":"wait/wait.go:222","msg":"Successful communication with the Cluster! Cluster Version is: v1.27+. git version: v1.27.1-eks-2f008fe. git tree state: clean. commit: abfec7d7e55d56346a5259c9379dea9f56ba2926. platform: linux/amd64"}
{"level":"info","ts":"2023-05-25T23:55:42.133Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2023-05-25T23:55:42.133Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2023-05-25T23:55:42.141Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2023-05-25T23:55:42.142Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"info","ts":"2023-05-25T23:55:42.152Z","caller":"wait/wait.go:222","msg":"Successful communication with the Cluster! Cluster Version is: v1.27+. git version: v1.27.1-eks-2f008fe. git tree state: clean. commit: abfec7d7e55d56346a5259c9379dea9f56ba2926. platform: linux/amd64"}
{"level":"info","ts":"2023-05-25T23:56:12.100Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2023-05-25T23:56:12.101Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2023-05-25T23:56:12.109Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2023-05-25T23:56:12.109Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"info","ts":"2023-05-25T23:56:12.124Z","caller":"wait/wait.go:222","msg":"Successful communication with the Cluster! Cluster Version is: v1.27+. git version: v1.27.1-eks-2f008fe. git tree state: clean. commit: abfec7d7e55d56346a5259c9379dea9f56ba2926. platform: linux/amd64"}
{"level":"info","ts":"2023-05-25T23:56:56.127Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2023-05-25T23:56:56.127Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2023-05-25T23:56:56.136Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2023-05-25T23:56:56.137Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"info","ts":"2023-05-25T23:56:56.148Z","caller":"wait/wait.go:222","msg":"Successful communication with the Cluster! Cluster Version is: v1.27+. git version: v1.27.1-eks-2f008fe. git tree state: clean. commit: abfec7d7e55d56346a5259c9379dea9f56ba2926. platform: linux/amd64"}
{"level":"info","ts":"2023-05-25T23:58:27.108Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2023-05-25T23:58:27.108Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2023-05-25T23:58:27.119Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2023-05-25T23:58:27.119Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"info","ts":"2023-05-25T23:58:27.130Z","caller":"wait/wait.go:222","msg":"Successful communication with the Cluster! Cluster Version is: v1.27+. git version: v1.27.1-eks-2f008fe. git tree state: clean. commit: abfec7d7e55d56346a5259c9379dea9f56ba2926. platform: linux/amd64"}
{"level":"info","ts":"2023-05-26T00:01:10.091Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2023-05-26T00:01:10.091Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2023-05-26T00:01:10.103Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2023-05-26T00:01:10.104Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"info","ts":"2023-05-26T00:01:10.121Z","caller":"wait/wait.go:222","msg":"Successful communication with the Cluster! Cluster Version is: v1.27+. git version: v1.27.1-eks-2f008fe. git tree state: clean. commit: abfec7d7e55d56346a5259c9379dea9f56ba2926. platform: linux/amd64"}
{"level":"info","ts":"2023-05-26T00:05:17.506Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2023-05-26T00:05:17.509Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2023-05-26T00:05:17.518Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2023-05-26T00:05:17.520Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}

And after upgrade to the working version:

{"level":"info","ts":"2023-05-26T00:05:19.528Z","caller":"wait/wait.go:211","msg":"Successful communication with the Cluster! Cluster Version is: v1.27+. git version: v1.27.1-eks-2f008fe. git tree state: clean. commit: abfec7d7e55d56346a5259c9379dea9f56ba2926. platform: linux/amd64"}
{"level":"warn","ts":"2023-05-26T00:05:19.562Z","caller":"awssession/session.go:64","msg":"HTTP_TIMEOUT env is not set or set to less than 10 seconds, defaulting to httpTimeout to 10sec"}
{"level":"debug","ts":"2023-05-26T00:05:19.567Z","caller":"ipamd/ipamd.go:379","msg":"Discovered region: us-east-1"}
{"level":"info","ts":"2023-05-26T00:05:19.567Z","caller":"ipamd/ipamd.go:379","msg":"Custom networking enabled false"}
{"level":"debug","ts":"2023-05-26T00:05:19.567Z","caller":"awsutils/awsutils.go:415","msg":"Found availability zone: us-east-1b "}
{"level":"debug","ts":"2023-05-26T00:05:19.568Z","caller":"awsutils/awsutils.go:415","msg":"Discovered the instance primary IPv4 address: 172.16.39.233"}
{"level":"debug","ts":"2023-05-26T00:05:19.568Z","caller":"awsutils/awsutils.go:415","msg":"Found instance-id: i-04ece13897592ae92 "}
{"level":"debug","ts":"2023-05-26T00:05:19.569Z","caller":"awsutils/awsutils.go:415","msg":"Found instance-type: t4g.medium "}
{"level":"debug","ts":"2023-05-26T00:05:19.570Z","caller":"awsutils/awsutils.go:415","msg":"Found primary interface's MAC address: 02:5b:88:ee:f2:93"}
{"level":"debug","ts":"2023-05-26T00:05:19.570Z","caller":"awsutils/awsutils.go:415","msg":"eni-00aa95a4332192cde is the primary ENI of this instance"}
{"level":"debug","ts":"2023-05-26T00:05:19.571Z","caller":"awsutils/awsutils.go:415","msg":"Found subnet-id: subnet-0de276a77839b4579 "}
{"level":"debug","ts":"2023-05-26T00:05:19.571Z","caller":"ipamd/ipamd.go:388","msg":"Using WARM_ENI_TARGET 1"}
{"level":"debug","ts":"2023-05-26T00:05:19.571Z","caller":"ipamd/ipamd.go:390","msg":"Using MINIMUM_IP_TARGET 1"}
{"level":"debug","ts":"2023-05-26T00:05:19.571Z","caller":"ipamd/ipamd.go:391","msg":"Using WARM_PREFIX_TARGET 1"}
{"level":"info","ts":"2023-05-26T00:05:19.571Z","caller":"ipamd/ipamd.go:409","msg":"Prefix Delegation enabled false"}
{"level":"debug","ts":"2023-05-26T00:05:19.571Z","caller":"ipamd/ipamd.go:414","msg":"Start node init"}
{"level":"debug","ts":"2023-05-26T00:05:19.571Z","caller":"ipamd/ipamd.go:446","msg":"Max ip per ENI 5 and max prefixes per ENI 0"}
{"level":"info","ts":"2023-05-26T00:05:19.572Z","caller":"awsutils/awsutils.go:1643","msg":"Will attempt to clean up AWS CNI leaked ENIs after waiting 4m41s."}
{"level":"info","ts":"2023-05-26T00:05:19.573Z","caller":"ipamd/ipamd.go:456","msg":"Setting up host network... "}
{"level":"debug","ts":"2023-05-26T00:05:19.573Z","caller":"networkutils/network.go:280","msg":"Trying to find primary interface that has mac : 02:5b:88:ee:f2:93"}
{"level":"debug","ts":"2023-05-26T00:05:19.576Z","caller":"networkutils/network.go:280","msg":"Discovered interface: lo, mac: "}
...

Environment:

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:“1”, Minor:“27+”, GitVersion:“v1.27.1-eks-2f008fe”, GitCommit:“abfec7d7e55d56346a5259c9379dea9f56ba2926”, GitTreeState:“clean”, BuildDate:“2023-04-14T20:40:28Z”, GoVersion:“go1.20.3”, Compiler:“gc”, Platform:“linux/amd64”}
  • CNI Version: 1.12.6
  • OS (e.g: cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Kernel (e.g. uname -a): Linux ip-172-16-39-233.ec2.internal 5.10.178-162.673.amzn2.aarch64 #1 SMP Mon Apr 24 23:34:12 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 29 (8 by maintainers)

Most upvoted comments

Closing as this is fixed in v1.13.2 release

Currently seeing the same thing on a cluster I just upgraded to 1.27.

EDIT: Thanks @sirantd, downgrading to 1.12.2 fixed it. Spent my entire day troubleshooting this.

Awesome debugging work @sirantd ! The IPAM daemon does run a metrics server at startup: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/cmd/aws-k8s-agent/main.go#L71, and if that crashes from a bad response, it does look it would take the IPAM daemon and the aws-node pod down with it. That runs before the RPC server is started, so the logging lines up.

To resolve this on our end, it sounds like we need to update our client-go version. Note that if you do not need prometheus metrics, the metrics server can be disabled by setting DISABLE_METRICS environment variable. I also want to investigate how we can tolerate this failure without taking down the IPAM daemon.

@sirantd I do indeed! Excellent detective work. So if I’m understanding correctly, this issue is a combination of Kubernetes 1.27, Prometheus Adapter, and a specific range of client-go versions?

So, from my understanding prometheus-adapter generates a bad response. At some point, go-client library started to crash parsing it. It has been fixed but requires a lot of projects to rebuild using the patched library version. Or, maintainers of prometheus-adapter could be so kind as to issue a new version with a fix. Of course, some other components could also generate bad responses but they are more rarely used so I don’t know about them.

The client-go upgrade that will fix this crash merged in https://github.com/aws/amazon-vpc-cni-k8s/pull/2396 and will ship in v1.13.1, which is planned for the end of this month