amazon-vpc-cni-k8s: IPAMD fails to start
What happened: IPAMD fails to start with an iptables error. The aws-node pods fail to start and prevent worker nodes from going Ready. This started after updating to Rocky Linux 8.5, which is based on RHEL 8.5.
From /var/log/aws-routed-eni/ipamd.log:
{"level":"error","ts":"2022-02-04T14:38:08.239Z","caller":"networkutils/network.go:385","msg":"ipt.NewChain error for chain [AWS-SNAT-CHAIN-0]: running [/usr/sbin/iptables -t nat -N AWS-SNAT-CHAIN-0 --wait]: exit status 3: iptables v1.8.4 (legacy): can't initialize iptables table `nat': Table does not exist (do you need to insmod?)\nPerhaps iptables or your kernel needs to be upgraded.\n"}
Pod logs (kubectl logs -n kube-system aws-node-9tqb6):
{"level":"info","ts":"2022-02-04T15:11:48.035Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-02-04T15:11:48.036Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-02-04T15:11:48.062Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-02-04T15:11:48.071Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-02-04T15:11:50.092Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:52.103Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:54.115Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:56.124Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
What you expected to happen: ipamd to start normally.
How to reproduce it (as minimally and precisely as possible): Deploy an EKS cluster with an AMI based on Rocky Linux 8.5. In theory, any RHEL 8.5-based AMI could have this problem.
Anything else we need to know?: Running the iptables command from the ipamd log as root on the worker node works fine.
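A quick way to compare what the node and the container see (a diagnostic sketch; the pod name is taken from the logs above and the container name may differ by CNI version):

```sh
# On the worker node: check which iptables backend is in use and whether
# the legacy nat table's kernel modules are loaded.
iptables -V                       # prints "(legacy)" or "(nf_tables)"
lsmod | grep -E 'ip_tables|iptable_nat'

# Inside the aws-node container, list the nat table the same way ipamd does.
kubectl exec -n kube-system aws-node-9tqb6 -c aws-node -- iptables -t nat -L -n
```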
Environment:
- Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.15-eks-9c63c4", GitCommit:"9c63c4037a56f9cad887ee76d55142abd4155179", GitTreeState:"clean", BuildDate:"2021-10-20T00:21:03Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
- CNI: 1.10.1
- OS (e.g. cat /etc/os-release): NAME="Rocky Linux" VERSION="8.5 (Green Obsidian)" ID="rocky" ID_LIKE="rhel centos fedora" VERSION_ID="8.5" PLATFORM_ID="platform:el8" PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)" ANSI_COLOR="0;32" CPE_NAME="cpe:/o:rocky:rocky:8:GA" HOME_URL="https://rockylinux.org/" BUG_REPORT_URL="https://bugs.rockylinux.org/" ROCKY_SUPPORT_PRODUCT="Rocky Linux" ROCKY_SUPPORT_PRODUCT_VERSION="8"
- Kernel (e.g. uname -a): Linux ip-10-2--xx-xxx.ec2.xxxxxxxx.com 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Wed Jan 19 17:53:40 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 25
- Comments: 60 (13 by maintainers)
I experienced the same issue. In our cluster, the `kube-proxy` version was too old. After updating it to a version compatible with the cluster version, the nodes started up fine.
Had the same issue while upgrading, but after looking at the troubleshooting guide and patching the daemonset with the following, `aws-node` came up as expected and without issues.
It happened in my case because the aws-node daemonset was missing the permissions to manage the IP addresses of nodes and pods. The daemonset uses the Kubernetes service account named `aws-node`. I solved it by creating an IAM role with AmazonEKS_CNI_Policy and attaching the role to the service account. To attach the role, add an annotation to the `aws-node` service account and restart the daemonset; a sketch follows below.
As mentioned in some answers, it is not good security practice to attach the AmazonEKS_CNI_Policy to the nodes directly; see https://aws.github.io/aws-eks-best-practices/networking/vpc-cni/#use-separate-iam-role-for-cni for more details.
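A sketch of that setup using eksctl (assuming an IAM OIDC provider is already associated with the cluster; the cluster name is a placeholder):

```sh
# Create an IAM role with AmazonEKS_CNI_Policy and annotate the aws-node
# service account with it (IRSA), then restart the daemonset.
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace kube-system \
  --name aws-node \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy \
  --override-existing-serviceaccounts \
  --approve

kubectl rollout restart daemonset aws-node -n kube-system
```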
Hey all 👋🏼 please be aware that this failure mode also happens when the IPs in a subnet are exhausted.
I just faced this and noticed I had misconfigured my worker groups to use a small subnet (/26) instead of the bigger one I intended to use (/18).
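One way to check whether a worker subnet is running out of addresses (a sketch; the subnet ID is a placeholder):

```sh
# Show how many free IPs are left in the worker subnet.
aws ec2 describe-subnets \
  --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].{Id:SubnetId,Cidr:CidrBlock,FreeIps:AvailableIpAddressCount}' \
  --output table
```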
For those coming here after upgrading EKS, try re-applying the VPC CNI manifest, for example:
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.3/config/master/aws-k8s-cni.yaml
We found an alternative way of fixing it by updating iptables inside the CNI container image.
My concern is that RHEL and its downstream distros seem to be moving away from iptables-legacy and toward iptables-nft. Are there any plans to address this in the CNI container image?
Same here. This is from upgrading Kubernetes 1.21 -> 1.25 on AWS EKS, where `aws-node` failed to start with these logs:
kubectl logs of `aws-node` did not reveal much:
I had to log into the aws-node pod manually (from https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md):
kubectl exec -it aws-node-vd5r8 -n kube-system -c aws-eks-nodeagent /bin/bash
and find the log file under /var/log/aws-routed-eni; it should be a file called ipamd*.log.
The only thing I did not upgrade in the cluster was `kube-proxy`, so I assumed that was the issue, and it was; glad someone else had the same experience. Make sure to go through each of these… I did not have add-ons for anything, so I had to go through the self-managed route, which sucks. I hope there is maybe a way to go from self-managed to add-ons?
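For reference, a minimal sketch of bumping a self-managed kube-proxy after a cluster upgrade (the registry account/region and image tag below are placeholders; the EKS documentation lists the correct image for each cluster version and region):

```sh
# Point the self-managed kube-proxy daemonset at an image matching the new
# cluster version, then wait for the rollout.
kubectl set image daemonset/kube-proxy -n kube-system \
  kube-proxy=602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.25.6-eksbuild.1

kubectl rollout status daemonset/kube-proxy -n kube-system
```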
@ermiaqasemi From this tutorial I chose to attach the `AmazonEKS_CNI_Policy` to the `aws-node` service account, and I was getting the error. I decided to try simply attaching it to the `AmazonEKSNodeRole`, which apparently is the less recommended way to do it, but it works.
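For completeness, attaching the policy to the node role looks roughly like this (the role name is illustrative; the IRSA approach described earlier is the recommended one):

```sh
# Attach the CNI policy directly to the node instance role (less recommended).
aws iam attach-role-policy \
  --role-name AmazonEKSNodeRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
```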
policy/AmazonEKS_CNI_Policy-2022092909143815010000000b
My policy only allowed IPV6 like below.I changed the policy like below:
and it works! 😅
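The policy documents themselves are not shown above, but for illustration, a hedged sketch of adding the IPv4 address-management actions to a custom CNI policy might look like this (the account ID is a placeholder and the action list is abridged; the managed AmazonEKS_CNI_Policy contains the full list):

```sh
# Publish a new default policy version that also allows the IPv4 actions
# ipamd needs (abridged for illustration).
aws iam create-policy-version \
  --policy-arn arn:aws:iam::111122223333:policy/AmazonEKS_CNI_Policy-2022092909143815010000000b \
  --set-as-default \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "ec2:AssignPrivateIpAddresses",
        "ec2:UnassignPrivateIpAddresses",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }]
  }'
```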
This also fixed our problem. Thanks a million for this hint!
@esidate Thanks! This fixed it for me as well.
@jayanthvn, nope, I missed doing that. I'll report back should the issue reoccur. I'm planning to upgrade to v1.12 tomorrow, so maybe I will recycle a few nodes before performing the upgrade to reproduce the issue.
I was able to fix this by downgrading the Helm chart from `1.2.0` to `1.1.21`.
Image: `amazon-k8s-cni:v1.11.3-eksbuild.1`, EKS: v1.21.14-eks.
I think it was because I was using `amazon-k8s-cni:v1.11.3-eksbuild.1` with chart `1.2.0`; that's why I was getting this error.
With Helm chart 1.2.0:
With Helm chart 1.1.21:
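Roughly what that downgrade looks like with the eks-charts repository (the release and namespace names are assumptions):

```sh
# Pin the aws-vpc-cni chart back to version 1.1.21.
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm upgrade --install aws-vpc-cni eks/aws-vpc-cni \
  --namespace kube-system \
  --version 1.1.21
```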
@jayanthvn Yeah they have! Anyway, I have managed to fix the problem by updating to v1.11.4-eksbuild.1 and using the `5.4.209-116.363.amzn2.x86_64` AMI version. So far we don't have this issue anymore.
We found that loading the ip_tables, iptable_nat, and iptable_mangle kernel modules fixes the issue:
modprobe ip_tables iptable_nat iptable_mangle
We are still trying to figure out why these modules were loaded by default in 8.4 and not in 8.5. Also still not sure why the same iptables commands work without these modules directly on the worker instance but not in the container.
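To make that survive a reboot on a systemd-based host like Rocky 8, the modules can also be registered in /etc/modules-load.d (the file name here is arbitrary):

```sh
# Load the modules now and list them for loading at every boot.
modprobe ip_tables iptable_nat iptable_mangle
cat <<'EOF' > /etc/modules-load.d/iptables.conf
ip_tables
iptable_nat
iptable_mangle
EOF
```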