amazon-vpc-cni-k8s: IPAMD fails to start

What happened: IPAMD fails to start with an iptables error. The aws-node pods fail to start, which prevents worker nodes from becoming Ready. This started after updating to Rocky Linux 8.5, which is based on RHEL 8.5.

/var/log/aws-routed-eni/ipamd.log

{"level":"error","ts":"2022-02-04T14:38:08.239Z","caller":"networkutils/network.go:385","msg":"ipt.NewChain error for chain [AWS-SNAT-CHAIN-0]: running [/usr/sbin/iptables -t nat -N AWS-SNAT-CHAIN-0 --wait]: exit status 3: iptables v1.8.4 (legacy): can't initialize iptables table `nat': Table does not exist (do you need to insmod?)\nPerhaps iptables or your kernel needs to be upgraded.\n"}

Pod logs (kubectl logs -n kube-system aws-node-9tqb6):

{"level":"info","ts":"2022-02-04T15:11:48.035Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-02-04T15:11:48.036Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-02-04T15:11:48.062Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-02-04T15:11:48.071Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-02-04T15:11:50.092Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:52.103Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:54.115Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:56.124Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}


What you expected to happen: ipamd starts normally.

How to reproduce it (as minimally and precisely as possible): Deploy an EKS cluster with an AMI based on Rocky Linux 8.5. In theory, any RHEL 8.5-based image could have this problem.

Anything else we need to know?: Running the iptables command from the ipamd log as root on the worker node works fine.
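A rough way to compare what the host and the aws-node container each see (the pod and container names below come from this report; the specific checks are only a suggestion, not an official diagnostic):

# On the worker node (as root): which iptables backend is in use, and are the nat modules loaded?
iptables --version                        # prints "(legacy)" or "(nf_tables)"
lsmod | grep -E 'ip_tables|iptable_nat'   # the legacy backend needs iptable_nat for the nat table

# Inside the aws-node container: does the bundled iptables agree with the host?
kubectl exec -n kube-system aws-node-9tqb6 -c aws-node -- iptables --version
kubectl exec -n kube-system aws-node-9tqb6 -c aws-node -- iptables -t nat -L -n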

Environment:

  • Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.15-eks-9c63c4", GitCommit:"9c63c4037a56f9cad887ee76d55142abd4155179", GitTreeState:"clean", BuildDate:"2021-10-20T00:21:03Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
  • CNI: 1.10.1
  • OS (e.g: cat /etc/os-release): NAME="Rocky Linux" VERSION="8.5 (Green Obsidian)" ID="rocky" ID_LIKE="rhel centos fedora" VERSION_ID="8.5" PLATFORM_ID="platform:el8" PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)" ANSI_COLOR="0;32" CPE_NAME="cpe:/o:rocky:rocky:8:GA" HOME_URL="https://rockylinux.org/" BUG_REPORT_URL="https://bugs.rockylinux.org/" ROCKY_SUPPORT_PRODUCT="Rocky Linux" ROCKY_SUPPORT_PRODUCT_VERSION="8"
  • Kernel (e.g. uname -a): Linux ip-10-2--xx-xxx.ec2.xxxxxxxx.com 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Wed Jan 19 17:53:40 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 25
  • Comments: 60 (13 by maintainers)

Most upvoted comments

I experienced the same issue. In our cluster, the kube-proxy version was too old. After updating it to a version that is compatible with the cluster version, the nodes started up fine.
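For reference, checking and bumping kube-proxy when it is installed as an EKS add-on can look roughly like this (cluster name, region, and versions are placeholders; a self-managed kube-proxy needs its daemonset image updated instead):

# Check the kube-proxy version currently running
kubectl get daemonset kube-proxy -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'

# List add-on versions compatible with the cluster version, then update
aws eks describe-addon-versions --addon-name kube-proxy --kubernetes-version 1.25
aws eks update-addon --cluster-name my-cluster --addon-name kube-proxy --addon-version v1.25.14-eksbuild.2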

Had the same issue while upgrading, but after looking at the troubleshooting guide and patching the daemonset with the following, aws-node came up as expected and without issues.

# New env vars introduced with 1.10.x
- op: add
  path: "/spec/template/spec/initContainers/0/env/-"
  value: {"name": "ENABLE_IPv6", "value": "false"}
- op: add
  path: "/spec/template/spec/containers/0/env/-"
  value: {"name": "ENABLE_IPv4", "value": "true"}
- op: add
  path: "/spec/template/spec/containers/0/env/-"
  value: {"name": "ENABLE_IPv6", "value": "false"}

It happened in my case because the aws-node daemonset was missing the permissions to manage IP addresses for nodes and pods. The daemonset uses the Kubernetes service account named aws-node. I solved it by creating an IAM role with the AmazonEKS_CNI_Policy attached and associating that role with the service account: add an annotation to the aws-node service account and restart the daemonset.

annotations:
  eks.amazonaws.com/role-arn: your-role-arn

As mentioned in some answers, it is not a good security practice to attach the AmazonEKS_CNI_Policy to the nodes directly; see https://aws.github.io/aws-eks-best-practices/networking/vpc-cni/#use-separate-iam-role-for-cni to learn more.
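A sketch of wiring that up with eksctl (cluster name and region are placeholders; the same can be done with plain IAM plus the annotation shown above):

eksctl create iamserviceaccount \
  --cluster my-cluster \
  --region us-east-1 \
  --namespace kube-system \
  --name aws-node \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy \
  --override-existing-serviceaccounts \
  --approve

kubectl -n kube-system rollout restart daemonset aws-node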

Hey all 👋🏼 please be aware that this failure mode also happens when a subnet's IPs are exhausted.

I just faced this and noticed I had mis-configured my worker groups to use a small subnet (/26) instead of a bigger one I intended to use (/18).
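A quick way to check for this is to look at AvailableIpAddressCount for the subnets the nodes and pods use (the cluster tag filter below is a placeholder; filter however fits your setup):

aws ec2 describe-subnets \
  --filters "Name=tag:kubernetes.io/cluster/my-cluster,Values=shared,owned" \
  --query 'Subnets[].{ID:SubnetId,CIDR:CidrBlock,Free:AvailableIpAddressCount}' \
  --output table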

For those coming here after upgrading EKS try re-applying the VPC CNI manifest file, for example: kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.3/config/master/aws-k8s-cni.yaml
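Before re-applying, it can be worth confirming which CNI version is currently installed so the manifest you apply matches what you intend; either of these checks works:

kubectl describe daemonset aws-node --namespace kube-system | grep amazon-k8s-cni: | cut -d : -f 3
kubectl get daemonset aws-node -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'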

We found an alternative way of fixing it by updating iptables inside the CNI container image:

FROM 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.10.1
RUN yum install iptables-nft -y
RUN cd /usr/sbin && rm iptables && ln -s xtables-nft-multi iptables
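A possible workflow with that Dockerfile, assuming a private registry of your own (registry, account ID, and tag below are placeholders):

docker build -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.10.1-nft .
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.10.1-nft
kubectl -n kube-system set image daemonset/aws-node aws-node=123456789012.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.10.1-nft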

My concern is that the direction of RHEL and downstream distros seems to be away from iptables-legacy and toward iptables-nft. Are there any plans to address this in the CNI container image?

I experienced the same issue. In our cluster, the kube-proxy version was too old. After updating it to a version that is compatible with the cluster version, the nodes started up fine.

Same here, this was from upgrading Kubernetes 1.21 -> 1.25 under AWS EKS, where aws-node failed to start.

kubectl logs of aws-node did not reveal much:

time="2023-10-06T20:07:03Z" level=info msg="Starting IPAM daemon... "
time="2023-10-06T20:07:03Z" level=info msg="Checking for IPAM connectivity... "

I had to log in to the aws-node pod manually (following https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md):

kubectl exec -it aws-node-vd5r8 -n kube-system -c aws-eks-nodeagent /bin/bash

Find the log file under /var/log/aws-routed-eni; it should be a file called ipamd*.log:

{"level":"error","ts":"2023-10-06T20:27:43.658Z","caller":"wait/wait.go:109","msg":"Unable to reach API Server, Get \"[https://10.100.0.1:443/version?timeout=5s](https://10.100.0.1/version?timeout=5s)\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2023-10-06T20:27:49.658Z","caller":"wait/wait.go:109","msg":"Unable to reach API Server, Get \"[https://10.100.0.1:443/version?timeout=5s](https://10.100.0.1/version?timeout=5s)\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2023-10-06T20:27:55.657Z","caller":"wait/wait.go:109","msg":"Unable to reach API Server, Get \"[https://10.100.0.1:443/version?timeout=5s](https://10.100.0.1/version?timeout=5s)\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2023-10-06T20:28:01.658Z","caller":"wait/wait.go:109","msg":"Unable to reach API Server, Get \"[https://10.100.0.1:443/version?timeout=5s](https://10.100.0.1/version?timeout=5s)\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}

The only thing I had not upgraded in the cluster was kube-proxy, so I assumed that was the issue, and it was. Glad someone else had the same experience.

Make sure to go through each of these… I did not have add-ons for anything, so I had to go through the self-managed route, which sucks. I hope there is a way to go from self-managed to add-ons?

@ermiaqasemi From this tutorial I chose to attach the AmazonEKS_CNI_Policy to the aws-node service account and I was getting the error.

I decided to try simply attaching it to the AmazonEKSNodeRole, which apparently is the less recommended way to do it, but it works.
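For anyone taking that route, attaching the managed policy to the node role is a one-liner (the role name is whatever your node role is called; the IRSA approach above remains the recommended one):

aws iam attach-role-policy \
  --role-name AmazonEKSNodeRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy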

For me, the issue was policy/AmazonEKS_CNI_Policy-2022092909143815010000000b. My policy only allowed IPv6, like below.

{
    "Statement": [
        {
            "Action": [
                "ec2:DescribeTags",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:AssignIpv6Addresses"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "IPV6"
        },
        {
            "Action": "ec2:CreateTags",
            "Effect": "Allow",
            "Resource": "arn:aws:ec2:*:*:network-interface/*",
            "Sid": "CreateTags"
        }
    ],
    "Version": "2012-10-17"
}

I changed the policy like below:

{
    "Statement": [
        {
            "Action": [
                "ec2:UnassignPrivateIpAddresses",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeTags",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:DeleteNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:AttachNetworkInterface",
                "ec2:AssignPrivateIpAddresses"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "IPV4"
        },
        {
            "Action": [
                "ec2:DescribeTags",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:AssignIpv6Addresses"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "IPV6"
        },
        {
            "Action": "ec2:CreateTags",
            "Effect": "Allow",
            "Resource": "arn:aws:ec2:*:*:network-interface/*",
            "Sid": "CreateTags"
        }
    ],
    "Version": "2012-10-17"
}

and it works! 😅
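If the policy is not managed by Terraform or CloudFormation, one way to push an updated document like the one above is to create a new default policy version (policy ARN, account ID, and file name are placeholders):

aws iam create-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/AmazonEKS_CNI_Policy-2022092909143815010000000b \
  --policy-document file://cni-policy.json \
  --set-as-default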

I experienced the same issue. In our cluster, the kube-proxy version was too old. After updating it to a version that is compatible with the cluster version, the nodes started up fine.

This also fixed our problem, thanks a million for this hint!

For those coming here after upgrading EKS try re-applying the VPC CNI manifest file, for example: kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.3/config/master/aws-k8s-cni.yaml

@esidate Thanks! This fixed it for me as well.

@jayanthvn, nope, I missed doing that. I'll report back should the issue reoccur. I'm planning to upgrade to v1.12 tomorrow, so maybe I will recycle a few nodes before performing the upgrade to reproduce the issue.

I was able to fix this by downgrading the helm chart to 1.1.21 from 1.2.0

Image: amazon-k8s-cni:v1.11.3-eksbuild.1, EKS: v1.21.14-eks

I think the error was due to using amazon-k8s-cni:v1.11.3-eksbuild.1 with chart 1.2.0 (the downgrade commands are sketched after the logs below).

With helm chart 1.2.0:

user@DESKTOP:~$ k logs -f aws-node-qjnb2
Defaulted container "aws-node" out of: aws-node, aws-vpc-cni-init (init)
{"level":"info","ts":"2022-11-15T05:02:53.761Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-11-15T05:02:53.764Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-11-15T05:02:53.794Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-11-15T05:02:53.798Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-11-15T05:02:55.816Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:02:57.828Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:02:59.840Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:03:01.853Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:03:03.866Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:03:05.879Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:03:07.892Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:03:09.906Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:03:11.918Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:03:13.931Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:03:15.943Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

With helm chart 1.1.21:

user@DESKTOP:~$ k logs -f aws-node-7r7v2
Defaulted container "aws-node" out of: aws-node, aws-vpc-cni-init (init)
{"level":"info","ts":"2022-11-15T05:41:48.135Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-11-15T05:41:48.138Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-11-15T05:41:48.169Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-11-15T05:41:48.173Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-11-15T05:41:50.186Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:41:52.198Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-15T05:41:52.234Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2022-11-15T05:41:52.251Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2022-11-15T05:41:52.253Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

@jayanthvn Yeah they have! Anyway, I managed to fix the problem by updating to v1.11.4-eksbuild.1 and using the 5.4.209-116.363.amzn2.x86_64 AMI. So far we don't have this issue anymore.

We found that loading the ip_tables, iptable_nat, and iptable_mangle kernel modules fixes the issue: modprobe ip_tables iptable_nat iptable_mangle

Still trying to figure out why these modules were loaded by default in 8.4 and not in 8.5. Also still not sure why the same iptables commands work without these modules directly on the worker instance but not in the container.
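To make the modprobe fix survive reboots, a sketch using systemd's modules-load.d (the file name is arbitrary; run as root on the worker node or bake it into the AMI):

# load the modules now and make them load on boot
modprobe ip_tables iptable_nat iptable_mangle
printf 'ip_tables\niptable_nat\niptable_mangle\n' > /etc/modules-load.d/iptables.conf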