amazon-vpc-cni-k8s: IPAM connectivity failed when upgrading from v1.5.5 to v1.6.0

I updated my EKS cluster to 1.15.10 and that worked. Then I tried to update amazon-vpc-cni-k8s from v1.5.5 to v1.6.0 on my two test nodes. Since aws-node is a DaemonSet, one pod came up running while the other kept failing with the following error:

kubectl logs -f pod/aws-node-cjqwm -nkube-system
starting IPAM daemon in background ... ok.
checking for IPAM connectivity ...  failed.
timed out waiting for IPAM daemon to start.

I deleted the pod, but it keeps failing with the same error:

kubectl get po --all-namespaces

NAMESPACE     NAME                             READY   STATUS    RESTARTS   AGE
kube-system   aws-node-22mnl                   1/1     Running   0          15m
kube-system   aws-node-h6nrx                   0/1     Running   3          3m9s

More details:

kubectl describe po aws-node-h6nrx -nkube-system

Events:
  Type     Reason     Age                    From                                                   Message
  ----     ------     ----                   ----                                                   -------
  Normal   Scheduled  4m49s                  default-scheduler                                      Successfully assigned kube-system/aws-node-h6nrx to ip-10-1-46-183.eu-central-1.compute.internal
  Warning  Unhealthy  3m33s                  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Readiness probe errored: rpc error: code = Unknown desc = container not running (c542f67fbf22592a6840faa98cd3e9f1c774efeead2a6068319b0488570a903f)
  Warning  Unhealthy  2m39s                  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Liveness probe failed: timeout: failed to connect service ":50051" within 1s
  Normal   Pulling    2m18s (x4 over 4m48s)  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0"
  Normal   Pulled     2m17s (x4 over 4m47s)  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0"
  Normal   Created    2m17s (x4 over 4m47s)  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Created container aws-node
  Normal   Started    2m17s (x4 over 4m47s)  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Started container aws-node
  Warning  Unhealthy  100s                   kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Liveness probe errored: rpc error: code = Unknown desc = container not running (a51a934a7d0867d500c7f9533d995ae7605ba7f80ed19186a513dd2fe62b0d88)
  Warning  BackOff    90s (x6 over 3m32s)    kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Back-off restarting failed container

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 21 (11 by maintainers)

Most upvoted comments

I figured out my issue; hopefully this will help someone else who finds this via Google. The aws-node service account was using an IAM role for service accounts to grant access to the EC2 ENI APIs (as in https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-cni-walkthrough.html) instead of the node role having the AmazonEKS_CNI_Policy attached.

Upgrading aws-node via eksctl overwrote the serviceaccount definition and removed the role annotation.

I fixed this by removing and re-adding the iamserviceaccount using eksctl:

eksctl delete iamserviceaccount -f eksctl-cluster.yml --include kube-system/aws-node --approve
eksctl create iamserviceaccount -f eksctl-cluster.yml --approve
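
To confirm you are hitting the same thing, and that the role annotation is back after recreating the service account, a quick check (illustrative, not part of the original fix) is:

kubectl get sa aws-node -n kube-system -o yaml | grep eks.amazonaws.com/role-arn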

I have reported this to eksctl here

Closing this issue since it has turned into a bucket of multiple upgrade issues. The things we have seen so far:

  • IAM for service accounts issue with eksctl
  • kube-proxy on Kubernetes 1.16 no longer supports the --resource-container flag
  • Subnet out of IP addresses

Please open a new issue if you find any new problem.

This was very hard to track down, but @mogren’s comment was what solved it for me. My cluster was created ~2 years ago and kube-proxy was still using the --resource-container flag. After the 1.16 upgrade I started seeing this “cni config uninitialized” error and all the nodes got stuck in the NotReady state.

I tried to downgrade the CNI plugin back to 1.5.x, but that also didn’t solve the problem. I had to manually edit my kube-proxy daemonset ($ kubectl edit ds kube-proxy -n kube-system) to remove the flag.
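
For reference, the flag in question appears in the kube-proxy container command roughly like this (an illustrative fragment only; the other arguments in your manifest will differ):

containers:
- name: kube-proxy
  command:
  - kube-proxy
  - --v=2                       # illustrative; your other flags will vary
  - --resource-container=""     # no longer supported on Kubernetes 1.16: delete this line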

I think it’d be great to mention that in the upgrade guide.

@spacebarley Hi! Thanks for the logs, they made it clear that you ran into another issue:

{
  "level": "error",
  "ts": "2020-05-18T08:52:22.632Z",
  "caller": "aws-k8s-agent/main.go:30",
  "msg": "Initialization failure: failed to allocate one IP addresses on ENI eni-0aaaafcedcb7b0940e,
          err: allocate IP address: failed to allocate a private IP address: 
          InsufficientFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request.
          status code: 400, 
          request id: 0xxxxxx-a5e4-4a47-b76a-0360e364d5f1"
}

The subnet is out of IPs. First, since you were running the v1.5.x CNI earlier, check for leaked ENIs in your account. They will be marked as Available (blue dot) in the AWS Console, and have a tag, node.k8s.amazonaws.com/instance_id, that shows what instance they once belonged to.
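
If you want to check from the command line, something along these lines should list unattached ENIs that still carry the CNI tag, and show how many free IPs a subnet has left (a sketch only; the subnet ID is a placeholder, and you may need to pass --region):

aws ec2 describe-network-interfaces \
  --filters Name=status,Values=available Name=tag-key,Values=node.k8s.amazonaws.com/instance_id \
  --query 'NetworkInterfaces[].NetworkInterfaceId'

aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].AvailableIpAddressCount'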

For kube-proxy on 1.16, make sure that --resource-container is not in the spec. See Kubernetes 1.16 for details.
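
A quick way to check is to grep the rendered DaemonSet for the flag; no output means it is already gone:

kubectl get ds kube-proxy -n kube-system -o yaml | grep -- --resource-container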

I’ve faced similar issues after upgrading to EKS 1.16, the VPC CNI plugin to 1.6.1, and kube-proxy to the latest 1.16.8.

  • Nodes would remain in a NotReady state
  • Describing them would also highlight: KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
  • kube-proxy pods would also remain in a CrashLoopBackOff state.

After troubleshooting this with AWS Support, rolling back to our previous EKS 1.15 configuration, i.e. AWS VPC CNI plugin 1.5.7 and kube-proxy 1.15.11, worked for me on EKS 1.16.

Please note that terminating your existing EC2 instances might (or will?) be needed in order to get back to a running state.

Of the 1.16 upgrade "prerequisites", the only mandatory one, if you were already on 1.15, is to make sure all your YAML manifests are converted to the stable API versions (e.g. apps/v1); the deprecated beta APIs are no longer served. https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html#1-16-prequisites
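
For example (a minimal, hypothetical manifest fragment, not one from this thread), a DaemonSet still on a beta API has to move to apps/v1, which also makes spec.selector mandatory:

# before: no longer served by Kubernetes 1.16
apiVersion: extensions/v1beta1
kind: DaemonSet

# after
apiVersion: apps/v1
kind: DaemonSet
spec:
  selector:                # required by apps/v1
    matchLabels:
      app: example         # hypothetical label; must match the pod template labels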

You might want to hold off any other changes for now until AWS further communicates on this issue.

Thanks for reporting the issue @njgibbon! Did you run the aws-cni-support.sh script on the node to gather the log data? It would be great if we could see why the pod failed to start. The logs should be in /var/log/aws-routed-eni/ on the worker node. We have seen issues related to kube-proxy before.
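
If it is easier, the raw ipamd output on the node is readable directly as well (assuming the default log location; exact file names differ slightly between CNI versions):

ls /var/log/aws-routed-eni/
sudo tail -n 100 /var/log/aws-routed-eni/ipamd.log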

Also, if rolling back, would v1.5.7 be an option?

Hello, as with the comment above, we are also seeing the same issue updating vpc-cni from v1.5.5 to v1.6.1.

We have 4 clusters (which are theoretically all configured the same way).

All on v1.15.11-eks-af3caf. All worker nodes on the same AMI: 1.15.10-20200228.

DNS and kube-proxy versions are up to date across all 4 clusters, in line with the table in the official AWS update guide: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html

CNI VPC plugin has been updated successfully across 3 clusters.

In the last cluster the DaemonSet rolled out successfully to 6/7 nodes.

On the last node the pod crash-looped due to failing health checks. I bounced it and it crash-looped again. I consistently get the same messages in the pod logs that others have pointed to:

Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... failed.
Timed out waiting for IPAM daemon to start:

There are other workloads already scheduled on this node.

This meant I needed to roll back to v1.5.5 in this cluster only.
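
For anyone else needing the same stop-gap, pointing the DaemonSet back at the old image looks roughly like this (a sketch, assuming the same ECR repository as the v1.6.0 image above; re-applying the full v1.5.5 manifest is the cleaner rollback, since the DaemonSet spec itself changed between releases):

kubectl -n kube-system set image daemonset/aws-node \
  aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.5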

I'm looking at resources and attempting to triage, and I may raise this with AWS Support separately, but I'm adding it here to give more information on this issue occurring in general and to keep the issue fresh.

Hi @hahasheminejad! Is there any chance you might be able to run the aws-cni-support.sh script before and after the upgrade and send the results to one of us? Either mogren@ or jaypipes@ amazon…