amazon-vpc-cni-k8s: CNI does not wait for ENI and IP to be available before reporting ready when setting MINIMUM_IP_TARGET and WARM_IP_TARGET
What happened:
After setting MINIMUM_IP_TARGET and WARM_IP_TARGET, pods scheduled on a new node are stuck on failed to assign an IP address to container for a few minutes before they can be created.
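For reference, these overrides are set as environment variables on the aws-node container in the aws-node DaemonSet. The values below are illustrative only, not the exact ones from my cluster:

```yaml
# Illustrative fragment of the aws-node DaemonSet container spec.
# Values are examples; the problem reproduces whenever these overrides are set.
env:
  - name: MINIMUM_IP_TARGET
    value: "10"
  - name: WARM_IP_TARGET
    value: "2"
```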
From my observation, after setting these overrides, aws-node does not wait for its datastore to be populated before reporting ready. Pods are therefore scheduled onto these nodes and go into a crashing state because no IP can be assigned.
Pod log
Normal SuccessfulAttachVolume 3m16s attachdetach-controller AttachVolume.Attach succeeded for volume "XXXXXXXX"
Warning FailedCreatePodSandBox 3m12s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "22cbd7067d76c2ed509436f65384e1df59ec8949e7910d7bbbfb32de8a511f89": add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 3m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "81a9fde8b5ea0da28959e8c08e08dc2ffcbe700c7f3067afd273e9ac67b37588": add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 2m48s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a94f47242b45d3610a9bf25ee17de09d8fb6f77dcf1754deac13395020b1082f": add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 2m37s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a6f2f67496e2c75cdc89963afc5d4f0b47fe3d6988d0356b4cd0e0ffb641400c": add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 2m24s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "de2c20851923cc290e187ade06472c2f77595c5d55f7ee7522603fb55e9ee943": add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 2m10s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4f3e15dbff1e9f6e2ad7055a61b75724d9ccd60850e05fe811966ca1cae65529": add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 115s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a545894e2066ed715d5b3667c38bf9f6ecefd45c67e79da2e8016929e84c66a0": add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 100s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9e76dd6f08280e3f246e1f366b8b0170abc345e814754797cea56e523e2f14c1": add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 88s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5c6a3409855d54fd5f296e0793eafc84202a4e825a2cea86784b54bfb3220231": add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 75s kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0be78bc1e43e44a0f14a503c9cfbb0fdd9570d5aab0a6db9244e129c3ef294a9": add cmd: failed to assign an IP address to container
Normal Pulling 62s kubelet Pulling image "XXXXXXXXXXX"
I have been monitoring the ready state of the pod and can see that it is declared ready before ipamd has successfully added the ENI and its IPs to the datastore.
The pod status change
Thu Nov 4 11:43:31 AEDT 2021
aws-node-xnpcf 0/1 Running 0 69s 10.1.56.83 ip-10-1-56-83.ap-southeast-2.compute.internal <none> <none>
Thu Nov 4 11:43:32 AEDT 2021
aws-node-xnpcf 0/1 Running 0 70s 10.1.56.83 ip-10-1-56-83.ap-southeast-2.compute.internal <none> <none>
Thu Nov 4 11:43:33 AEDT 2021
aws-node-xnpcf 1/1 Running 0 72s 10.1.56.83 ip-10-1-56-83.ap-southeast-2.compute.internal <none> <none>
Thu Nov 4 11:43:35 AEDT 2021
aws-node-xnpcf 1/1 Running 0 73s 10.1.56.83 ip-10-1-56-83.ap-southeast-2.compute.internal <none> <none>
Readiness
date; k describe po aws-node-xnpcf | ag Ready:
Thu Nov 4 11:43:43 AEDT 2021
Ready: True
Ready: True
The node status change
ip-10-1-56-83.ap-southeast-2.compute.internal NotReady <none> 38s v1.20.11
Thu Nov 4 11:43:01 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal NotReady <none> 39s v1.20.11
Thu Nov 4 11:43:02 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal Ready <none> 41s v1.20.11
Thu Nov 4 11:43:03 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal Ready <none> 42s v1.20.11
Inside ipamd, around the time the pod and node are declared ready, I can see pods trying to be scheduled and assigned an IP, but there is nothing in the datastore yet:
{"level":"info","ts":"2021-11-04T00:43:35.680Z","caller":"rpc/rpc.pb.go:501","msg":"Received AddNetwork for NS /var/run/netns/cni-42c1a3c7-8c69-e278-754d-efb63b6b33ea, Sandbox b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901, ifname eth0"}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"rpc/rpc.pb.go:501","msg":"AddNetworkRequest: ClientVersion:\"v1.9.0\" K8S_POD_NAME:\"xxxxxxxxxxx\" K8S_POD_NAMESPACE:\"xxxxxxxx\" K8S_POD_INFRA_CONTAINER_ID:\"b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901\" ContainerID:\"b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-42c1a3c7-8c69-e278-754d-efb63b6b33ea\" "}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignPodIPv4Address: ENI eni-0f7e7018aaead6bc1 does not have available addresses"}
{"level":"error","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"DataStore has no available IP/Prefix addresses"}
It is not until a while later that ipamd successfully populates the datastore:
{"level":"info","ts":"2021-11-04T00:45:49.425Z","caller":"ipamd/ipamd.go:917","msg":"Added ENI(eni-06129a67d2970e562)'s IP/Prefix 10.7.31.56/32 to datastore"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:904","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, totalprefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:905","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, totalprefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
During this window between aws-node reporting ready and the ipamd datastore actually being populated with IPs, pods are scheduled onto the node and fail, which is undesirable.
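The gating behaviour I would expect can be sketched in Go (the language the CNI is written in). The names here (datastoreStats, isReady) are hypothetical and not the actual ipamd API; the point is only that the readiness check should also require at least one available IP in the datastore before aws-node reports ready:

```go
package main

import "fmt"

// datastoreStats is a hypothetical stand-in for ipamd's datastore pool stats.
type datastoreStats struct {
	totalIPs    int // IPs attached to ENIs and registered in the datastore
	assignedIPs int // IPs already handed out to pods
}

// isReady sketches the readiness check I would expect: aws-node should not
// report ready until the datastore holds at least one available IP, even
// when WARM_IP_TARGET / MINIMUM_IP_TARGET overrides are in use.
func isReady(s datastoreStats) bool {
	return s.totalIPs-s.assignedIPs > 0
}

func main() {
	// At 00:43:35 the log showed "total: 0, assigned 0" -> should NOT be ready.
	fmt.Println(isReady(datastoreStats{totalIPs: 0, assignedIPs: 0}))
	// At 00:45:49 the log showed "total: 14, used: 0" -> ready.
	fmt.Println(isReady(datastoreStats{totalIPs: 14, assignedIPs: 0}))
}
```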
What you expected to happen:
I expected behaviour similar to that observed with the default WARM_ENI_TARGET=1, where the pod is not considered ready until the datastore is populated with the attached ENI's IPs.
The following logs are from using the default options, with no MINIMUM_IP_TARGET or WARM_IP_TARGET overrides. In this case aws-cni works as expected.
Pod status
Thu Nov 4 15:55:28 AEDT 2021
aws-node-7xwp2 0/1 Running 0 69s 10.1.56.123 ip-10-1-56-123.ap-southeast-2.compute.internal <none> <none>
Thu Nov 4 15:55:29 AEDT 2021
aws-node-7xwp2 1/1 Running 0 70s 10.1.56.123 ip-10-1-56-123.ap-southeast-2.compute.internal <none> <none>
Thu Nov 4 15:55:31 AEDT 2021
aws-node-7xwp2 1/1 Running 0 72s 10.1.56.123 ip-10-1-56-123.ap-southeast-2.compute.internal <none> <none>
Ready status
Thu Nov 4 15:55:26 AEDT 2021
Ready: True
Ready: False
Thu Nov 4 15:55:27 AEDT 2021
Ready: True
Ready: False
Thu Nov 4 15:55:29 AEDT 2021
Ready: True
Ready: True
I can see in the ipamd log that the datastore successfully populates itself with the ENI's IPs around this time:
{"level":"info","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:917","msg":"Added ENI(eni-035e6725b4d13e122)'s IP/Prefix 10.7.30.221/32 to datastore"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:904","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, total prefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:905","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, total prefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
Environment:
- Kubernetes version (use kubectl version): v1.20.11
- CNI Version: v1.9.0
- OS (e.g. cat /etc/os-release): Ubuntu 20.04.3 LTS (Focal Fossa)
- Kernel (e.g. uname -a): Linux xxxxxxxx 5.11.0-1019-aws #20~20.04.1-Ubuntu SMP Tue Sep 21 10:40:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
About this issue
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 23 (7 by maintainers)
I am waiting for the code review.
Closing as fixed by https://github.com/aws/amazon-vpc-cni-k8s/pull/2354. This will ship in the next VPC CNI release.
We had discussed a few changes for the PR but couldn't make progress because of other commitments. I will pick it up next week for changes/review and track it for the 1.11.5 or 1.12.0 release. I will provide the release timelines once the PR is merged.