amazon-vpc-cni-k8s: CNI does not wait for ENI and IP to be available before reporting ready when setting MINIMUM_IP_TARGET and WARM_IP_TARGET

What happened:

After setting MINIMUM_IP_TARGET and WARM_IP_TARGET, pods scheduled on a new node are stuck on "failed to assign an IP address to container" for a few minutes before they can be created.

From my observation, after setting these overrides, aws-node does not wait for its datastore to be populated before reporting ready. Pods are then scheduled onto these nodes and go into a crashing state because no IP can be assigned.
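For context, the overrides can be applied to the aws-node DaemonSet roughly as follows (the values shown here are illustrative examples, not necessarily the ones used when the issue was observed):

```shell
# Illustrative: set the IP warm-pool overrides on the aws-node DaemonSet.
# The target values below are examples only.
kubectl -n kube-system set env daemonset/aws-node \
  MINIMUM_IP_TARGET=10 \
  WARM_IP_TARGET=2
```

With these set, ipamd sizes the warm pool by IP count instead of by whole ENIs (the default WARM_ENI_TARGET=1 behaviour).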

Pod events

Normal   SuccessfulAttachVolume  3m16s              attachdetach-controller  AttachVolume.Attach succeeded for volume "XXXXXXXX"
  Warning  FailedCreatePodSandBox  3m12s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "22cbd7067d76c2ed509436f65384e1df59ec8949e7910d7bbbfb32de8a511f89": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  3m                 kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "81a9fde8b5ea0da28959e8c08e08dc2ffcbe700c7f3067afd273e9ac67b37588": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m48s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a94f47242b45d3610a9bf25ee17de09d8fb6f77dcf1754deac13395020b1082f": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m37s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a6f2f67496e2c75cdc89963afc5d4f0b47fe3d6988d0356b4cd0e0ffb641400c": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m24s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "de2c20851923cc290e187ade06472c2f77595c5d55f7ee7522603fb55e9ee943": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m10s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4f3e15dbff1e9f6e2ad7055a61b75724d9ccd60850e05fe811966ca1cae65529": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  115s               kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a545894e2066ed715d5b3667c38bf9f6ecefd45c67e79da2e8016929e84c66a0": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  100s               kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9e76dd6f08280e3f246e1f366b8b0170abc345e814754797cea56e523e2f14c1": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  88s                kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5c6a3409855d54fd5f296e0793eafc84202a4e825a2cea86784b54bfb3220231": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  75s                kubelet                  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0be78bc1e43e44a0f14a503c9cfbb0fdd9570d5aab0a6db9244e129c3ef294a9": add cmd: failed to assign an IP address to container
  Normal   Pulling                 62s                kubelet                  Pulling image "XXXXXXXXXXX"

I have been monitoring the ready state of the pod and can see that it is declared ready before ipamd has successfully added the ENI and its IPs to the datastore.

The pod status changes

Thu Nov  4 11:43:31 AEDT 2021
aws-node-xnpcf                                                          0/1   Running             0     69s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>
Thu Nov  4 11:43:32 AEDT 2021
aws-node-xnpcf                                                          0/1   Running             0     70s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>
Thu Nov  4 11:43:33 AEDT 2021
aws-node-xnpcf                                                          1/1   Running             0     72s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>
Thu Nov  4 11:43:35 AEDT 2021
aws-node-xnpcf                                                          1/1   Running             0     73s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>

Readiness

date; k describe po aws-node-xnpcf | ag Ready:
Thu Nov  4 11:43:43 AEDT 2021
    Ready:          True
    Ready:          True

The node status changes

ip-10-1-56-83.ap-southeast-2.compute.internal   NotReady   <none>   38s   v1.20.11
Thu Nov  4 11:43:01 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal   NotReady   <none>   39s   v1.20.11
Thu Nov  4 11:43:02 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal   Ready   <none>   41s   v1.20.11
Thu Nov  4 11:43:03 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal   Ready   <none>   42s   v1.20.11

In the ipamd logs around the time the pod and node are declared ready, I can see pods being scheduled and requesting IPs, but there is nothing in the datastore yet:

{"level":"info","ts":"2021-11-04T00:43:35.680Z","caller":"rpc/rpc.pb.go:501","msg":"Received AddNetwork for NS /var/run/netns/cni-42c1a3c7-8c69-e278-754d-efb63b6b33ea, Sandbox b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901, ifname eth0"}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"rpc/rpc.pb.go:501","msg":"AddNetworkRequest: ClientVersion:\"v1.9.0\" K8S_POD_NAME:\"xxxxxxxxxxx\" K8S_POD_NAMESPACE:\"xxxxxxxx\" K8S_POD_INFRA_CONTAINER_ID:\"b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901\" ContainerID:\"b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-42c1a3c7-8c69-e278-754d-efb63b6b33ea\" "}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignPodIPv4Address: ENI eni-0f7e7018aaead6bc1 does not have available addresses"}
{"level":"error","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"DataStore has no available IP/Prefix addresses"}

It is not until a while later that ipamd successfully populates the datastore:

{"level":"info","ts":"2021-11-04T00:45:49.425Z","caller":"ipamd/ipamd.go:917","msg":"Added ENI(eni-06129a67d2970e562)'s IP/Prefix 10.7.31.56/32 to datastore"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:904","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, totalprefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:905","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, totalprefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}

During the window between aws-node reporting ready and the ipamd datastore actually being populated with IPs, pods are scheduled onto the node and fail, which is undesirable.
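The behaviour being asked for can be sketched as a simple gate: aws-node should only report ready once the datastore holds at least the configured minimum number of IPs. This is a minimal illustrative sketch, not the actual ipamd implementation; the function names and the default of 1 are assumptions for the example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// datastoreReady sketches the desired readiness gate: the node is only
// considered ready once the datastore holds at least the minimum IP target.
// (Hypothetical helper; not the real ipamd code.)
func datastoreReady(totalIPs, minimumIPTarget int) bool {
	return totalIPs >= minimumIPTarget
}

// minimumIPTargetFromEnv reads MINIMUM_IP_TARGET, falling back to 1 so that
// a node with an empty datastore is never reported ready.
func minimumIPTargetFromEnv() int {
	v, err := strconv.Atoi(os.Getenv("MINIMUM_IP_TARGET"))
	if err != nil || v < 1 {
		return 1
	}
	return v
}

func main() {
	os.Setenv("MINIMUM_IP_TARGET", "10")
	target := minimumIPTargetFromEnv()
	fmt.Println(datastoreReady(0, target))  // empty datastore: not ready
	fmt.Println(datastoreReady(14, target)) // 14 IPs attached: ready
}
```

With a gate like this, the readiness flip at 11:43:33 above would instead be deferred until the "Added ENI ... to datastore" log at 00:45:49, closing the window in which sandboxes fail.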

What you expected to happen:

I expected behaviour similar to the default WARM_ENI_TARGET=1 case, where the pod is not considered ready until the datastore is populated with the attached ENI's IPs.

The following logs are from using the default options, with no MINIMUM_IP_TARGET or WARM_IP_TARGET overrides. In this case aws-cni works as expected.

Pod status

Thu Nov  4 15:55:28 AEDT 2021
aws-node-7xwp2                                                          0/1   Running             0     69s     10.1.56.123   ip-10-1-56-123.ap-southeast-2.compute.internal   <none>   <none>
Thu Nov  4 15:55:29 AEDT 2021
aws-node-7xwp2                                                          1/1   Running             0     70s     10.1.56.123   ip-10-1-56-123.ap-southeast-2.compute.internal   <none>   <none>
Thu Nov  4 15:55:31 AEDT 2021
aws-node-7xwp2                                                          1/1   Running             0     72s     10.1.56.123   ip-10-1-56-123.ap-southeast-2.compute.internal   <none>   <none>

Ready status

Thu Nov  4 15:55:26 AEDT 2021
    Ready:          True
    Ready:          False
Thu Nov  4 15:55:27 AEDT 2021
    Ready:          True
    Ready:          False
Thu Nov  4 15:55:29 AEDT 2021
    Ready:          True
    Ready:          True

I can see in the ipamd log that the datastore successfully populates itself with the ENI's IPs around this time:

{"level":"info","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:917","msg":"Added ENI(eni-035e6725b4d13e122)'s IP/Prefix 10.7.30.221/32 to datastore"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:904","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, total prefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:905","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, total prefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}

Environment:

  • Kubernetes version (use kubectl version): v1.20.11
  • CNI Version: v1.9.0
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.3 LTS (Focal Fossa)
  • Kernel (e.g. uname -a): Linux xxxxxxxx 5.11.0-1019-aws #20~20.04.1-Ubuntu SMP Tue Sep 21 10:40:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 23 (7 by maintainers)

Most upvoted comments

I am waiting for the code review.

Closing as fixed by https://github.com/aws/amazon-vpc-cni-k8s/pull/2354. This will ship in the next VPC CNI release.

We had discussed a few changes for the PR but couldn't make progress because of other commitments. I will take it up next week for changes/review and track it for the 1.11.5 or 1.12.0 release. Will provide the release timelines once the PR is merged.