amazon-vpc-cni-k8s: Kubelet will start scheduling pods before amazon-vpc-cni-k8s daemon set is fully functional causing workloads to error with "failed to assign an IP address to container"
Issue description
Pods are failing to create with a sandbox error. We are using the latest CNI (1.3.0) and there are sufficient IPs available for pods.
Analysis
I went through the Kubelet and aws-node logs. This happens as soon as a worker node joins the cluster while there are pending workloads waiting to be scheduled the moment the node goes to Ready state. Kubelet fails to start pods because no IP address is available, which tallies with the ipamd daemon not yet having any IP addresses in its warm pool. This is what I found so far: as soon as the aws-node CNI was initialized at 2019-01-29T01:57:01Z, Kubelet handed pod networking over to the aws-node CNI to assign private IP addresses. However, the aws-node CNI took over 7 seconds to add IP addresses to its warm pool, which happened around 2019-01-29T01:57:08Z. Pods created during those 7 seconds failed because they could not get an IP address, with the error “failed to assign an IP address to container”.
Below are the last occurrences of the “failed to assign an IP address to container” error.
Kubelet logs
Jan 29 01:57:08 ip-10-77-10-10.ec2.internal kubelet[5122]: E0129 01:57:08.022505 5122 cni.go:260] Error adding network: add cmd: failed to assign an IP address to container
Jan 29 01:57:08 ip-10-77-10-10.ec2.internal kubelet[5122]: E0129 01:57:08.022519 5122 cni.go:228] Error while adding to cni network: add cmd: failed to assign an IP address to container
Jan 29 01:57:08 ip-10-77-10-10.ec2.internal kubelet[5122]: E0129 01:57:08.377159 5122 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "b849b621ba3c224c1b42912252ded857d2a798d031d8f81065d2b1736d07771a" network for pod "workbench-rules-cc7764ccb-stjc8": NetworkPlugin cni failed to set up pod "workbench-rules-cc7764ccb-stjc8_placeable-regression" network: add cmd: failed to assign an IP address to container
Jan 29 01:57:08 ip-10-77-10-10.ec2.internal kubelet[5122]: E0129 01:57:08.377186 5122 kuberuntime_sandbox.go:56] CreatePodSandbox for pod "workbench-rules-cc7764ccb-stjc8_placeable-regression(2d3b0c4c-2369-11e9-910a-02b33f273b20)" failed: rpc error: code = Unknown desc = failed to set up sandbox container "b849b621ba3c224c1b42912252ded857d2a798d031d8f81065d2b1736d07771a" network for pod "workbench-rules-cc7764ccb-stjc8": NetworkPlugin cni failed to set up pod "workbench-rules-cc7764ccb-stjc8_placeable-regression" network: add cmd: failed to assign an IP address to container
Jan 29 01:57:08 ip-10-77-10-10.ec2.internal kubelet[5122]: E0129 01:57:08.377197 5122 kuberuntime_manager.go:646] createPodSandbox for pod "workbench-rules-cc7764ccb-stjc8_placeable-regression(2d3b0c4c-2369-11e9-910a-02b33f273b20)" failed: rpc error: code = Unknown desc = failed to set up sandbox container "b849b621ba3c224c1b42912252ded857d2a798d031d8f81065d2b1736d07771a" network for pod "workbench-rules-cc7764ccb-stjc8": NetworkPlugin cni failed to set up pod "workbench-rules-cc7764ccb-stjc8_placeable-regression" network: add cmd: failed to assign an IP address to container
Jan 29 01:57:08 ip-10-77-10-10.ec2.internal kubelet[5122]: E0129 01:57:08.377238 5122 pod_workers.go:186] Error syncing pod 2d3b0c4c-2369-11e9-910a-02b33f273b20 ("workbench-rules-cc7764ccb-stjc8_placeable-regression(2d3b0c4c-2369-11e9-910a-02b33f273b20)"), skipping: failed to "CreatePodSandbox" for "workbench-rules-cc7764ccb-stjc8_placeable-regression(2d3b0c4c-2369-11e9-910a-02b33f273b20)" with CreatePodSandboxError: "CreatePodSandbox for pod \"workbench-rules-cc7764ccb-stjc8_placeable-regression(2d3b0c4c-2369-11e9-910a-02b33f273b20)\" failed: rpc error: code = Unknown desc = failed to set up sandbox container \"b849b621ba3c224c1b42912252ded857d2a798d031d8f81065d2b1736d07771a\" network for pod \"workbench-rules-cc7764ccb-stjc8\": NetworkPlugin cni failed to set up pod \"workbench-rules-cc7764ccb-stjc8_placeable-regression\" network: add cmd: failed to assign an IP address to container"
Jan 29 01:57:08 ip-10-77-10-10.ec2.internal kubelet[5122]: I0129 01:57:08.377298 5122 server.go:460] Event(v1.ObjectReference{Kind:"Pod", Namespace:"placeable-regression", Name:"workbench-rules-cc7764ccb-stjc8", UID:"2d3b0c4c-2369-11e9-910a-02b33f273b20", APIVersion:"v1", ResourceVersion:"106928974", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "b849b621ba3c224c1b42912252ded857d2a798d031d8f81065d2b1736d07771a" network for pod "workbench-rules-cc7764ccb-stjc8": NetworkPlugin cni failed to set up pod "workbench-rules-cc7764ccb-stjc8_placeable-regression" network: add cmd: failed to assign an IP address to container
ipamd logs
2019-01-29T01:57:07Z [INFO] Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod
2019-01-29T01:57:08Z [INFO] Received AddNetwork for NS /proc/7846/ns/net, Pod workbench-rules-cc7764ccb-stjc8, NameSpace regression, Container b849b621ba3c224c1b42912252ded857d2a798d031d8f81065d2b1736d07771a, ifname eth0
2019-01-29T01:57:08Z [DEBUG] AssignIPv4Address: IP address pool stats: total:0, assigned 0
About this issue
- State: closed
- Created 5 years ago
- Reactions: 17
- Comments: 24 (11 by maintainers)
Hi! We’re hitting this issue quite regularly with CNI plugin version 1.4.1, since our bursty `Job` workloads mean that node addition in the presence of a backlog of pending pods is a common occurrence, increasing the chance we hit the race condition.

Notably, this issue became more of a problem for us when we upgraded to Kubernetes 1.13, since our `Job` definition specifies a `restartPolicy` of `Never` (a minimal sketch follows below), which means pods that encounter the race condition no longer get retried, due to a 1.13 regression (kubernetes/kubernetes/issues/79398).

I tested briefly using CNI plugin 1.5.0 (under Kubernetes 1.13) and couldn’t seem to reproduce the race condition, though I didn’t investigate much further for now, since we’re going to stick to Kubernetes 1.12 until the Kubernetes regression is fixed anyway.
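For context, a minimal sketch of the kind of `Job` spec described above (the name, image, and command are illustrative placeholders, not taken from the original report); with `restartPolicy: Never`, a pod that fails is not restarted in place:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bursty-example          # hypothetical name
spec:
  template:
    spec:
      containers:
        - name: worker
          image: busybox        # placeholder image
          command: ["sh", "-c", "echo work done"]
      # Never: a pod whose sandbox creation fails on a freshly joined node
      # is not restarted in place, so under the 1.13 regression mentioned
      # above such pods were not retried.
      restartPolicy: Never
```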
Thanks @lutierigb for the workaround.
I did a little research on the issue. The best approach to avoid the manual steps would be to add a gRPC health check against 127.0.0.1:50051 and change the manifest to add liveness/readiness probes; a sketch follows below.
@mogren, any thoughts?
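A minimal sketch of what such probes could look like on the aws-node container in the DaemonSet manifest, assuming a grpc-health-probe-style binary is available in the image (the `/app/grpc-health-probe` path and the timing values are illustrative assumptions, not taken from the project manifest):

```yaml
# Sketch only: gate readiness/liveness of the aws-node container on ipamd's
# gRPC server at 127.0.0.1:50051. Binary path and timings are assumptions.
readinessProbe:
  exec:
    command: ["/app/grpc-health-probe", "-addr=127.0.0.1:50051"]
  initialDelaySeconds: 1
  periodSeconds: 5
livenessProbe:
  exec:
    command: ["/app/grpc-health-probe", "-addr=127.0.0.1:50051"]
  initialDelaySeconds: 35
  periodSeconds: 10
```

Note that this only gates the aws-node pod's own Ready status on ipamd serving gRPC; whether that alone prevents the race described in this issue depends on how node readiness is derived.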
v1.5.3 has a fix for this on new nodes, and the fix has been merged back to master.
@dng-dev, I have unreferenced commits from my local repository which were adding confusion.
By setting a tcpSocket probe, kubelet only opens a socket to the gRPC server, which is not meaningful since it cannot read the response body.
I am still testing whether tcpSocket liveness and readiness probes are the better approach. I have run into similar problems when the aws-node pod takes longer than usual to start and then goes into CrashLoopBackOff with a “liveness probe failed” message.
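For comparison, a sketch of the tcpSocket variant discussed above. Since ipamd listens on 127.0.0.1:50051 and aws-node uses host networking, the probe targets loopback explicitly; it only verifies that the port accepts a connection. The timing values are assumptions:

```yaml
# Sketch of the tcpSocket variant: kubelet only checks that 127.0.0.1:50051
# accepts a TCP connection; it cannot read a gRPC health response, which is
# the limitation described in the comment above.
readinessProbe:
  tcpSocket:
    host: 127.0.0.1
    port: 50051
  initialDelaySeconds: 1
  periodSeconds: 5
livenessProbe:
  tcpSocket:
    host: 127.0.0.1
    port: 50051
  initialDelaySeconds: 35
  periodSeconds: 10
```

A short initialDelaySeconds on the liveness probe is one way the CrashLoopBackOff described above can occur when ipamd takes longer than usual to start.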