amazon-vpc-cni-k8s: Container ID inconsistency results in deleted vlan device
What happened:
On several occasions, we’ve encountered an issue where pods scheduled on a node start with broken network connectivity and never recover. The containers within the pod attempt to start repeatedly, but since the applications inside depend on network connectivity, the containers exit in failure each time and the pod remains in CrashLoopBackOff.
Each time I’ve observed this personally, it has coincided with many pods being scheduled on the node simultaneously. All of the pods scheduled around the same time have the same issue. However, the issue is not permanent: pods scheduled some time after the issue has occurred start with functional network connectivity and continue to work as expected.
After some investigation, I believe I’ve identified the immediate cause of the broken network connectivity. However, I’m unsure of the root cause, and whether the vpc-cni should be preventing the condition from occurring.
Comparing a healthy pod and an unhealthy pod from the same node, I noticed that the unhealthy pod does not have a vlan device (e.g. vlan.eth.XY@eth1) on the host, and that the associated routes that would otherwise be in place are missing. There is a veth device for the broken pod (e.g. vlan7cf3df5c6e2@if3).
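The host-side check amounts to listing the vlan-prefixed links (and the associated route table) and comparing the two pods. Below is a minimal, standalone sketch of that comparison using the github.com/vishvananda/netlink library; it is my own illustration, not part of the CNI. A healthy branch-ENI pod shows both a vlan.eth.XY device and its veth, while the broken pods had only the veth.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/vishvananda/netlink"
)

func main() {
	links, err := netlink.LinkList()
	if err != nil {
		panic(err)
	}
	for _, link := range links {
		attrs := link.Attrs()
		// Print every vlan-prefixed link: the host-side vlan.eth.XY device
		// and the per-pod veth (which also uses a "vlan" prefix here).
		if strings.HasPrefix(attrs.Name, "vlan") {
			fmt.Printf("%-24s type=%-6s index=%d parent=%d\n",
				attrs.Name, link.Type(), attrs.Index, attrs.ParentIndex)
		}
	}
}
```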
Exploring this further, I found the following in the CNI logs (for a broken pod):
{"level":"info","ts":"2021-09-24T22:52:24.230-0400","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received CNI add request: ContainerID(30c860ea0116032d24b70be358ce519ca613b2a449a2ffaafd31992c3245359c) Netns(/proc/20683/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=redacted;K8S_POD_NAME=redacted-bf4687b75-bsmsm;K8S_POD_INFRA_CONTAINER_ID=30c860ea0116032d24b70be358ce519ca613b2a449a2ffaafd31992c3245359c) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"info","ts":"2021-09-24T22:52:24.287-0400","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received add network response for container 30c860ea0116032d24b70be358ce519ca613b2a449a2ffaafd31992c3245359c interface eth0: Success:true IPv4Addr:\"100.64.67.106\" DeviceNumber:-1 VPCcidrs:\"10.12.0.0/17\" VPCcidrs:\"100.64.0.0/17\" VPCcidrs:\"100.64.128.0/18\" PodVlanId:28 PodENIMAC:\"0e:01:d3:36:cc:57\" PodENISubnetGW:\"100.64.64.1\" ParentIfIndex:3 "}
{"level":"info","ts":"2021-09-24T22:52:26.955-0400","caller":"routed-eni-cni-plugin/cni.go:249","msg":"Received CNI del request: ContainerID(a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=redacted;K8S_POD_NAME=redacted-bf4687b75-bsmsm;K8S_POD_INFRA_CONTAINER_ID=a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"info","ts":"2021-09-24T22:52:26.965-0400","caller":"routed-eni-cni-plugin/cni.go:249","msg":"Received del network response for pod redacted-bf4687b75-bsmsm namespace redacted sandbox a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a: Success:true IPv4Addr:\"100.64.67.106\" PodVlanId:28 "}
From this, my suspicion is that the pod sandbox (i.e. the pause container) changed (from a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a to 30c860ea0116032d24b70be358ce519ca613b2a449a2ffaafd31992c3245359c), and that the CNI deleted the interface configuration for the pod when it got the CNI del request for the a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a container, even though 30c860ea0116032d24b70be358ce519ca613b2a449a2ffaafd31992c3245359c was still running. Notably, the del response above reports the same IP address (100.64.67.106) and PodVlanId (28) that were returned in the add response for the new container, which suggests the teardown removed the vlan device the running sandbox was still using.
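If that’s what happened, a guard along the following lines on the del path would have turned the stale request into a no-op. This is a hedged sketch with hypothetical names, not the actual plugin code:

```go
package sketch

// shouldTearDown reports whether a CNI DEL should actually remove the pod's
// vlan device and routes. trackedContainerID is the sandbox ID recorded when
// the pod's network was last set up (30c860ea... above); requestContainerID
// is the ID carried by the incoming DEL (a279fbfc... above). A late DEL for
// an older, already-replaced sandbox would then be ignored instead of
// deleting state the running sandbox still depends on.
func shouldTearDown(trackedContainerID, requestContainerID string) bool {
	return trackedContainerID == requestContainerID
}
```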
The kubelet logs for those container IDs are as follows:
Sep 24 22:52:08 ip-100-64-91-119.ec2.internal kubelet[5518]: W0924 22:52:08.303127 5518 helpers.go:291] Unable to create pod sandbox due to conflict. Attempting to remove sandbox "a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a"
Sep 24 22:52:08 ip-100-64-91-119.ec2.internal kubelet[5518]: E0924 22:52:08.303980 5518 helpers.go:296] Failed to remove the conflicting container "a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a": Error: No such container: a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a
Sep 24 22:52:26 ip-100-64-91-119.ec2.internal kubelet[5518]: W0924 22:52:26.628642 5518 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "redacted-bf4687b75-bsmsm_redacted": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a"
Sep 24 22:52:26 ip-100-64-91-119.ec2.internal kubelet[5518]: I0924 22:52:26.639409 5518 kubelet.go:1899] SyncLoop (PLEG): "redacted-bf4687b75-bsmsm_redacted(012d1b8d-aaf5-432d-a45f-d2273aa8f6bc)", event: &pleg.PodLifecycleEvent{ID:"012d1b8d-aaf5-432d-a45f-d2273aa8f6bc", Type:"ContainerDied", Data:"30c860ea0116032d24b70be358ce519ca613b2a449a2ffaafd31992c3245359c"}
Sep 24 22:52:26 ip-100-64-91-119.ec2.internal kubelet[5518]: W0924 22:52:26.639534 5518 pod_container_deletor.go:79] Container "30c860ea0116032d24b70be358ce519ca613b2a449a2ffaafd31992c3245359c" not found in pod's containers
Sep 24 22:52:26 ip-100-64-91-119.ec2.internal kubelet[5518]: I0924 22:52:26.639560 5518 kubelet.go:1899] SyncLoop (PLEG): "redacted-bf4687b75-bsmsm_redacted(012d1b8d-aaf5-432d-a45f-d2273aa8f6bc)", event: &pleg.PodLifecycleEvent{ID:"012d1b8d-aaf5-432d-a45f-d2273aa8f6bc", Type:"ContainerDied", Data:"a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a"}
Sep 24 22:52:26 ip-100-64-91-119.ec2.internal kubelet[5518]: W0924 22:52:26.639781 5518 pod_container_deletor.go:79] Container "a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a" not found in pod's containers
Sep 24 22:52:26 ip-100-64-91-119.ec2.internal kubelet[5518]: W0924 22:52:26.930676 5518 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "a279fbfcc13e5c1f951b2b770b196f4cc4ac1630b80efc79707ca509d51e079a"
Sep 24 22:52:28 ip-100-64-91-119.ec2.internal kubelet[5518]: I0924 22:52:28.378542 5518 kubelet.go:1899] SyncLoop (PLEG): "redacted-bf4687b75-bsmsm_redacted(012d1b8d-aaf5-432d-a45f-d2273aa8f6bc)", event: &pleg.PodLifecycleEvent{ID:"012d1b8d-aaf5-432d-a45f-d2273aa8f6bc", Type:"ContainerStarted", Data:"30c860ea0116032d24b70be358ce519ca613b2a449a2ffaafd31992c3245359c"}
I’m not sure what to make of these logs. I don’t imagine these failures are expected, but I do imagine they’re related.
Looking at the docker daemon logs for the same time, I see:
Sep 24 22:52:22 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:22.307003534-04:00" level=error msg="Handler for GET /v1.40/containers/74f181080a27d51e82642ed53520d02b6a193fc1c77eaf0732170cbf4d059df2/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Sep 24 22:52:22 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:22.307310668-04:00" level=error msg="Handler for GET /v1.40/containers/74f181080a27d51e82642ed53520d02b6a193fc1c77eaf0732170cbf4d059df2/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Sep 24 22:52:22 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:22.307317248-04:00" level=error msg="Handler for GET /v1.40/containers/74f181080a27d51e82642ed53520d02b6a193fc1c77eaf0732170cbf4d059df2/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Sep 24 22:52:22 ip-100-64-91-119.ec2.internal dockerd[5396]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
Sep 24 22:52:22 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:22.307491880-04:00" level=error msg="Handler for GET /v1.40/containers/74f181080a27d51e82642ed53520d02b6a193fc1c77eaf0732170cbf4d059df2/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Sep 24 22:52:22 ip-100-64-91-119.ec2.internal dockerd[5396]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
Sep 24 22:52:22 ip-100-64-91-119.ec2.internal dockerd[5396]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
Sep 24 22:52:22 ip-100-64-91-119.ec2.internal dockerd[5396]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
Sep 24 22:52:26 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:26.468708964-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:42 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:42.565667606-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:44 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:44.673231937-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:44 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:44.756816200-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:44 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:44.849625554-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:51 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:51.480661606-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:51 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:51.757306502-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:52 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:52.170274288-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:52 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:52.255186091-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:56 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:56.172602355-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:56 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:56.268824056-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:56 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:56.455027055-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:56 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:56.553398051-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:56 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:56.870321436-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 24 22:52:57 ip-100-64-91-119.ec2.internal dockerd[5396]: time="2021-09-24T22:52:57.166208893-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
I am not sure what to make of these either, but I do know that 74f181080a27d51e82642ed53520d02b6a193fc1c77eaf0732170cbf4d059df2 is the container ID of aws-vpc-cni.
My theory is that there’s some inconsistency between dockerd, the kubelet, and the CNI plugin, which results in the CNI deleting the vlan interface that’s needed by a running pod. I’m not sure what underlying condition triggers the inconsistency, but I imagine the CNI should do everything it can to avoid deleting an in-use interface (or at least replace it if it discovers it was deleted in error).
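To illustrate the second half of that (restoring an interface that was deleted in error), a repair step might look roughly like the following. This is a sketch only: it assumes the vlan.eth.<id> naming seen above and the github.com/vishvananda/netlink library, and the real fix would also have to restore the associated routes and rules.

```go
package sketch

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// ensureVlanDevice recreates the host-side vlan device for a branch-ENI pod
// if it has gone missing. parentIfIndex and vlanID correspond to the
// ParentIfIndex and PodVlanId values in the CNI add response above.
func ensureVlanDevice(parentIfIndex, vlanID int) error {
	name := fmt.Sprintf("vlan.eth.%d", vlanID)
	if _, err := netlink.LinkByName(name); err == nil {
		return nil // device still present, nothing to repair
	}
	vlan := &netlink.Vlan{
		LinkAttrs: netlink.LinkAttrs{Name: name, ParentIndex: parentIfIndex},
		VlanId:    vlanID,
	}
	if err := netlink.LinkAdd(vlan); err != nil {
		return fmt.Errorf("recreating %s: %w", name, err)
	}
	// The per-vlan routes and ip rules would also need to be restored here.
	return netlink.LinkSetUp(vlan)
}
```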
Attach logs
These logs are available in the AWS support case (see below).
What you expected to happen:
I expected pods to start with functional network connectivity.
How to reproduce it (as minimally and precisely as possible):
I do not have a reproducer at this time, but will attempt to devise one as time permits.
Anything else we need to know?:
AWS support case 8957900351 was opened for this issue.
Environment:
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.20-eks-8c579e", GitCommit:"8c579edfc914f013ff48b2a2b2c1308fdcacc53f", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:13Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
- CNI Version: 1.7.9
- OS (e.g. cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
- Kernel (e.g. uname -a): Linux ip-100-64-91-119.ec2.internal 5.10.59-52.142.amzn2.x86_64 #1 SMP Wed Aug 18 21:17:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 6
- Comments: 28 (15 by maintainers)
Hi @chlunde, I was able to reproduce this issue with containerd as the runtime. It’s happening due to a spurious delete request being triggered even though the first delete for a pod succeeds. For now, I have updated the PR to ignore spurious delete requests. You can rebuild using the updated PR and see if it works. I am working on finding the root cause and will update you.
@chlunde If by status update you mean the official public image release (v1.10.2) for the above fix, then it’s scheduled for 7th Feb. But we do have a private image released with the above fix (v1.10.2-rc1) and have been asking customers to try it out until then.
Installation instructions for the private image (v1.10.2-rc1): apply this manifest with the CNI image tag changed to v1.10.2-rc1.
Logs sent. Yes, your sequence is correct.
@JacobHenner As per @jayanthvn, who is driving this repo, the next version release is scheduled in two weeks, and we should have an RC image before that. We will work on getting this fixed in the next release. If anything slips, we will keep this thread updated.
Update: I was looking at a few alternatives, such as skipping deletion of the vlan for the previous pod with the same name if the current pod doesn’t have a DeletionTimestamp set, but that looks a little hacky since there could be other race conditions here. Another manual workaround, hacky as well, is adding a finalizer to the pod spec and, during StatefulSet updates, patching the pod metadata to remove the finalizer a few seconds/minutes after the kubelet sets deletionGracePeriodSeconds to 0. But given that the team is actively working on the file-based solution, we will keep this as our high-priority item and update the issue when the CNI is released.
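A rough sketch of that first alternative, using client-go (hypothetical helper, not the actual CNI code): before tearing down the vlan for a pod whose name still exists, check whether the live pod object is actually terminating.

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podIsTerminating reports whether the pod ns/name is gone or has a
// DeletionTimestamp set. If it returns false, the delete request is likely a
// stale one for a previous sandbox of a reused pod name (e.g. a StatefulSet
// replacement), and skipping the vlan teardown would avoid breaking the new pod.
func podIsTerminating(ctx context.Context, client kubernetes.Interface, ns, name string) (bool, error) {
	pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return true, nil // pod no longer exists; cleanup is safe
	}
	if err != nil {
		return false, err
	}
	return pod.DeletionTimestamp != nil, nil
}
```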
We might be in code freeze for the next release, but we will cut one as soon as the PR for this issue is merged.
You are right, @chlunde, the terminationGracePeriod setting doesn’t help with this, and yes, I was talking about the long-term fix the team is actively looking into. I’m looking at another short-term fix to mitigate the impact. Will keep you posted here by EOD.
/cc @JacobHenner