kubernetes: Deleting a pod didn't call CNI "DEL", causing a CNI IP leak

What happened:

Some issue caused pod initialization to fail; however, the CNI-assigned IP was never recycled by running "DEL".

Eventually new pods got stuck because Azure CNI had no more addresses to assign.

We identified one pod whose name appears in azure-vnet.json (which saves the state of each IP) but which no longer exists in the cluster. The following are the related logs from the node's /var/log/messages covering its lifecycle (the node name has been removed from the log to keep it from being tediously long).

16:16:38 kubelet[3898]: I1125 16:16:38.724199    3898 kubelet.go:1887] SyncLoop (ADD, "api"): "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn_lifestyle(f733ccfa-0f9e-11ea-af5e-467e88bd3364)"
16:16:38 kubelet[3898]: I1125 16:16:38.861325    3898 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-bj62h" (UniqueName: "kubernetes.io/secret/f733ccfa-0f9e-11ea-af5e-467e88bd3364-default-token-bj62h") pod "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn" (UID: "f733ccfa-0f9e-11ea-af5e-467e88bd3364")
16:16:38 kubelet[3898]: I1125 16:16:38.961694    3898 reconciler.go:252] operationExecutor.MountVolume started for volume "default-token-bj62h" (UniqueName: "kubernetes.io/secret/f733ccfa-0f9e-11ea-af5e-467e88bd3364-default-token-bj62h") pod "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn" (UID: "f733ccfa-0f9e-11ea-af5e-467e88bd3364")
16:16:38 systemd[1]: Started Kubernetes transient mount for /var/lib/kubelet/pods/f733ccfa-0f9e-11ea-af5e-467e88bd3364/volumes/kubernetes.io~secret/default-token-bj62h.
16:16:38 kubelet[3898]: I1125 16:16:38.969572    3898 operation_generator.go:669] MountVolume.SetUp succeeded for volume "default-token-bj62h" (UniqueName: "kubernetes.io/secret/f733ccfa-0f9e-11ea-af5e-467e88bd3364-default-token-bj62h") pod "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn" (UID: "f733ccfa-0f9e-11ea-af5e-467e88bd3364")
16:16:39 kubelet[3898]: I1125 16:16:39.078573    3898 kuberuntime_manager.go:409] No sandbox for pod "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn_lifestyle(f733ccfa-0f9e-11ea-af5e-467e88bd3364)" can be found. Need to start a new one
16:16:39 kubelet[3898]: I1125 16:16:39.337038    3898 kubelet.go:1903] SyncLoop (DELETE, "api"): "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn_lifestyle(f733ccfa-0f9e-11ea-af5e-467e88bd3364)"
16:16:39 kubelet[3898]: I1125 16:16:39.465833    3898 reconciler.go:181] operationExecutor.UnmountVolume started for volume "default-token-bj62h" (UniqueName: "kubernetes.io/secret/f733ccfa-0f9e-11ea-af5e-467e88bd3364-default-token-bj62h") pod "f733ccfa-0f9e-11ea-af5e-467e88bd3364" (UID: "f733ccfa-0f9e-11ea-af5e-467e88bd3364")
16:16:39 kubelet[3898]: I1125 16:16:39.486585    3898 operation_generator.go:815] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/f733ccfa-0f9e-11ea-af5e-467e88bd3364-default-token-bj62h" (OuterVolumeSpecName: "default-token-bj62h") pod "f733ccfa-0f9e-11ea-af5e-467e88bd3364" (UID: "f733ccfa-0f9e-11ea-af5e-467e88bd3364"). InnerVolumeSpecName "default-token-bj62h". PluginName "kubernetes.io/secret", VolumeGidValue ""
16:16:39 kubelet[3898]: I1125 16:16:39.566107    3898 reconciler.go:301] Volume detached for volume "default-token-bj62h" (UniqueName: "kubernetes.io/secret/f733ccfa-0f9e-11ea-af5e-467e88bd3364-default-token-bj62h") on node "aks-agentpool-33856301-vmss000002" DevicePath ""
16:16:39 dockerd[3529]: time="2019-11-25T16:16:39.643686400Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/3cb3517d1edbc101ec794e809dcfa90222d93d731385becd005a78b49383d6f6/shim.sock" debug=false pid=108918
16:16:40 kubelet[3898]: 2019/11/25 16:16:40 [109066] [cni-ipam] Processing ADD command with args {ContainerID:3cb3517d1edbc101ec794e809dcfa90222d93d731385becd005a78b49383d6f6 Netns:/proc/108942/ns/net IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=lifestyle;K8S_POD_NAME=lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn;K8S_POD_INFRA_CONTAINER_ID=3cb3517d1edbc101ec794e809dcfa90222d93d731385becd005a78b49383d6f6 Path:/opt/cni/bin}.
16:16:40 dockerd[3529]: time="2019-11-25T16:16:40.358203900Z" level=info msg="shim reaped" id=3cb3517d1edbc101ec794e809dcfa90222d93d731385becd005a78b49383d6f6
16:16:40 kubelet[3898]: W1125 16:16:40.839920    3898 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn_lifestyle": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "3cb3517d1edbc101ec794e809dcfa90222d93d731385becd005a78b49383d6f6"
16:16:40 kubelet[3898]: W1125 16:16:40.869931    3898 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn_lifestyle": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "3cb3517d1edbc101ec794e809dcfa90222d93d731385becd005a78b49383d6f6"
16:16:40 kubelet[3898]: I1125 16:16:40.871787    3898 kubelet.go:1932] SyncLoop (PLEG): "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn_lifestyle(f733ccfa-0f9e-11ea-af5e-467e88bd3364)", event: &pleg.PodLifecycleEvent{ID:"f733ccfa-0f9e-11ea-af5e-467e88bd3364", Type:"ContainerStarted", Data:"3cb3517d1edbc101ec794e809dcfa90222d93d731385becd005a78b49383d6f6"}
16:16:40 kubelet[3898]: I1125 16:16:40.905369    3898 kubelet.go:1903] SyncLoop (DELETE, "api"): "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn_lifestyle(f733ccfa-0f9e-11ea-af5e-467e88bd3364)"
16:16:40 kubelet[3898]: I1125 16:16:40.914246    3898 kubelet.go:1897] SyncLoop (REMOVE, "api"): "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn_lifestyle(f733ccfa-0f9e-11ea-af5e-467e88bd3364)"
16:16:40 kubelet[3898]: I1125 16:16:40.914295    3898 kubelet.go:2100] Failed to delete pod "lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn_lifestyle(f733ccfa-0f9e-11ea-af5e-467e88bd3364)", err: pod not found
16:16:40 kubelet[3898]: E1125 16:16:40.921270    3898 pod_workers.go:190] Error syncing pod f733ccfa-0f9e-11ea-af5e-467e88bd3364 ("lifestyle-contentservice-api-deployment-green-74bc4ccf86-r9pfn_lifestyle(f733ccfa-0f9e-11ea-af5e-467e88bd3364)"), skipping: failed to "StartContainer" for "lifestyle-contentservice-api" with CreateContainerConfigError: "object \"lifestyle\"/\"secret-lifestyle-contentservice-api\" not registered"
16:16:41 kubelet[3898]: W1125 16:16:41.887112    3898 pod_container_deletor.go:75] Container "3cb3517d1edbc101ec794e809dcfa90222d93d731385becd005a78b49383d6f6" not found in pod's containers

We can see the following in the log, but we are not sure how to interpret which is the cause and which is part of the symptom.

  • The pod was requested to be deleted immediately after its start.
  • The volume for the default token secret was unmounted because of the delete.
  • containerd-shim was started but immediately failed; CNI "ADD" was called.
  • The pod IP could not be read because the container had already terminated.
  • We can see an error in StartContainer about the secret (object "lifestyle"/"secret-lifestyle-contentservice-api" not registered). Is this what causes the shim to be reaped?
  • The "pod not found" error might be benign because of caching.
  • The "not found in pod's containers" message doesn't seem to be on the same code path that leads to TearDownPod.

Background: the current code path that leads to CNI DEL

  • dockershim/docker_sandbox.go StopPodSandbox -> CNI.TearDownPod
  • kubelet_pods.go kl.podKillingCh-> podKiller -> killPod -> kuberuntime/kuberuntime_manager.go KillPod
  • kubelet.go syncLoop-> syncLoopIteration -> HandlePodRemoves -> deletePod -> kl.podKillingCh
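For reference, the chain above can be sketched as a tiny standalone Go program (every type and function below is a stub invented for illustration, not real kubelet code):

// Stand-in for the call chain above; the stubs only make the chain explicit.
package main

import "fmt"

type pod struct{ name, sandboxID string }

type kubelet struct{ podKillingCh chan *pod }

// syncLoopIteration -> HandlePodRemoves -> deletePod -> podKillingCh
func (kl *kubelet) HandlePodRemoves(p *pod) { kl.deletePod(p) }
func (kl *kubelet) deletePod(p *pod)        { kl.podKillingCh <- p }

// podKiller -> killPod -> KillPod -> StopPodSandbox -> TearDownPod (CNI DEL)
func (kl *kubelet) podKiller() {
	for p := range kl.podKillingCh {
		killPod(p)
	}
}

func killPod(p *pod)        { runtimeKillPod(p) } // kuberuntime_manager.go KillPod
func runtimeKillPod(p *pod) { stopPodSandbox(p) } // dockershim StopPodSandbox
func stopPodSandbox(p *pod) { tearDownPod(p) }    // CNI.TearDownPod
func tearDownPod(p *pod)    { fmt.Println("CNI DEL for sandbox", p.sandboxID) }

func main() {
	kl := &kubelet{podKillingCh: make(chan *pod)}
	done := make(chan struct{})
	go func() { kl.podKiller(); close(done) }()
	kl.HandlePodRemoves(&pod{name: "demo", sandboxID: "sandbox-0"})
	close(kl.podKillingCh) // podKiller exits once the channel drains
	<-done
}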

What you expected to happen:

Even if the pod sandbox fails due to some issue, we expect it to be correctly recycled, with TearDownPod being called.

How to reproduce it (as minimally and precisely as possible): Not yet clear; we need directions on possible causes so we can proceed with a targeted reproduction.

Anything else we need to know?: #34278 seems to be a historical issue of IP leaking.

Environment:

  • Kubernetes version (use kubectl version): 1.14.8
  • Cloud provider or hardware configuration: AKS
  • OS (e.g: cat /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): 4.15.0-1060-azure (buildd@lgw01-amd64-055) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)) #65-Ubuntu SMP Wed Sep 18 08:55:51 UTC 2019
  • Install tools:
  • Network plugin and version (if this is a network-related bug): Plugin azure-vnet-ipam version v1.0.27
  • Others:

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 25 (11 by maintainers)

Most upvoted comments

We are also facing the same issue. The issue is caused by a race condition between the container GC goroutine and the pod sandbox creation goroutine. We can reproduce it by adding an artificial delay in the cni plugin's cmdAdd (which gets called as part of pod sandbox creation), so that the container GC goroutine gets invoked during the pod sandbox creation and deletes the pod sandbox that is being created.

Steps to reproduce

  1. Modify the cni plugin code to sleep for longer than containerGcPeriod as the last thing in the cmdAdd function and deploy it to a k8s node (one way to introduce the delay is shown in the wrapper sketch after these steps)
  2. create a pod that gets scheduled on the k8s node
  3. delete the pod immediately
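
For step 1, rather than patching the plugin source, an equivalent trick is a small wrapper binary that forwards each CNI invocation to the real plugin and delays its own exit on ADD; from the kubelet's point of view cmdAdd simply takes that long. This is only a hypothetical sketch (the plugin path and the delay value are assumptions):

// Hypothetical wrapper used only to reproduce the race. Install it in
// /opt/cni/bin under the plugin's name and point realPlugin at the renamed
// original binary (both paths are assumptions for this sketch).
package main

import (
	"io"
	"os"
	"os/exec"
	"time"
)

const (
	realPlugin = "/opt/cni/bin/azure-vnet.real" // assumed path to the renamed real plugin
	addDelay   = 90 * time.Second               // must exceed the container GC period
)

func main() {
	cmd := exec.Command(realPlugin)
	cmd.Env = os.Environ() // CNI_COMMAND, CNI_CONTAINERID, CNI_NETNS, ...
	cmd.Stdin = os.Stdin   // network config JSON from the runtime
	cmd.Stdout = os.Stdout // result JSON must reach the runtime unchanged
	cmd.Stderr = os.Stderr

	err := cmd.Run()

	// Delay the completion of ADD as seen by the runtime; DEL/CHECK pass through.
	if os.Getenv("CNI_COMMAND") == "ADD" && err == nil {
		time.Sleep(addDelay)
	}

	if err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			os.Exit(exitErr.ExitCode())
		}
		io.WriteString(os.Stderr, err.Error()+"\n")
		os.Exit(1)
	}
}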

You would notice that the cni plugin's cmdAdd was called for the new pod, whereas cmdDel was never called even though the pod was terminated. Here are the relevant logs:

Jan 22 12:43:09 test-kube-worker-1611049 kubelet[4747]: I0122 12:43:09.331432    4747 config.go:412] Receiving a new pod "nginx-66ddf67c7c-bwt77_default(b99621cb-6d27-4d1b-8726-4b952a9cd59a)"
Jan 22 12:43:09 test-kube-worker-1611049 kubelet[4747]: I0122 12:43:09.331498    4747 kubelet.go:1884] SyncLoop (ADD, "api"): "nginx-66ddf67c7c-bwt77_default(b99621cb-6d27-4d1b-8726-4b952a9cd59a)"
Jan 22 12:43:09 test-kube-worker-1611049 kubelet[4747]: I0122 12:43:09.665754    4747 kuberuntime_manager.go:682] Creating sandbox for pod "nginx-66ddf67c7c-bwt77_default(b99621cb-6d27-4d1b-8726-4b952a9cd59a)"
Jan 22 12:43:10 test-kube-worker-1611049 kubelet[4747]: I0122 12:43:10.113115    4747 kubelet.go:1900] SyncLoop (DELETE, "api"): "nginx-66ddf67c7c-bwt77_default(b99621cb-6d27-4d1b-8726-4b952a9cd59a)"
Jan 22 12:43:10 test-kube-worker-1611049 kubelet[4747]: I0122 12:43:10.395176    4747 plugins.go:380] Calling network plugin cni to set up pod "nginx-66ddf67c7c-bwt77_default"
Jan 22 12:43:30 test-kube-worker-1611049 kubelet[4747]: I0122 12:43:30.817684    4747 kuberuntime_gc.go:165] Removing sandbox "0c73286c847e8dd0929c6daf8641af4690e149bf1f6241e68ccc173d608bd45f"
Jan 22 12:44:20 test-kube-worker-1611049 kubelet[4747]: I0122 12:44:20.787546    4747 kuberuntime_manager.go:696] Created PodSandbox "0c73286c847e8dd0929c6daf8641af4690e149bf1f6241e68ccc173d608bd45f" for pod "nginx-66ddf67c7c-bwt77_default(b99621cb-6d27-4d1b-8726-4b952a9cd59a)"
Jan 22 12:44:20 test-kube-worker-1611049 kubelet[4747]: E0122 12:44:20.788420    4747 kuberuntime_manager.go:705] Failed to get pod sandbox status: rpc error: code = Unknown desc = Error: No such container: 0c73286c847e8dd0929c6daf8641af4690e149bf1f6241e68ccc173d608bd45f; Skipping pod "nginx-66ddf67c7c-bwt77_default(b99621cb-6d27-4d1b-8726-4b952a9cd59a)"

I went through the source code a bit to understand what is happening. Just adding my findings here in case it helps others. We are running k8s 1.15.0. I’ll give references to code based on v1.15.0.

Pod Sync Loop (optional implementation detail; can be skipped)

There is a PodSync Loop (kubelet.go:1871) which listens on an updates channel (podConfig.updates). A map of sources to channels is stored in podConfig.Mux.Sources. For each source in this map, a goroutine is spawned that listens on the corresponding channel and calls podStorage.Merge. podStorage.Merge splits the updates received into adds, updates, deletes, and removes, and then pushes them to the updates channel that the PodSync Loop listens on. One of the sources is the API server (pkg/kubelet/config/apiserver.go), which watches pods from all namespaces scheduled on the node the kubelet is running on and pushes these updates to the corresponding channel in podConfig.Mux.Sources. Updates from the API server are generally a list of pods that podStorage.Merge converts into updates of type add, delete, remove, or update by comparing them to the locally stored state before pushing them onto the updates channel that the PodSync Loop listens on.
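
A minimal, self-contained model of that fan-in might look like the following (the types are invented for illustration; the real code lives in pkg/kubelet/config):

// Simplified model of the source -> Merge -> updates fan-in described above.
package main

import (
	"fmt"
	"sync"
)

type podUpdate struct {
	op  string // "ADD", "UPDATE", "DELETE", "REMOVE"
	pod string
}

// mux owns one channel per source and merges everything into updates,
// the channel the sync loop listens on.
type mux struct {
	mu      sync.Mutex
	sources map[string]chan podUpdate
	updates chan podUpdate
}

func newMux() *mux {
	return &mux{sources: make(map[string]chan podUpdate), updates: make(chan podUpdate, 16)}
}

// ChannelFor spawns one forwarding goroutine per source, mirroring podConfig.Mux.
func (m *mux) ChannelFor(source string) chan podUpdate {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.sources[source]; ok {
		return ch
	}
	ch := make(chan podUpdate)
	m.sources[source] = ch
	go func() {
		for u := range ch {
			// The real Merge diffs against stored state to classify the
			// update; here we simply forward it.
			m.updates <- u
		}
	}()
	return ch
}

func main() {
	m := newMux()
	api := m.ChannelFor("api")
	go func() {
		api <- podUpdate{op: "ADD", pod: "nginx-abc"}
		api <- podUpdate{op: "DELETE", pod: "nginx-abc"}
	}()
	// Stand-in for the PodSync Loop's syncLoopIteration.
	for i := 0; i < 2; i++ {
		u := <-m.updates
		fmt.Println("sync loop got", u.op, u.pod)
	}
}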

Pod Addition Flow

Now let’s look at the new pod addition flow. The PodSync Loop receives an update on the updates channel of type kubetypes.ADD.

  1. The case kubetypes.ADD executes inside the PodSync Loop, i.e., the kubelet.syncLoopIteration function
  2. kubelet.HandlePodAdditions:2058
  3. kubelet.dispatchWork:1994
  4. podWorkers.UpdatePod:221. This part is executed as a separate goroutine that listens for updates on the channel passed as an argument.
  5. The actual pod update for the new pod creation is pushed to the updates channel in pod_workers.go:226.
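
A simplified model of steps 4 and 5 (hypothetical types; the real code is pkg/kubelet/pod_workers.go):

// Simplified model of podWorkers.UpdatePod: one goroutine plus channel per
// pod UID, so sync work runs asynchronously from the main sync loop.
package main

import (
	"fmt"
	"sync"
	"time"
)

type podWorkers struct {
	mu         sync.Mutex
	podUpdates map[string]chan string // pod UID -> update channel
	syncPodFn  func(uid, work string) // kubelet.syncPod in the real code
}

func (p *podWorkers) UpdatePod(uid, work string) {
	p.mu.Lock()
	ch, ok := p.podUpdates[uid]
	if !ok {
		ch = make(chan string, 1)
		p.podUpdates[uid] = ch
		go p.managedPodLoop(uid, ch) // step 4: per-pod goroutine
	}
	p.mu.Unlock()
	ch <- work // step 5: push the update to the per-pod channel
}

func (p *podWorkers) managedPodLoop(uid string, ch chan string) {
	for work := range ch {
		p.syncPodFn(uid, work) // eventually calls RunPodSandbox -> CNI ADD
	}
}

func main() {
	pw := &podWorkers{
		podUpdates: make(map[string]chan string),
		syncPodFn: func(uid, work string) {
			fmt.Printf("syncPod(%s): %s\n", uid, work)
		},
	}
	pw.UpdatePod("f733ccfa", "create sandbox")
	time.Sleep(100 * time.Millisecond) // let the worker goroutine run
}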

ManagedPodLoop - Runs as a separate goroutine as mentioned in step 4.

  1. Inside pod_workers.managedPodLoop, podWorkers.syncPodFn is called, which is nothing but kubelet.syncPod
  2. A detailed explanation of what happens inside kubelet.syncPod can be found in this excellent article what happens when k8s. The relevant part for us would be calling CRI’s RunPodSandbox which in turn calls CNI’s cmdAdd. After this, the init containers followed by actual containers are created and started.
  3. In our case the CRI is dockershim. When dockerService.RunPodSandbox executes successfully without any errors (setting up the pause container and the network namespace, and calling cni.cmdAdd), dockerService.setNetworkReady is invoked to mark the pod sandbox's network as ready.

Note: An important thing to note is that once the actual pod creation is handed over to the Managed Pod Loop, the PodSync Loop can continue to process additional updates from the channel it is listening on, while the work on pod creation continues asynchronously in podWorkers.managedPodLoop. Because of this, when the pod is marked as deleted in the API server before the asynchronous managedPodLoop finishes creating the pod, the pod deletion flow is started in parallel in the PodSync Loop.

Pod Deletion Flow

Once the asynchronous managedPodLoop starts handling the pod creation, any immediate pod delete sent to the API server will be handled by the PodSync Loop. As part of pod deletion, AFAIK, only the pod status is updated, and later the container GC goroutine deletes the containers and sandboxes marked as deleted in the status manager (I haven't gone deep into this code flow). In our case, the cni delete that was supposed to be invoked as part of the container GC goroutine did not get invoked, since dockerService.networkReady[podSandboxId] was set to false; it is only set to true as part of step 3 of the asynchronous managedPodLoop. So effectively the container GC goroutine deleted the pod sandbox (the pause container) without calling the cni delete, causing the IP to be leaked.
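
A stripped-down model of that interaction, loosely mirroring the setNetworkReady/getNetworkReady bookkeeping described above (hypothetical names, not the real dockershim implementation):

// Toy model of the race: the sandbox is registered with networkReady=false,
// CNI ADD runs slowly, and a concurrent teardown sees ready==false and skips
// CNI DEL, leaking the IP.
package main

import (
	"fmt"
	"sync"
	"time"
)

type dockerService struct {
	mu           sync.Mutex
	networkReady map[string]bool
}

func (ds *dockerService) setNetworkReady(id string, ready bool) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	ds.networkReady[id] = ready
}

func (ds *dockerService) getNetworkReady(id string) (ready, known bool) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	ready, known = ds.networkReady[id]
	return
}

func (ds *dockerService) runPodSandbox(id string) {
	ds.setNetworkReady(id, false)      // registered, but network not set up yet
	time.Sleep(200 * time.Millisecond) // slow CNI ADD (the artificial delay)
	fmt.Println("CNI ADD finished for", id)
	ds.setNetworkReady(id, true)
}

func (ds *dockerService) stopPodSandbox(id string) {
	// Teardown only calls CNI DEL if the network was marked ready (or the
	// sandbox is unknown); a known-but-not-ready sandbox is skipped.
	if ready, known := ds.getNetworkReady(id); ready || !known {
		fmt.Println("CNI DEL for", id)
	} else {
		fmt.Println("skipping CNI DEL for", id, "-> IP leaked")
	}
}

func main() {
	ds := &dockerService{networkReady: make(map[string]bool)}
	go ds.runPodSandbox("sandbox-0")
	time.Sleep(50 * time.Millisecond) // GC fires while CNI ADD is in flight
	ds.stopPodSandbox("sandbox-0")
	time.Sleep(300 * time.Millisecond)
}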

The Bug

Now the question is why the container GC goroutine picks up a pod for which cni Add has not yet completed, and because of that wrongly assumes that cni delete doesn't have to be called. Here is the comment from the evictSandboxes function that describes the conditions under which a pod sandbox is eligible to be evicted by the container GC goroutine.

evictSandboxes remove all evictable sandboxes. An evictable sandbox must meet the following requirements:

  1. not in ready state
  2. contains no containers.
  3. belong to a non-existent (i.e., already removed) pod, or is not the most recently created sandbox for the pod.

This bug, IMO, is caused by this race condition between the container creation flow, the container delete flow, and the container GC flow, and would probably be hit with any cni plugin. In our case, we use a modified version of the bridge cni plugin with a custom IPAM plugin.
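
Condensing the three eviction conditions quoted above into a predicate (a hypothetical simplification, not the actual evictSandboxes code) shows why the half-created sandbox qualifies: it is not yet ready, has no containers, and its pod has already been removed from the API server:

// Hypothetical condensation of the three eviction conditions quoted above.
package main

import "fmt"

type sandbox struct {
	ready          bool
	containerCount int
}

func evictable(s sandbox, podExists, newestForPod bool) bool {
	return !s.ready && // 1. not in ready state
		s.containerCount == 0 && // 2. contains no containers
		(!podExists || !newestForPod) // 3. pod gone, or not the newest sandbox
}

func main() {
	// A sandbox whose CNI ADD is still in flight: not ready, no containers,
	// and its pod has already been deleted from the API server.
	fmt.Println(evictable(sandbox{ready: false, containerCount: 0}, false, true)) // true -> GC evicts it
}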

I’m tagging the devs who have worked on the container GC code to get their thought on this bug - @Random-Liu, @feiskyer, @dashpole

I’m working on a fix for this issue; I’ve tested it in our clusters and it seems to work. The fix uses a mutex to avoid the race condition between the create pod (cni cmdAdd) and delete pod (cni cmdDel) flows in dockershim. I’ll raise a PR once I run the integration tests. Does anyone know whether it is possible to add tests for race conditions? If so, could you point me to some examples or docs so that I can add a relevant test case for this?
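
As a rough illustration of the idea only (not the actual patch), a per-sandbox lock shared by the setup and teardown paths could look like this:

// Rough sketch of the proposed fix: serialize sandbox network setup and
// teardown with a per-sandbox mutex so a teardown cannot run while that
// sandbox's CNI ADD is still in flight. Hypothetical types, not real dockershim.
package main

import (
	"fmt"
	"sync"
	"time"
)

type networkGuard struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (g *networkGuard) lockFor(sandboxID string) *sync.Mutex {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.locks == nil {
		g.locks = make(map[string]*sync.Mutex)
	}
	if _, ok := g.locks[sandboxID]; !ok {
		g.locks[sandboxID] = &sync.Mutex{}
	}
	return g.locks[sandboxID]
}

func setUpPod(g *networkGuard, id string) {
	l := g.lockFor(id)
	l.Lock()
	defer l.Unlock()
	fmt.Println("CNI ADD start", id)
	time.Sleep(200 * time.Millisecond) // slow CNI ADD
	fmt.Println("CNI ADD done", id)
}

func tearDownPod(g *networkGuard, id string) {
	l := g.lockFor(id)
	l.Lock() // blocks until any in-flight setup for this sandbox finishes
	defer l.Unlock()
	fmt.Println("CNI DEL", id)
}

func main() {
	g := &networkGuard{}
	go setUpPod(g, "sandbox-0")
	time.Sleep(50 * time.Millisecond)
	tearDownPod(g, "sandbox-0") // waits for ADD, so DEL is never skipped
}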