cilium: CiliumEndpoint missing for a pod

Cilium version: 1.9.4, kubelet: 1.20

After upgrading a cluster, we noticed that the CiliumEndpoint is missing for a running pod. We dug into it and think it may be a bug in the way Cilium manages CiliumEndpoints. Here is the timeline:

  1. An existing StatefulSet pod, alertmanager-0, is running on a node.
  2. The cluster goes through an upgrade; both cilium-agent and alertmanager-0 are upgraded.
  3. kubelet fails to remove the old container for alertmanager-0 because cilium-agent is being restarted at the same time (so the CNI plugin is missing):
Aug 03 04:08:11 bmut-dozrr9-0803-024754-871f3-acp2 kubelet[48398]: E0803 04:08:11.853703   48398 pod_workers.go:191] Error syncing pod 3f7c5201-edc9-4b83-a6a2-470289ae89ac ("alertmanager-0_kube-system(3f7c5201-edc9-4b83-a6a2-470289ae89ac)"), skipping: error killing pod: failed to "KillPodSandbox" for "3f7c5201-edc9-4b83-a6a2-470289ae89ac" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"a580c1ab31c3b1c66cbc5d542f9eb340bd0494449c2d832c1bda6e27636ed949\": failed to find plugin \"cilium-cni\" in path [/opt/cni/bin]"
  4. cilium-agent is restarted and starts to restore existing endpoints:
2021-08-03T04:08:34.378283278Z level=info msg="New endpoint" containerID=a580c1ab31 datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=277 identity=60609 ipv4=192.168.1.165 ipv6= k8sPodName=kube-system/alertmanager-0 subsys=endpoint
  5. Almost at the same time, kubelet creates a new container for the pod:
2021-08-03T04:08:34.737428463Z level=info msg="Create endpoint request" addressing="&{192.168.1.85 78bda739-f410-11eb-92f1-42010a800063  }" containerID=f4a2e88f66aabf348196605c0b4d430bf6f3b4f660224f984bd16053b07b6b23 datapathConfiguration="<nil>" interface=lxc368fadda4150 k8sPodName=kube-system/alertmanager-0 labels="[]" subsys=daemon sync-build=true

Note that the container ID is not the same, so it is kubelet creating a new container for the same pod.

  6. After ~30 seconds, kubelet tries to remove the old container of the pod:
2021-08-03T04:09:00.533694827Z level=info msg="Delete endpoint request" id="container-id:a580c1ab31c3b1c66cbc5d542f9eb340bd0494449c2d832c1bda6e27636ed949" subsys=daemon
2021-08-03T04:09:00.534015683Z level=info msg="Releasing key" key="[k8s:app=alertmanager k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=alertmanager k8s:io.kubernetes.pod.namespace=kube-system k8s:statefulset.kubernetes.io/pod-name=alertmanager-0]" subsys=allocator
2021-08-03T04:09:00.539698529Z level=info msg="Removed endpoint" containerID=a580c1ab31 datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=277 identity=60609 ipv4=192.168.1.165 ipv6= k8sPodName=kube-system/alertmanager-0 subsys=endpoint

Note that the container ID in the request is the old one.

  7. cilium-agent removes the CiliumEndpoint even though the pod is still running.

After all of these steps, we end up with alertmanager-0 running fine but without a CiliumEndpoint. I guess the issue is that when we process the delete-endpoint RPC request, we don't check that the container ID matches the one backing the current CiliumEndpoint, so if kubelet is removing a stale container, the CiliumEndpoint for the pod is removed. Shall we add this check?
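
A minimal sketch of the guard we have in mind (illustrative only; the endpoint struct and the lookup-by-pod map are assumptions standing in for the agent's real endpoint manager): before acting on a delete-endpoint request, compare the container ID in the request with the one the current endpoint was created with, and ignore the request if they differ.

package main

import (
    "errors"
    "fmt"
)

// endpoint stands in for Cilium's endpoint object; only the field needed here.
type endpoint struct {
    containerID string
}

// endpointsByPod stands in for the agent's endpoint lookup by pod name.
var endpointsByPod = map[string]*endpoint{}

var errStaleContainer = errors.New("delete request references a stale container ID")

// deleteEndpointRequest ignores deletes whose container ID no longer matches the
// endpoint currently backing the pod, so a late CNI DEL for an old container
// cannot tear down the endpoint (and CEP) of the new one.
func deleteEndpointRequest(podName, requestContainerID string) error {
    ep, ok := endpointsByPod[podName]
    if !ok {
        return nil // nothing to delete
    }
    if ep.containerID != requestContainerID {
        return errStaleContainer
    }
    delete(endpointsByPod, podName)
    return nil
}

func main() {
    // The new container replaced the old one while the agent was restarting.
    endpointsByPod["kube-system/alertmanager-0"] = &endpoint{containerID: "f4a2e88f66aa"}

    // A late delete for the old container must not remove the new endpoint.
    fmt.Println(deleteEndpointRequest("kube-system/alertmanager-0", "a580c1ab31c3"))
    fmt.Println(endpointsByPod["kube-system/alertmanager-0"] != nil) // still present
}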

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 7
  • Comments: 33 (32 by maintainers)

Most upvoted comments

@christarazi @Weil0ng We should only delete the CEP with the precondition that its UID is the same as the one recorded when the CEP was created. Something like this:

diff --git a/pkg/k8s/watchers/endpointsynchronizer.go b/pkg/k8s/watchers/endpointsynchronizer.go
index 9e716c85b4..b4fe16a44e 100644
--- a/pkg/k8s/watchers/endpointsynchronizer.go
+++ b/pkg/k8s/watchers/endpointsynchronizer.go
@@ -177,6 +177,9 @@ func (epSync *EndpointSynchronizer) RunK8sCiliumEndpointSync(e *endpoint.Endpoin
                                                        return err
                                                }
 
+                                               // Store CEP UID
+                                               e.CEPUID = localCEP.UID
+
                                                // continue the execution so we update the endpoint
                                                // status immediately upon endpoint creation
                                        case err != nil:
@@ -324,8 +327,15 @@ func deleteCEP(ctx context.Context, scopedLog *logrus.Entry, ciliumClient v2.Cil
                scopedLog.Debug("Skipping CiliumEndpoint deletion because it has no k8s namespace")
                return nil
        }
-       if err := ciliumClient.CiliumEndpoints(namespace).Delete(ctx, podName, meta_v1.DeleteOptions{}); err != nil {
-               if !k8serrors.IsNotFound(err) {
+
+       err := ciliumClient.CiliumEndpoints(namespace).Delete(ctx, podName, meta_v1.DeleteOptions{
+               Preconditions: &meta_v1.Preconditions{
+                       UID: func() *types.UID { a := types.UID(e.CEPUID); return &a }(),
+               },
+       })
+
+       if err != nil {
+                       if !k8serrors.IsNotFound(err) && !k8serrors.IsConflict(err) {
                        scopedLog.WithError(err).Warning("Unable to delete CEP")
                }
        }

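One detail on the error handling: there is no PreConditionFailed helper in the apimachinery errors package. When the UID precondition does not match, the apiserver rejects the delete with a 409 Conflict, so tolerating k8serrors.IsConflict alongside IsNotFound is what lets a stale agent's delete fail quietly instead of removing the new CEP.
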
The CEP object is only removed because it has the ownerReference set to the backing pod, so that their lifecycles are tied together.
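
For illustration, a rough sketch (not Cilium's actual code) of how such an ownerReference ties a CEP's metadata to its backing pod, so that deleting the pod garbage-collects the CEP:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cepOwnerRef builds an ownerReference pointing at the backing pod. With this
// reference set on the CEP, deleting the pod lets the Kubernetes garbage
// collector delete the CEP as well.
func cepOwnerRef(pod *corev1.Pod) metav1.OwnerReference {
    return metav1.OwnerReference{
        APIVersion: "v1",
        Kind:       "Pod",
        Name:       pod.Name,
        UID:        pod.UID, // the GC matches on UID, not just on name
    }
}

func main() {
    pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
        Name: "alertmanager-0", Namespace: "kube-system", UID: "1234",
    }}
    // Metadata for the CEP; CEPs are named after the pod they back.
    cepMeta := metav1.ObjectMeta{
        Name:            pod.Name,
        Namespace:       pod.Namespace,
        OwnerReferences: []metav1.OwnerReference{cepOwnerRef(pod)},
    }
    fmt.Printf("%+v\n", cepMeta.OwnerReferences[0])
}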

Basically, the timeline of events for an endpoint-delete is:

  • kubectl delete pod app1
  • kubelet sends CNI DEL to Cilium
  • Cilium calls DeleteEndpoint(), which deletes the endpoint internally (removes from BPF maps, etc)
  • app1 pod resource is deleted as the CNI DEL has completed
  • Cilium’s CEP watcher receives a CEP delete (because of the ownerReference set to the now-deleted pod); a sketch of such a watcher follows this list
    • Deletes endpoint’s IP from ipcache, etc
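
As a rough illustration of that last step (not the agent's real watcher; the kubeconfig path, resync period, and handler body are assumptions), a dynamic informer can watch CiliumEndpoint deletes; in the agent, this is the point where the endpoint's IP is released from the ipcache:

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/dynamic/dynamicinformer"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumes a kubeconfig at the default location; adjust as needed.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    cepGVR := schema.GroupVersionResource{Group: "cilium.io", Version: "v2", Resource: "ciliumendpoints"}
    factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
    factory.ForResource(cepGVR).Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        DeleteFunc: func(obj interface{}) {
            cep, ok := obj.(*unstructured.Unstructured)
            if !ok {
                return // e.g. cache.DeletedFinalStateUnknown on missed deletes
            }
            // In the agent, this is where the endpoint's IP would be released
            // from the ipcache; here we only log the event.
            fmt.Println("CEP deleted:", cep.GetNamespace()+"/"+cep.GetName())
        },
    })

    stop := make(chan struct{})
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    <-stop // block; the informer keeps delivering events
}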

@Weil0ng

I’m actually still a bit confused as to why we would have two pods of the same name in this case… as you mentioned, k8s assumes uniqueness of resource names, no? This sounds like a general k8s issue?

I think that’s at the core of my confusion. I think Cilium must follow whatever Kubernetes deems “uniqueness” to be. If StatefulSets are the exception, then Cilium must account for that.

@liuyuan10 The above is why I’m not saying to just go right ahead re: adding a check for containerID in the delete. I think the solution needs to dig one level deeper.

How does it verify the pod exists without connecting to k8s? Do you mean it restores first and then later verifies with k8s and trashes the endpoint?

Yes that’s correct.

Because the pod is quickly recreated by k8s while Cilium is restarting, it probably won’t trash it, because there is a pod with the same name.

Correct, yeah, it seems to me that the pod is back up by the time Cilium goes to validate the restored endpoint.

What’s in question to me is why the CNI ADD in (5) didn’t add a CEP, assuming it was already removed at that time.

I think it’s because a CEP with that name already exists, since CEPs are named after pod names.
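
A small sketch of that name collision under ordinary Kubernetes create semantics (Pods stand in for CEPs here, purely for illustration): a second create with the same name fails with AlreadyExists, which is why the new endpoint's sync has to adopt or update the existing CEP rather than create a fresh one.

package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    k8serrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes/fake"
)

func main() {
    // An object with the pod's name already exists (standing in for the CEP).
    client := fake.NewSimpleClientset(&corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "alertmanager-0", Namespace: "kube-system"},
    })

    // Creating a second object with the same name is rejected by the API.
    _, err := client.CoreV1().Pods("kube-system").Create(context.TODO(), &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "alertmanager-0", Namespace: "kube-system"},
    }, metav1.CreateOptions{})

    fmt.Println(k8serrors.IsAlreadyExists(err)) // true
}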

Curious about this issue and a bit confused by the above chain of events: when, and by whom, is the DELETE for the CEP of “app1” issued to the apiserver before any watcher can get the delete event?

According to @christarazi, I think it’s when the k8s pod is removed: the apiserver removes the CEP as well because there is an owner ref to the k8s pod.