cilium: Cilium agent logs complain about `Unable to fetch kubernetes labels`

Affected versions:

  • Cilium 1.7.x
  • Potentially later versions; community reports below mention Cilium 1.9.6 and 1.9.8.

At this stage, only observed in CI (https://github.com/cilium/cilium/issues/10442).

Symptoms

Always observed

  1. Roughly every 15 seconds, the cilium-agent log prints a warning about `Unable to fetch kubernetes labels`:

    level=warning msg="Unable to fetch kubernetes labels" containerID=015ef4fdbf datapathPolicyRevision=2 desiredPolicyRevision=1 endpointID=1596 error="pod.core \"app3-c6c587577-fq6kr\" not found" identity=5 ipv4=10.10.0.197 ipv6="f00d::a0a:0:0:aeec" k8sPodName=default/app3-c6c587577-fq6kr subsys=resolve-labels-default/app3-c6c587577-fq6kr
    
  2. `cilium status` reports failing controllers named resolve-labels-*:

    Failed controllers:
     controller resolve-labels-default/app1-68cb4f68c5-cftnr failure 'pod.core "app1-68cb4f68c5-cftnr" not found'
    
  3. There is no corresponding pod for these endpoints.

  4. Endpoints show up in `cilium endpoint list` with the identity reserved:init and never receive a proper identity (the sketch after this list shows one way to enumerate them):

    cmd: kubectl exec -n kube-system cilium-56pmk -- cilium endpoint list
    Exitcode: 0 
    Stdout:
     ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                            IPv6                 IPv4          STATUS
                ENFORCEMENT        ENFORCEMENT
     898        Enabled            Enabled           5          reserved:init                                          f00d::a0a:0:0:af1    10.10.0.151   ready
     1479       Enabled            Enabled           5          reserved:init                                          f00d::a0a:0:0:969e   10.10.0.149   ready
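
To enumerate the affected endpoints on a node, the agent logs and the endpoint list can be combined; a minimal sketch, reusing the agent pod name cilium-56pmk from the output above:

    # The warnings name the pods whose labels the agent is still trying to resolve.
    kubectl -n kube-system logs cilium-56pmk | grep "Unable to fetch kubernetes labels"

    # Endpoints stuck with the init identity are the "phantom" endpoints described above.
    kubectl -n kube-system exec cilium-56pmk -- cilium endpoint list | grep "reserved:init"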
    

May also be observed

  1. New Kubernetes pods cannot be deployed onto the node; pod deployment logs report IPAM (IP address) exhaustion. A way to confirm this is sketched below.
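
To check the IPAM side of this, the agent's IPAM state can be inspected; a minimal sketch, again assuming the agent pod on the affected node is cilium-56pmk:

    # IPAM summary for the node (allocated addresses vs. the available pool).
    kubectl -n kube-system exec cilium-56pmk -- cilium status | grep -i ipam

    # More detail, including the list of allocated addresses.
    kubectl -n kube-system exec cilium-56pmk -- cilium status --verbose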

Impact

~~There is no known impact other than the failure logs potentially consuming disk space.~~ The related application pods were already deleted, so there is no traffic impact. UPDATE 2020-02-14: If pods continue to be deployed to a node while it is in this state, each “phantom” endpoint consumes an IP address. Once the number of phantom endpoints reaches the size of the podCIDR allocated to the node, IPAM begins to fail and subsequent pod deployments onto that node may fail until user action is taken to mitigate (see Mitigation below).
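
A rough way to judge how close a node is to that point is to compare the number of init-identity endpoints with the size of the node's pod address pool; a sketch, assuming a node named node-1 and an agent pod cilium-56pmk (both names hypothetical):

    # The node's podCIDR; a /24 provides roughly 254 usable pod IPs.
    kubectl get node node-1 -o jsonpath='{.spec.podCIDR}'

    # Number of "phantom" endpoints currently holding an IP on that node.
    kubectl -n kube-system exec cilium-56pmk -- cilium endpoint list | grep -c "reserved:init"

Depending on the IPAM mode in use, the allocation range may instead be recorded on the CiliumNode resource rather than on the Kubernetes Node object.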

Mitigation

Restarting the cilium-agent should cause Cilium to re-evaluate the existence of these endpoints and clean them up.
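
A minimal sketch of that restart, assuming the default DaemonSet name cilium and an affected agent pod named cilium-56pmk; deleting the pod lets the DaemonSet recreate it, or the whole DaemonSet can be rolled if several nodes are affected:

    # Restart the agent on the affected node only.
    kubectl -n kube-system delete pod cilium-56pmk

    # Or restart every agent managed by the DaemonSet.
    kubectl -n kube-system rollout restart daemonset/cilium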

If you have more details on this issue, for example because you observe it in a real Cilium deployment, please post details below and react to this issue with 👍 so we know how widely it affects the community.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 10
  • Comments: 27 (10 by maintainers)

Most upvoted comments

This issue has not seen any activity since it was marked stale. Closing.

I’ve reopened the issue since it seems like community users are still observing this.

As for next steps, I think we need a short set of reliable steps to reproduce the issue. At the moment it’s a bit tricky to understand how Cilium manages to get into this state in the first place. Is it due to the use of k8s Jobs or batch workloads? Or are there other common steps that trigger this behaviour? It may also help if someone can upload a sysdump from a cluster where the problem is actively occurring.
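
For anyone able to capture state while the problem is active, a sketch of collecting a sysdump; which tool applies depends on your setup (the cilium CLI's sysdump subcommand on newer setups, or the standalone cilium-sysdump script that was current when this issue was filed):

    # With the cilium CLI (https://github.com/cilium/cilium-cli):
    cilium sysdump

    # With the standalone tool (https://github.com/cilium/cilium-sysdump):
    curl -sLO https://github.com/cilium/cilium-sysdump/releases/latest/download/cilium-sysdump.zip
    python cilium-sysdump.zip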

@joestringer This still occurs, could we reopen this?

Additional Info

  • Cilium version: v1.9.8
  • Kubernetes version: v1.19.10

We will get a sysdump next time this triggers

I just ran into this with Cilium 1.9.6 and Kubernetes 1.19.10 too.

The ‘phantom’ pods were all previously-deleted jobs.