cilium: Cilium agent logs complain about `Unable to fetch kubernetes labels`
Affected versions:
- Cilium 1.7.x
- Potentially later versions, not yet observed.
At this stage, only observed in CI (https://github.com/cilium/cilium/issues/10442).
Symptoms
Always observed
- Every 15s or so, the cilium-agent log prints a warning about `Unable to fetch kubernetes labels`:

  ```
  level=warning msg="Unable to fetch kubernetes labels" containerID=015ef4fdbf datapathPolicyRevision=2 desiredPolicyRevision=1 endpointID=1596 error="pod.core \"app3-c6c587577-fq6kr\" not found" identity=5 ipv4=10.10.0.197 ipv6="f00d::a0a:0:0:aeec" k8sPodName=default/app3-c6c587577-fq6kr subsys=resolve-labels-default/app3-c6c587577-fq6kr
  ```

- `cilium status` reports failing controllers with `resolve-labels-xxx`:

  ```
  Failed controllers: controller resolve-labels-default/app1-68cb4f68c5-cftnr failure 'pod.core "app1-68cb4f68c5-cftnr" not found'
  ```

- There is no corresponding pod for these endpoints.

- Endpoints show up in the `cilium endpoint list` output with the identity `reserved:init`, and they never get a proper identity:

  ```
  cmd: kubectl exec -n kube-system cilium-56pmk -- cilium endpoint list
  Exitcode: 0
  Stdout:
  ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                 IPv4          STATUS
             ENFORCEMENT        ENFORCEMENT
  898        Enabled            Enabled           5          reserved:init                 f00d::a0a:0:0:af1    10.10.0.151   ready
  1479       Enabled            Enabled           5          reserved:init                 f00d::a0a:0:0:969e   10.10.0.149   ready
  ```
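To check whether a cluster is affected, a sketch along the following lines lists endpoints stuck with the `reserved:init` identity on every node. It assumes the Cilium agents run as a DaemonSet in `kube-system` with the standard `k8s-app=cilium` label; adjust the namespace and selector for your deployment. Note that a freshly created endpoint can legitimately carry `reserved:init` for a short time; the phantom endpoints described above are the ones that keep it and have no corresponding pod.

```bash
#!/usr/bin/env bash
# List endpoints that carry the reserved:init identity on each Cilium agent.
# Assumes the Cilium DaemonSet runs in kube-system with the k8s-app=cilium label.
for pod in $(kubectl -n kube-system get pods -l k8s-app=cilium -o name); do
  echo "== ${pod} =="
  kubectl -n kube-system exec "${pod}" -- cilium endpoint list \
    | grep 'reserved:init' || echo "  no reserved:init endpoints"
done
```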
May also be observed
- New Kubernetes pods cannot be deployed onto the node; pod deployment logs report IPAM exhaustion.
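If you suspect a node has already reached this point, one way to check is to look for pod sandbox creation failures, which is where CNI/IPAM errors usually surface as events; this assumes the failure is reported under the usual `FailedCreatePodSandBox` kubelet event reason:

```bash
# Show recent pod sandbox creation failures across the cluster; when the node's
# podCIDR is exhausted, the CNI/IPAM error message typically appears here.
kubectl get events --all-namespaces \
  --field-selector reason=FailedCreatePodSandBox \
  --sort-by=.lastTimestamp
```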
Impact
~~There is no known impact other than the failure logs potentially consuming disk space.~~ The related application pods were already deleted, so there is no traffic impact.

UPDATE 2020-02-14: If pods continue to be deployed to a node while it is in this state, each “phantom” endpoint consumes an IP address. If the total number of “phantom” endpoints reaches the size of the podCIDR allocated to the node, IPAM will begin to fail and subsequent pod deployments onto the node may fail until the user takes action to mitigate.
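To see how far along a node is, the agent's IPAM allocation can be inspected with `cilium status --verbose`. A minimal check, reusing the agent pod name from the output above (replace it with the agent on the node you are inspecting) and assuming the `kube-system` namespace:

```bash
# Print the IPAM section of the agent status; it shows how many addresses are
# allocated out of the podCIDR assigned to this node.
kubectl -n kube-system exec cilium-56pmk -- cilium status --verbose | grep -A 10 'IPAM'
```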
Mitigation
Restarting the cilium-agent should cause Cilium to re-evaluate the existence of these endpoints and clean them up.
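For example, assuming the agents run as the `cilium` DaemonSet in `kube-system`, deleting the agent pod on the affected node lets the DaemonSet controller recreate it and re-sync endpoint state (the node and pod names below are placeholders):

```bash
# Find the Cilium agent pod on the affected node, then delete it so the
# DaemonSet controller restarts the agent.
kubectl -n kube-system get pods -l k8s-app=cilium -o wide \
  --field-selector spec.nodeName=<affected-node>
kubectl -n kube-system delete pod <cilium-agent-pod-on-that-node>
```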
If you have more details on this issue, for example because you observe it in a real Cilium deployment, please post details below and react to this issue with 👍 so we know how widely it affects the community.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 10
- Comments: 27 (10 by maintainers)
This issue has not seen any activity since it was marked stale. Closing.
I’ve reopened the issue since it seems like community users are still observing this.
As for next steps, I think we need a short set of reliable steps to reproduce the issue. At the moment it’s a bit tricky to understand how Cilium manages to get into this state in the first place. Is it due to use of k8s jobs or batches? Or are there other common steps that trigger this behaviour? It may also help if someone can upload a sysdump from a cluster where the problem is actively occurring.
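For reference, a sysdump can be collected roughly as follows; this is a sketch assuming either a recent `cilium` CLI with the `sysdump` subcommand or the standalone cilium-sysdump tool that was current around the Cilium 1.7 era:

```bash
# Option 1: recent cilium CLI (https://github.com/cilium/cilium-cli)
cilium sysdump

# Option 2: standalone tool from the Cilium 1.7/1.8 era
curl -sLO https://github.com/cilium/cilium-sysdump/releases/latest/download/cilium-sysdump.zip
python cilium-sysdump.zip
```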
@joestringer This still occurs, could we reopen this?
Additional Info
This issue has not seen any activity since it was marked stale. Closing.
We will get a sysdump next time this triggers
I just ran into this with Cilium 1.9.6 and Kubernetes 1.19.10 too.
The ‘phantom’ pods were all previously-deleted jobs.