cilium: Extremely slow agent startup

Is there an existing issue for this?

I have searched the existing issues

What happened?

During a rolling upgrade to v1.12.9, each cilium agent pod stays not ready for 2-3 minutes.

Cilium Version

1.12.9 e0bb30a 2023-04-17T23:54:19+02:00 go version go1.18.10 linux/amd64

Kernel Version

Linux ip-10-200-14-243 5.15.0-1033-aws #37~20.04.1-Ubuntu SMP Fri Mar 17 11:39:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

v1.22.17-eks-48e63af

Sysdump

No response

Relevant log output

No response

Anything else?

The slow startup correlates with cluster size (some clusters don’t have this issue).

Until the agent starts, all I get is:

# cilium status
Get "http:///var/run/cilium/cilium.sock/v1/healthz": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
Is the agent running?

I do see a bunch of clang and tc commands running in the background during this time.
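
To put a number on the window during which `cilium status` fails as shown above, one option is to poll for the agent socket and record how long it takes to appear. The following is only a rough sketch: the helper name `waitForSocket`, the one-second polling interval, and the five-minute timeout are assumptions made for illustration. Only the socket path comes from the error message above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// waitForSocket is a hypothetical helper: it polls until path exists or the
// timeout elapses, and returns how long it waited.
func waitForSocket(path string, interval, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	for {
		_, err := os.Stat(path)
		if err == nil {
			// The path exists, so the agent has created its socket.
			return time.Since(start), nil
		}
		if !errors.Is(err, os.ErrNotExist) {
			// Anything other than "not found" (e.g. permissions) is reported as-is.
			return time.Since(start), err
		}
		if time.Since(start) >= timeout {
			return time.Since(start), fmt.Errorf("%s did not appear within %s", path, timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Socket path taken from the "cilium status" error; interval and timeout are assumed values.
	waited, err := waitForSocket("/var/run/cilium/cilium.sock", time.Second, 5*time.Minute)
	if err != nil {
		fmt.Println("gave up:", err)
		return
	}
	fmt.Printf("socket appeared after %s\n", waited.Round(time.Second))
}
```

Running this on a node right after the agent pod restarts would give a per-node startup time that can be compared across clusters of different sizes. Note that the socket appearing does not by itself mean the pod is marked ready.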

Code of Conduct

I agree to follow this project’s Code of Conduct

About this issue

State: closed
Created a year ago
Comments: 15 (15 by maintainers)

Commits related to this issue

ipcache: switch named ports to reference counting This commit introduces reference counting for named ports. Using the reference counting, we know when to add or remove named ports in our bookkeeping... — committed to bimmlerd/cilium by bimmlerd a year ago
ipcache: switch named ports to reference counting This commit introduces reference counting for named ports. Using the reference counting, we know when to add or remove named ports in our bookkeeping... — committed to cilium/cilium by bimmlerd a year ago
ipcache: switch named ports to reference counting [ upstream commit 33079de7fb6292efd4b837de2f696ee2edaeb8f4 ] This commit introduces reference counting for named ports. Using the reference counting... — committed to bimmlerd/cilium by bimmlerd a year ago
ipcache: switch named ports to reference counting [ upstream commit 33079de7fb6292efd4b837de2f696ee2edaeb8f4 ] [ backporter's notes: We don't have the luxury of generics here, hence instead of using... — committed to bimmlerd/cilium by bimmlerd a year ago
ipcache: switch named ports to reference counting [ upstream commit 33079de7fb6292efd4b837de2f696ee2edaeb8f4 ] This commit introduces reference counting for named ports. Using the reference counting... — committed to cilium/cilium by bimmlerd a year ago
ipcache: switch named ports to reference counting [ upstream commit 33079de7fb6292efd4b837de2f696ee2edaeb8f4 ] [ backporter's notes: We don't have the luxury of generics here, hence instead of using... — committed to cilium/cilium by bimmlerd a year ago
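
The commit messages above describe reference counting for named ports. As a minimal sketch of that general idea (not the code from the linked commits), the Go snippet below keeps a count per named port and reports when a port first appears or finally disappears, so the caller knows when bookkeeping needs to change. The type and method names (`portKey`, `NamedPortRefs`, `Add`, `Delete`) are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// portKey is a hypothetical identifier for a named port mapping.
type portKey struct {
	Name  string
	Port  uint16
	Proto string
}

// NamedPortRefs is a hypothetical reference-counted set of named ports.
type NamedPortRefs struct {
	refs map[portKey]int
}

// NewNamedPortRefs returns an empty reference-counted set.
func NewNamedPortRefs() *NamedPortRefs {
	return &NamedPortRefs{refs: make(map[portKey]int)}
}

// Add increments the count for k and reports whether k was newly added,
// i.e. whether this was the first reference.
func (n *NamedPortRefs) Add(k portKey) bool {
	n.refs[k]++
	return n.refs[k] == 1
}

// Delete decrements the count for k and reports whether the last reference
// was dropped, i.e. whether k was removed from the set.
func (n *NamedPortRefs) Delete(k portKey) bool {
	count, ok := n.refs[k]
	if !ok {
		return false // unknown key: nothing to remove
	}
	if count <= 1 {
		delete(n.refs, k)
		return true
	}
	n.refs[k] = count - 1
	return false
}

// Len returns the number of distinct named ports currently referenced.
func (n *NamedPortRefs) Len() int { return len(n.refs) }

func main() {
	refs := NewNamedPortRefs()
	http := portKey{Name: "http", Port: 8080, Proto: "TCP"}

	fmt.Println(refs.Add(http))    // true: first reference, port becomes known
	fmt.Println(refs.Add(http))    // false: already known, only the count changes
	fmt.Println(refs.Delete(http)) // false: one reference remains
	fmt.Println(refs.Delete(http)) // true: last reference gone, port removed
}
```

With this shape, each add or remove is a single map operation, and the boolean result tells the caller whether the set of known named ports actually changed.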

Most upvoted comments

@dctrwatson Could you give us an indication of the cluster size? How many pods do you have?

The cluster where that pprof was taken has ~9k endpoints, ~7k identities, ~2k network policies, ~7k services, ~10k pods, and ~200 nodes.

dctrwatson on Apr 20, 2023