linkerd2: Controller CrashLoopBackoff in 2.5.0 with more than 3K pods
Bug Report
What is the issue?
We updated from Linkerd 2.4.0 to Linkerd 2.5.0. After upgrading the controller pods started crashlooping and the linkerd proxy errored with “too many in-flight messages”. We have a quite large setup with 50 nodes in our cluster, but 3400 Pods running, most of them meshed.
We also noticed that Linkerd Prometheus crashlooped. First we thought this was due to resource limits, so we increase them from 8Gi to 48Gi. It consumed a fair amount of memory and then stopped crashing at around 28Gi. However the Controller Pods still crashlooped, apparantely the destination
container was crashing frequently.
How can it be reproduced?
Upgrade our cluster to 2.5.0… 😉
Logs, error output, etc
(If the output is long, please create a gist and paste the link here.)
Environment
- Kubernetes Version: 1.15.2
- Cluster Environment: (GKE, AKS, kops, …): bare-metal
- Host OS: ContainerLinux
- Linkerd version: 2.4.0->2.5.0
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 15 (15 by maintainers)
@cpretzer yes, i can confirm