linkerd2: Controller CrashLoopBackOff in 2.5.0 with more than 3K pods

Bug Report

What is the issue?

We upgraded from Linkerd 2.4.0 to Linkerd 2.5.0. After the upgrade, the controller pods started crashlooping and the Linkerd proxy reported the error “too many in-flight messages”. Our setup is fairly large: the cluster has only 50 nodes but runs 3400 pods, most of them meshed.

We also noticed that Linkerd Prometheus was crashlooping. At first we thought this was due to resource limits, so we increased them from 8Gi to 48Gi. Prometheus then consumed a fair amount of memory and stopped crashing at around 28Gi. However, the controller pods still crashlooped; apparently the destination container was crashing frequently.
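For anyone triaging a similar situation, a minimal sketch of how one might check which containers are restarting and why, for example to tell an OOM kill apart from an application crash. It relies on the standard Kubernetes pod status fields (`containerStatuses`, `restartCount`, `lastState.terminated.reason`). The helper names `crashing_containers` and `fetch_pods` are hypothetical, and the `linkerd` namespace is an assumption about the install; this is not output from our cluster.

```python
import json
import subprocess


def crashing_containers(pod_list):
    """Return (pod, container, restarts, last_reason) for every container
    that has restarted at least once, given parsed `kubectl get pods -o json` output."""
    rows = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            if cs.get("restartCount", 0) == 0:
                continue
            # lastState.terminated.reason is e.g. "OOMKilled" or "Error"
            terminated = cs.get("lastState", {}).get("terminated", {})
            rows.append(
                (pod_name, cs["name"], cs["restartCount"], terminated.get("reason", "unknown"))
            )
    return rows


def fetch_pods(namespace="linkerd"):
    """Fetch the pod list via kubectl. The namespace default is an assumption."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Calling `crashing_containers(fetch_pods())` on a machine with cluster access would list each restarting container with its last termination reason. A reason of `OOMKilled` would point at memory limits, while `Error` would suggest the process itself exited.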

How can it be reproduced?

Upgrade our cluster to 2.5.0… 😉

Logs, error output, etc


Environment

Kubernetes Version: 1.15.2
Cluster Environment: bare-metal
Host OS: ContainerLinux
Linkerd version: 2.4.0->2.5.0

About this issue

State: closed
Created 5 years ago
Comments: 15 (15 by maintainers)

Most upvoted comments

@cpretzer yes, i can confirm

christianhuening on Sep 20, 2019