linkerd2: Outbound requests for some downstream services consistently fail after the proxy reported "panicked at 'cancel sender lost'"

Discussed in https://github.com/linkerd/linkerd2/discussions/8641


Originally posted by DarkWalker June 9, 2022

What’s the issue

linkerd-proxy goes into a "broken" state after which all requests to a specific downstream service within the mesh fail with either an INTERNAL or a CANCELLED error (as observed by the caller). Requests never make it to the target service.

Details about the setup

  • there are a total of 16 services in the mesh with 2 to 60 pods each; the total number of pods in the mesh is 278
  • all the services use gRPC; Linkerd is used to encrypt the traffic, and other than that there is no encryption within the cluster
  • we recently migrated from 2.10 to 2.11.2 and this seems to be related, but other things in the cluster changed recently as well, so we rolled everything back and are introducing changes one by one; we will have the same setup on 2.10 in about a week, and then I will be able to tell for sure whether it's 2.11.2-related or whether we see a similar issue on 2.10

How the problem manifests itself

  • only some of the services within the mesh are affected, and this seems to be pretty consistent across all the incidents (we had 3 so far); on top of that, only certain pods are affected (e.g. for one service 3 out of 3 pods were affected, but for another 9 out of 12 were affected during the first incident, then 3 out of 12 during the second and 5 out of 12 during the third)
  • when pods go into a bad state they do not recover automatically - one of the affected services was not critical, so I left 2 out of 3 pods in a bad state and they stayed in the bad state for more than 24 hours before we rolled back linkerd and had to restart all the deployments
  • the mitigation is to delete the pod (the deployment then recreates it); the new pods that come up work just fine
  • the errors observed by the gRPC interceptor on the client side are CANCELLED and INTERNAL
  • this happens only in production; we did a fair amount of load testing prior to the upgrade, as well as deliberately trying to repro the issue after it happened the first time, but there is no repro so far outside of production load

What we see in linkerd-proxy container logs of affected pods

  1. first, there is the error thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.12/src/ready_cache/cache.rs:415:13 - this message is printed once, when the pod goes into the bad state

  2. then, following the error above, one of two errors appears for every outgoing request from this pod:

  • either INFO ThreadId(01) outbound:server{orig_dst=10.0.0.1:80}:rescue{client.addr=172.16.0.1:35024}: linkerd_app_core::errors::respond: Request failed error=buffer's worker closed unexpectedly, or

  • this one WARN ThreadId(01) outbound:server{orig_dst=10.0.0.1:80}:rescue{client.addr=172.16.0.1:49336}: linkerd_app_outbound::http::server: Unexpected error error=buffered service failed: panic.

  3. buffer's worker closed unexpectedly is also present in linkerd-proxy logs of unaffected pods, which function just fine, but cancel sender lost only appears on pods when they go into the bad state

Observations

  • Right before it happens, each time multiple instances of the downstream service start failing their readiness probe (intentionally - this is how we take pods out of rotation when they reach the point of perf degradation due to a large amount of traffic; see the sketch after this list). So it seems that right after a bunch of pods start failing the readiness probe, linkerd-proxy on certain pods of services that depend on this one goes into the bad state and does not self-heal after everything is back to reporting "ready".
  • I can't be sure whether there was a moment where all the replicas of that service failed the readiness check (so it's possible there was a brief period of downtime, no more than 10 seconds, during which not a single replica was available for the service; it may also be true that at any given moment at least one replica was available - our metrics have a resolution of 30 seconds and all client calls are retried, so the best conclusion I can make is that the total downtime, if it happened, was no longer than 10 seconds)
  • Out of 7 consumers only 3 were affected, and this has been consistent across all 3 incidents. Non-affected services were making either around 40,000 RPS (with 1,000 RPS from every replica of one service and 3,000 RPS from every replica of another) or didn't make any calls at all when the incidents happened (but made calls less than a minute after, and those calls worked fine); affected services were making around 50 RPS (with ~4 RPS from every replica)
  • This issue seems to be similar to https://github.com/linkerd/linkerd2/issues/6086
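
To make the readiness-probe-based load shedding mentioned above concrete, the mechanism looks roughly like the sketch below; the path, port, and thresholds are illustrative placeholders rather than our actual values. The application starts returning a non-200 status from its health endpoint once it considers itself overloaded, kubelet then marks the pod NotReady, and the pod is removed from the Service endpoints until it reports healthy again:

readinessProbe:
  httpGet:
    path: /healthz/ready   # the app returns non-200 here once it wants out of rotation
    port: 8080             # illustrative port
  periodSeconds: 5         # how often kubelet probes
  failureThreshold: 2      # consecutive failures before the pod is marked NotReady
  successThreshold: 1      # a single success puts the pod back into rotation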

Attached are logs captured after running curl -v --data 'linkerd=trace,tower=trace' -X PUT localhost:28014/proxy-log-level, covering the window between two buffer's worker closed unexpectedly errors, which are followed by Handling error on gRPC connection code=Internal error - this seems to be what we observe on the application side (the IP addresses as well as some other bits are redacted for security purposes, please let me know if that's a problem).

linkerd_proxy_trace.txt
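
For reference: the log level set through the admin endpoint above is not persistent (a restarted proxy comes back with its default level). A minimal sketch of an alternative, assuming the standard config.linkerd.io/proxy-log-level pod annotation, using the same filter as the curl command plus a warn default for everything else:

annotations:
  config.linkerd.io/proxy-log-level: "warn,linkerd=trace,tower=trace"   # illustrative filter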

I would appreciate any help in debugging this issue, or tips on how to work around it or repro it in an isolated environment.


About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 16 (8 by maintainers)


Most upvoted comments

I’ve applied these annotations:

  config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
  config.linkerd.io/proxy-version: ver.tower-rc-fix.b296154c

…and for the past hour I’ve not seen any occurrences of either of these log entries:

thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.12/src/ready_cache/cache.rs:415:13

linkerd_app_core::serve: Server failed to become ready error=buffered service failed: buffered service failed: task 28182 panicked error.sources=[buffered service failed: task 28182 panicked, task 28182 panicked] client.addr=10.244.3.95:49580

I will continue to monitor for 24 hours and report back, but so far this patch version is looking very promising.

We have isolated a similar repro and fixed it in https://github.com/tower-rs/tower/pull/668 and https://github.com/linkerd/linkerd2-proxy/pull/1753. The fix can be tested with:

annotations:
  config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
  config.linkerd.io/proxy-version: ver.tower-rc-fix.b296154c
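
For anyone else who wants to test this: these are pod-template annotations, so they go under spec.template.metadata.annotations of the workload being patched. A minimal sketch, where the Deployment name, labels, and application image are placeholders and pod-level injection is assumed:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                  # placeholder
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
      annotations:
        linkerd.io/inject: enabled  # assumes pod-level injection; namespace-level injection also works
        config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
        config.linkerd.io/proxy-version: ver.tower-rc-fix.b296154c
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:latest   # placeholder

Since the annotations sit on the pod template, adding them triggers a rollout and the new pods come up with the patched proxy build.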

Attached are log files at debug level.

linkerd-debug-log.txt

Hi @hawkw. We are using Azure AKS, with Standard_D8ads_v5 nodes running AKSUbuntu-1804gen2containerd-2022.04.13.