linkerd2: Outbound requests fail intermittently after the proxy reported "panicked at 'cancel sender lost'"
Bug Report
What is the issue?
The outbound requests of a meshed pod fail intermittently after its linkerd-proxy reported “panicked at ‘cancel sender lost’”.
We are not sure what triggers the issue, but from the logs we can tell that the linkerd-proxy first emits the following:
```
thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.6/src/ready_cache/cache.rs:397:13
```
Afterwards, around 50% of the outbound requests start failing intermittently with the following message:
```
[ 19948.510484s] WARN ThreadId(01) server{orig_dst=172.20.207.194:80}: linkerd_app_core::errors: Failed to proxy request: buffer's worker closed unexpectedly client.addr=10.250.162.208:59692
```
Additional context
The outbound destination is also a meshed service.
The linkerd-init container exited with “Completed” status in the pod.
Before and during the incident, neither the application container nor the proxy container restarted.
Once we restarted the pod manually, outbound traffic succeeded at 100% again.
`linkerd check` output:
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2021-06-06T08:57:23Z
see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ can initialize the client
√ can query the control plane API
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
is running version 2.10.0 but the latest stable version is 2.10.1
see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 2.10.0 but the latest stable version is 2.10.1
see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
Status check results are √
Linkerd extensions checks
=========================
linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
‼ linkerd-viz pods are injected
could not find proxy container for prometheus-7b5758b6ff-xlqv4 pod
see https://linkerd.io/checks/#l5d-viz-pods-injection for hints
√ viz extension pods are running
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check
Status check results are √
Environment
- Kubernetes Version: v1.18.9-eks-d1db3c
- Cluster Environment: EKS
- Linkerd version: control plane v2.10.0; linkerd-proxy: the issue happened with both v2.139 and v2.142; linkerd-init: cr.l5d.io/linkerd/proxy-init:v1.3.9
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 24 (23 by maintainers)
Commits related to this issue
- deps: update `futures` to 0.3.15 This branch updates the `futures` crate to v0.3.15. This includes a fix for task starvation with `FuturesUnordered` (added in 0.3.13). This may or may not be related ... — committed to linkerd/linkerd2-proxy by hawkw 3 years ago
- deps: update `futures` to 0.3.15 (#1022) This branch updates the `futures` crate to v0.3.15. This includes a fix for task starvation with `FuturesUnordered` (added in 0.3.13). This may or may not b... — committed to linkerd/linkerd2-proxy by hawkw 3 years ago
- deps: update `futures` to 0.3.15 (#1022) This branch updates the `futures` crate to v0.3.15. This includes a fix for task starvation with `FuturesUnordered` (added in 0.3.13). This may or may not b... — committed to linkerd/drain-rs by hawkw 3 years ago
- ready-cache: Add endpoint-level debugging linkerd/linkerd2#6086 describes an issue that sounds closely related to tower-rs/tower#415: There's some sort of consistency issue between the ready-cache's ... — committed to olix0r/tower by olix0r 3 years ago
- update Tower to 0.4.13 to fix load balancer panic Tower [v0.4.13] includes a fix for a bug in the `tower::ready_cache` module, tower-rs/tower#415. The `ready_cache` module is used internally in Tower... — committed to linkerd/linkerd2-proxy by hawkw 2 years ago
- update Tower to 0.4.13 to fix load balancer panic (#1758) Tower [v0.4.13] includes a fix for a bug in the `tower::ready_cache` module, tower-rs/tower#415. The `ready_cache` module is used internally... — committed to linkerd/linkerd2-proxy by hawkw 2 years ago
@Wenliang-CHEN Thanks, this is helpful. I doubt that the futures change will help this issue. I suspect that there’s a race condition around updating the balancer with new endpoints where we enter an illegal state. We’ll focus more on stress testing the update path.