linkerd2: Outbound requests fail intermittently after the proxy reported "panicked at 'cancel sender lost'"

Bug Report

What is the issue?

Outbound requests from a meshed pod fail intermittently after its linkerd-proxy reports “panicked at ‘cancel sender lost’”.

We are not sure what triggers the issue. From the logs we can tell that the linkerd-proxy first emits the following:

thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.6/src/ready_cache/cache.rs:397:13

After that, around 50% of outbound requests start failing intermittently with this message:

[ 19948.510484s]  WARN ThreadId(01) server{orig_dst=172.20.207.194:80}: linkerd_app_core::errors: Failed to proxy request: buffer's worker closed unexpectedly client.addr=10.250.162.208:59692
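
For context on what the panic refers to: tower's ready_cache keeps a cancellation handle for every endpoint whose service is still becoming ready, and the panic fires when that handle is unexpectedly missing at the moment the cache goes to use it. The sketch below is purely illustrative, not tower's actual implementation; the type and method names are hypothetical, but it shows the kind of invariant a "cancel sender lost" panic points at.

    use std::collections::HashMap;
    use std::hash::Hash;
    use tokio::sync::oneshot;

    // Hypothetical per-endpoint cancellation registry, loosely modeled on the
    // invariant behind the panic: every pending endpoint must keep a live
    // cancel sender until it resolves or is deliberately evicted.
    struct PendingEndpoints<K> {
        cancel_txs: HashMap<K, oneshot::Sender<()>>,
    }

    impl<K: Hash + Eq> PendingEndpoints<K> {
        // An endpoint update arrives: register a new pending entry,
        // cancelling any superseded one first.
        fn insert(&mut self, key: K) -> oneshot::Receiver<()> {
            let (tx, rx) = oneshot::channel();
            if let Some(old) = self.cancel_txs.insert(key, tx) {
                let _ = old.send(()); // cancel the superseded pending service
            }
            rx
        }

        // A pending endpoint resolved: its cancel sender must still be there.
        // If another code path already removed it, the state is inconsistent.
        fn complete(&mut self, key: &K) {
            self.cancel_txs
                .remove(key)
                .expect("cancel sender lost"); // the panic seen in the proxy log
        }
    }

If the task driving the balancer unwinds on such a panic, later requests queued into its buffer have no worker left to serve them, which would be consistent with the "buffer's worker closed unexpectedly" errors that follow.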

Additional context

The outbound destination is also a meshed service.

The linkerd-init container in the pod exited with “Completed” status.

Before and during the incident, neither the application container nor the proxy container restarted.

Once we restarted the pod manually, outbound traffic succeeded at 100% again.

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2021-06-06T08:57:23Z
    see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ can initialize the client
√ can query the control plane API
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.10.0 but the latest stable version is 2.10.1
    see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.10.0 but the latest stable version is 2.10.1
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
Status check results are √
Linkerd extensions checks
=========================
linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
‼ linkerd-viz pods are injected
    could not find proxy container for prometheus-7b5758b6ff-xlqv4 pod
    see https://linkerd.io/checks/#https://linkerd.io/checks/#l5d-viz-pods-injection for hints
√ viz extension pods are running
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check
Status check results are √

Environment

  • Kubernetes Version: v1.18.9-eks-d1db3c
  • Cluster Environment: EKS
  • Linkerd version: control plane v2.10.0; linkerd-proxy: the panic occurred with both v2.139 and v2.142; linkerd-init: cr.l5d.io/linkerd/proxy-init:v1.3.9

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 24 (23 by maintainers)

Most upvoted comments

@Wenliang-CHEN Thanks, this is helpful. I doubt that the futures change will help this issue. I suspect that there’s a race condition around updating the balancer with new endpoints where we enter an illegal state. We’ll focus more on stress testing the update path.
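
If it helps to picture that suspicion, here is one hedged shape the race could take, reusing the hypothetical PendingEndpoints type from the sketch further up (again, illustrative only, not tower's actual code): a discovery update evicts an endpoint through one path while the completion path still expects to find its cancel sender.

    // Hypothetical race shape: an update path drops endpoint "a" directly,
    // then the still-pending service for "a" resolves and the completion
    // path can no longer find its cancel sender.
    fn racy_update_then_complete() {
        let mut pending = PendingEndpoints {
            cancel_txs: std::collections::HashMap::new(),
        };

        // The balancer learns about endpoint "a" and starts driving it to ready.
        let _cancel_rx = pending.insert("a");

        // A later update removes "a" without going through the completion
        // path, so its cancel sender is dropped.
        pending.cancel_txs.remove("a");

        // The pending service for "a" resolves anyway; the invariant is broken.
        pending.complete(&"a"); // panics: "cancel sender lost"
    }

Pinning down where the real update and completion paths interleave like this is presumably what stress testing the update path would have to surface.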