linkerd2: Destination controller goes into failfast mode
What is the issue?
We use canaries with ArgoRollout via SMI, which means there is a lot of EndpointSlice churn due to frequent release trains and their 10-15 minute canary rollouts (~300-400 meshed pods).
Occasionally we will see the destination service go into failfast mode, which is problematic during a canary deployment because old IPs of pods that have already been torn down by ArgoRollout still receive traffic even though they should not. To remedy this we need to restart the destination deployment.
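A minimal sketch of how the staleness can be confirmed (the service name, namespace, and port are placeholders for whichever canary service is affected): compare what the destination controller is advertising against the live EndpointSlices.
# What the destination controller believes it should be serving
linkerd diagnostics endpoints my-svc.my-namespace.svc.cluster.local:8080
# What Kubernetes actually has for the same service
kubectl -n my-namespace get endpointslices -l kubernetes.io/service-name=my-svc
IPs that show up in the first output but not in the second are the stale endpoints that keep receiving traffic.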
How can it be reproduced?
This happens intermittently under load. We run 5 instances of each control plane component.
Logs, error output, etc
The following are all logs from destination pods:
[ 18093.017333s] WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed
[951475.788267s] WARN ThreadId(01) linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[950833.887787s] WARN ThreadId(01) inbound:server{port=8086}:controller{addr=localhost:8086}: linkerd_stack::failfast: Service entering failfast after 1s
[950842.896953s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}: linkerd_stack::failfast: Service entering failfast after 10s
[ 18093.017325s] WARN ThreadId(01) inbound:server{port=8090}:rescue{client.addr=10.30.243.33:34358}: linkerd_app_inbound::http::server: Unexpected error error=client 10.30.243.33:34358: server: 10.30.205.39:8090: server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed error.sources=[server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed, operation was canceled: connection closed, connection closed]
[ 3.751136s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 14118.108714s] WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107) error.sources=[Transport endpoint is not connected (os error 107)]
[ 18093.017344s] INFO ThreadId(01) inbound:server{port=8090}:rescue{client.addr=10.30.156.57:57528}: linkerd_app_core::errors::respond: gRPC request failed error=client 10.30.156.57:57528: server: 10.30.205.39:8090: server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed error.sources=[server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed, operation was canceled: connection closed, connection closed]
output of linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on --------------
see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
linkerd-version
---------------
‼ cli is up-to-date
is running version 2.13.4 but the latest stable version is 2.13.6
see https://linkerd.io/2.13/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane and cli versions match
control plane running stable-2.13.6 but cli running stable-2.13.4
see https://linkerd.io/2.13/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies and cli versions match
linkerd-destination-789df68585-288sg running stable-2.13.6 but cli running stable-2.13.4
see https://linkerd.io/2.13/checks/#l5d-cp-proxy-cli-version for hints
linkerd-viz
-----------
‼ viz extension proxies and cli versions match
linkerd-destination-789df68585-288sg running stable-2.13.6 but cli running stable-2.13.4
see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
- Kubernetes: v1.24.16-eks
- linkerd: 2.13.6
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Reactions: 4
- Comments: 17 (8 by maintainers)
Commits related to this issue
- edge-23.10.3 (#11513) ## edge-23.10.3 This edge release fixes issues in the proxy and destination controller which can result in Linkerd proxies sending traffic to stale endpoints. In addition, i... — committed to linkerd/linkerd2 by hawkw 8 months ago
- stable-2.14.2 This stable release fixes issues in the proxy and Destination controller which can result in Linkerd proxies sending traffic to stale endpoints. In addition, it contains a bug fix for p... — committed to linkerd/linkerd2 by mateiidavid 8 months ago
- stable-2.14.2 (#11539) This stable release fixes issues in the proxy and Destination controller which can result in Linkerd proxies sending traffic to stale endpoints. In addition, it contains a bu... — committed to linkerd/linkerd2 by mateiidavid 8 months ago
Wanted to chime in with my team’s experience. Not completely sure if this is the same issue or not, but it looks very similar. We occasionally observe application pods going into a loop where we see a large number of 5xx errors. Our cluster runs on EKS with Linkerd 2.14.1. Log lines from the application pod’s linkerd-proxy:
Starts at 2023-10-07 01:46:32
The cluster in question has about 2,000 to 3,000 pods in the mesh at load. The applications all autoscale, so pods are coming in and out constantly throughout the day. Normally everything works great, but it occasionally gets stuck and starts reporting the above errors for every request. Restarting the linkerd-destination pods does fix the issue. Interestingly, restarting the individual pods having connection issues also temporarily fixes the issue, although for a much shorter period of time. For example, restarting linkerd-destination fixes the issue for a few days or a week, while restarting individual failing pods fixes it for a few minutes or hours.
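For reference, the two workarounds boil down to the following (a sketch; the application deployment and namespace names are placeholders, and the control plane is assumed to run in the linkerd namespace):
# Restart the destination controllers (buys a few days to a week)
kubectl -n linkerd rollout restart deploy/linkerd-destination
# Restart an individual affected workload (buys a few minutes to hours)
kubectl -n my-namespace rollout restart deploy/my-app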
Logs from the destination controller. The log timestamps are not really correlated and come after the application pod errors above. Starts at 2023-10-07 05:55:47
There is one interesting log that does come at around the same time as the errors: Timestamp 2023-10-07 01:27:04
We have a theory that we’re chasing down, and we’ve identified a Prometheus query that can help highlight this problem.
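As a sketch of the kind of query involved, assuming the destination controller exposes a counter that increments on endpoint updates (the metric name below is a placeholder, not necessarily the real one):
# placeholder metric name; substitute whichever update counter your destination controller exposes
sum by (pod) (rate(endpoint_updates_total{namespace="linkerd"}[5m]))
An instance whose update rate drops to zero while its peers keep moving is the signature to look for.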
If you don’t have a Prometheus setup, you can approximate this by faking a time series, i.e. by snapshotting the control plane’s state a few times.
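A minimal sketch of that, assuming the linkerd CLI is available (linkerd diagnostics controller-metrics fetches metrics directly from the control plane pods):
# take a few snapshots a couple of minutes apart
linkerd diagnostics controller-metrics > snapshot-1.txt
sleep 120
linkerd diagnostics controller-metrics > snapshot-2.txt
sleep 120
linkerd diagnostics controller-metrics > snapshot-3.txt
# then compare the destination controller's update counters across snapshots;
# a counter that stops moving on one instance but not on the others is suspect
grep -i update snapshot-*.txt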
We expect that all service updates should flat-line for a single destination controller instance.
We also expect that other destination controller instances should not experience any directly correlated failures of this nature.
We’re still working to confirm our hypothesis and to ultimately reproduce this issue, but in the meantime, capturing data from incidents of this failure will be the most effective way to confirm or disprove our theory.
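One way to capture that data before restarting anything, alongside the endpoint and metrics snapshots sketched earlier (a sketch; assumes the control plane runs in the linkerd namespace):
# logs from one destination controller pod around the incident
# (repeat per pod, or select them all with -l linkerd.io/control-plane-component=destination)
kubectl -n linkerd logs deploy/linkerd-destination -c destination --timestamps --since=2h > destination.log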
Yes, we are scaling nodes up and down as well.