linkerd2: Destination controller goes into failfast mode
What is the issue?
We use canaries with ArgoRollout via SMI, which means there is a lot of EndpointSlice churn due to frequent release trains and their 10-15 minute canary rollouts (~300-400 meshed pods).
Occasionally we will see the destination service go into failfast mode, which is problematic during a canary deployment because old IPs of pods that have already been torn down by ArgoRollout still receive traffic even though they should not. To remedy this we need to restart the destination deployment.
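A minimal sketch of how the staleness can be confirmed (the service name, namespace, and port are placeholders for whichever canary service is affected): compare what the destination controller is advertising against the live EndpointSlices.
# What the destination controller believes it should be serving
linkerd diagnostics endpoints my-svc.my-namespace.svc.cluster.local:8080
# What Kubernetes actually has for the same service
kubectl -n my-namespace get endpointslices -l kubernetes.io/service-name=my-svc
IPs that show up in the first output but not in the second are the stale endpoints that keep receiving traffic.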
How can it be reproduced?
This happens intermittently under load. We run 5 instances of each control plane component.
Logs, error output, etc
The following are all logs from destination pods:
[ 18093.017333s] WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed
[951475.788267s] WARN ThreadId(01) linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[950833.887787s] WARN ThreadId(01) inbound:server{port=8086}:controller{addr=localhost:8086}: linkerd_stack::failfast: Service entering failfast after 1s
[950842.896953s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}: linkerd_stack::failfast: Service entering failfast after 10s
[ 18093.017325s] WARN ThreadId(01) inbound:server{port=8090}:rescue{client.addr=10.30.243.33:34358}: linkerd_app_inbound::http::server: Unexpected error error=client 10.30.243.33:34358: server: 10.30.205.39:8090: server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed error.sources=[server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed, operation was canceled: connection closed, connection closed]
[ 3.751136s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 14118.108714s] WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107) error.sources=[Transport endpoint is not connected (os error 107)]
[ 18093.017344s] INFO ThreadId(01) inbound:server{port=8090}:rescue{client.addr=10.30.156.57:57528}: linkerd_app_core::errors::respond: gRPC request failed error=client 10.30.156.57:57528: server: 10.30.205.39:8090: server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed error.sources=[server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed, operation was canceled: connection closed, connection closed]
output of linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on --------------
see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
linkerd-version
---------------
‼ cli is up-to-date
is running version 2.13.4 but the latest stable version is 2.13.6
see https://linkerd.io/2.13/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane and cli versions match
control plane running stable-2.13.6 but cli running stable-2.13.4
see https://linkerd.io/2.13/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies and cli versions match
linkerd-destination-789df68585-288sg running stable-2.13.6 but cli running stable-2.13.4
see https://linkerd.io/2.13/checks/#l5d-cp-proxy-cli-version for hints
linkerd-viz
-----------
‼ viz extension proxies and cli versions match
linkerd-destination-789df68585-288sg running stable-2.13.6 but cli running stable-2.13.4
see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
- Kubernetes: v1.24.16-eks
- linkerd: 2.13.6
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Reactions: 4
- Comments: 17 (8 by maintainers)
Commits related to this issue
- edge-23.10.3 (#11513) ## edge-23.10.3 This edge release fixes issues in the proxy and destination controller which can result in Linkerd proxies sending traffic to stale endpoints. In addition, i... — committed to linkerd/linkerd2 by hawkw 8 months ago
- stable-2.14.2 This stable release fixes issues in the proxy and Destination controller which can result in Linkerd proxies sending traffic to stale endpoints. In addition, it contains a bug fix for p... — committed to linkerd/linkerd2 by mateiidavid 8 months ago
- stable-2.14.2 (#11539) This stable release fixes issues in the proxy and Destination controller which can result in Linkerd proxies sending traffic to stale endpoints. In addition, it contains a bu... — committed to linkerd/linkerd2 by mateiidavid 8 months ago
Wanted to chime in with my team’s experience. Not completely sure if this is the same issue or not, but it looks very similar. We occasionally observe application pods going into a loop where we see a large number of 5xx errors. Our cluster runs on EKS with Linkerd 2.14.1. Log lines from the application pod’s linkerd-proxy:
Starts at 2023-10-07 01:46:32
The cluster in question has about 2,000 to 3,000 pods in the mesh at load. The applications all autoscale, so pods are coming in and out constantly throughout the day. Normally everything works great, but it occasionally gets stuck and starts reporting the above errors for every request. Restarting the linkerd-destination pods does fix the issue. Interestingly, restarting the individual pods having connection issues also temporarily fixes the issue, although for a much shorter period of time. For example, restarting linkerd-destination fixes the issue for a few days or a week, while restarting individual failing pods fixes it for a few minutes or hours.
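For reference, the two workarounds boil down to the following (a sketch; the application deployment and namespace names are placeholders, and the control plane is assumed to run in the linkerd namespace):
# Restart the destination controllers (buys a few days to a week)
kubectl -n linkerd rollout restart deploy/linkerd-destination
# Restart an individual affected workload (buys a few minutes to hours)
kubectl -n my-namespace rollout restart deploy/my-app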
Logs from the destination controller. The log timestamps are not really correlated and come after the application pod errors above. Starts at 2023-10-07 05:55:47
There is one interesting log that does come at around the same time as the errors: Timestamp 2023-10-07 01:27:04
We have a theory that we’re chasing down, and we’ve identified a Prometheus query that can help highlight this problem.
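As a sketch of the kind of query involved, assuming the destination controller exposes a counter that increments on endpoint updates (the metric name below is a placeholder, not necessarily the real one):
# placeholder metric name; substitute whichever update counter your destination controller exposes
sum by (pod) (rate(endpoint_updates_total{namespace="linkerd"}[5m]))
An instance whose update rate drops to zero while its peers keep moving is the signature to look for.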
If you don’t have a Prometheus setup, you can approximate this by faking a time series, i.e. by snapshotting the control plane’s state a few times.
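A minimal sketch of that, assuming the linkerd CLI is available (linkerd diagnostics controller-metrics fetches metrics directly from the control plane pods):
# take a few snapshots a couple of minutes apart
linkerd diagnostics controller-metrics > snapshot-1.txt
sleep 120
linkerd diagnostics controller-metrics > snapshot-2.txt
sleep 120
linkerd diagnostics controller-metrics > snapshot-3.txt
# then compare the destination controller's update counters across snapshots;
# a counter that stops moving on one instance but not on the others is suspect
grep -i update snapshot-*.txt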
We expect that all service updates should flat-line for a single destination controller instance.
We also expect that other destination controller instances should not experience any directly correlated failures of this nature.
We’re still working to confirm our hypothesis and to ultimately reproduce this issue, but in the meantime, capturing data from incidents of this failure will be the most effective way to confirm or disprove our theory.
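One way to capture that data before restarting anything, alongside the endpoint and metrics snapshots sketched earlier (a sketch; assumes the control plane runs in the linkerd namespace):
# logs from one destination controller pod around the incident
# (repeat per pod, or select them all with -l linkerd.io/control-plane-component=destination)
kubectl -n linkerd logs deploy/linkerd-destination -c destination --timestamps --since=2h > destination.log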
Yes, we are scaling nodes up and down as well.