linkerd2: Outbound HTTP call failing intermittently
What is the issue?
Outbound HTTP calls to an application fail intermittently with a connection closed error. The source application making the failing calls is a scheduler that calls other applications on a regular schedule. After a failure, the next scheduled run works fine. Please see the linkerd-proxy logs below.
How can it be reproduced?
Linkerd Version: 2.13.4
Logs, error output, etc
[129664.685446s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39860
[129664.691331s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39872
[129664.691557s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39878
[129664.692554s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39894
[129664.695277s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39900
[129784.342535s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42724
[129784.342604s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42738
[129784.342642s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42746
[129784.342682s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42760
output of linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2023-06-25T04:48:41Z
see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
certificate will expire on 2023-06-24T04:48:45Z
see https://linkerd.io/2.13/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
certificate will expire on 2023-06-24T04:48:46Z
see https://linkerd.io/2.13/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
certificate will expire on 2023-06-24T04:48:45Z
see https://linkerd.io/2.13/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints
linkerd-version
---------------
‼ cli is up-to-date
is running version 2.13.4 but the latest stable version is 2.13.5
see https://linkerd.io/2.13/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 2.13.4 but the latest stable version is 2.13.5
see https://linkerd.io/2.13/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-5b59946658-pm2wk (stable-2.13.4)
* linkerd-identity-54dfb7648b-hhp2r (stable-2.13.4)
* linkerd-proxy-injector-5fc4795598-4h68n (stable-2.13.4)
see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints
linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
certificate will expire on 2023-06-24T04:22:23Z
see https://linkerd.io/2.13/checks/#l5d-tap-cert-not-expiring-soon for hints
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-d6665b56-7hs59 (stable-2.13.4)
* tap-6d45c9cd5c-wj2ft (stable-2.13.4)
* tap-injector-6c95fb4f97-q945w (stable-2.13.4)
* web-7c9756b6b-4bcvw (stable-2.13.4)
see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cp-version for hints
Status check results are √
Environment
- Kubernetes Version: 1.23
- Cluster Environment: EKS
- OS: Linux
- Linkerd Version: 2.13.4
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 4
- Comments: 28 (9 by maintainers)
Commits related to this issue
- Increase HTTP request queue capacity In 2.13, the default inbound and outbound HTTP request queue capacity decreased from 10,000 requests to 100 requests (in PR #2078). This change results in proxies... — committed to linkerd/linkerd2-proxy by hawkw a year ago
- Increase HTTP request queue capacity (#2449) In 2.13, the default inbound and outbound HTTP request queue capacity decreased from 10,000 requests to 100 requests (in PR #2078). This change results ... — committed to linkerd/linkerd2-proxy by hawkw a year ago
- proxy: v2.207.0 In 2.13, the default inbound and outbound HTTP request queue capacity decreased from 10,000 requests to 100 requests (in PR #2078). This change results in proxies shedding load much m... — committed to linkerd/linkerd2 by hawkw a year ago
- edge-23.8.1 (#11200) This edge release restores a proxy setting for it to shed load less aggressively while under high load, which should result in lower error rates (addressing #11055). It also re... — committed to linkerd/linkerd2 by alpeb a year ago
- Increase HTTP request queue capacity (#2449) In 2.13, the default inbound and outbound HTTP request queue capacity decreased from 10,000 requests to 100 requests (in PR #2078). This change results ... — committed to linkerd/linkerd2-proxy by hawkw a year ago
- Increase HTTP request queue capacity (#2449) In 2.13, the default inbound and outbound HTTP request queue capacity decreased from 10,000 requests to 100 requests (in PR #2078). This change results ... — committed to linkerd/linkerd2-proxy by hawkw a year ago
- proxy: v2.203.2 In 2.13, the default inbound and outbound HTTP request queue capacity decreased from 10,000 requests to 100 requests (in PR #2078). This change results in proxies shedding load much m... — committed to linkerd/linkerd2 by hawkw a year ago
- stable-2.13.6 This stable release fixes a regression introduced in stable-2.13.0 which resulted in proxies shedding load too aggressively while under moderate request load to a single service ([#1105... — committed to linkerd/linkerd2 by hawkw a year ago
- stable-2.13.6 This stable release fixes a regression introduced in stable-2.13.0 which resulted in proxies shedding load too aggressively while under moderate request load to a single service ([#1105... — committed to linkerd/linkerd2 by hawkw a year ago
- proxy: v2.203.2 In 2.13, the default inbound and outbound HTTP request queue capacity decreased from 10,000 requests to 100 requests (in PR #2078). This change results in proxies shedding load much m... — committed to linkerd/linkerd2 by hawkw a year ago
- stable-2.13.6 This stable release fixes a regression introduced in stable-2.13.0 which resulted in proxies shedding load too aggressively while under moderate request load to a single service ([#1105... — committed to linkerd/linkerd2 by hawkw a year ago
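To see why the queue-capacity change described in those commit messages matters, here is a purely illustrative toy of bounded-queue load shedding (not the proxy's real code, and the numbers are only for demonstration): at the old default of 10,000 the burst below is fully absorbed, while at 100 most of it is shed.

```python
# Toy illustration of bounded-queue load shedding: not Linkerd's actual
# implementation, just the effect described in the commit messages above.
from collections import deque

def run_burst(queue_capacity: int, burst_size: int, drain_per_tick: int, ticks: int) -> int:
    """Return how many requests are shed when a burst arrives faster than it drains."""
    queue: deque = deque()
    shed = 0
    arrivals_per_tick = burst_size // ticks
    for _ in range(ticks):
        for req in range(arrivals_per_tick):
            if len(queue) >= queue_capacity:
                shed += 1          # queue full: request is rejected ("loadshed")
            else:
                queue.append(req)
        for _ in range(min(drain_per_tick, len(queue))):
            queue.popleft()        # backend drains a fixed number per tick
    return shed

# Same burst, old vs. new default capacity (capacities taken from the commit messages).
print("capacity 10000:", run_burst(10_000, burst_size=5_000, drain_per_tick=200, ticks=10))
print("capacity   100:", run_burst(100, burst_size=5_000, drain_per_tick=200, ticks=10))
```

The real proxy's queuing and backpressure are more sophisticated than this, but restoring the larger capacity is exactly the change the commits above describe.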
I’m happy to report that the aggressive load shedding behavior has just been fixed in edge-23.8.1, released yesterday, and will be back-ported into stable-2.13.6. Now let’s focus on the destination cache staleness issue. While investigating this I found it’s possible that the outbound_http_balancer_endpoints metric has a bug, so let’s disregard that one for now. @bjoernw’s script above should help catch any discrepancies. Other metrics to track are endpoints_cache_size and endpointslices_cache_size on the Destination containers, which should match exactly the total number of endpoints in the cluster. I’m still trying to find a scenario that replicates this discrepancy. If you have debug logs for Destination under those circumstances, that’d be helpful as well.
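A rough way to make that cache-size comparison is sketched below; the destination container's metrics port (reached here via a port-forward to 9996) and what exactly the gauges count (objects vs. addresses) are assumptions and may differ by version.

```python
# Sketch: compare the destination controller's endpoint cache gauges with what
# Kubernetes itself reports. Assumes something like
#   kubectl -n linkerd port-forward deploy/linkerd-destination 9996
# is already running (the admin/metrics port is an assumption).
import json
import subprocess
import urllib.request

def scrape_gauge(metrics_text: str, name: str) -> float:
    """Sum every sample of the named gauge in a plaintext /metrics payload."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith(name):
            total += float(line.rsplit(" ", 1)[-1])
    return total

def kubectl_count(resource: str) -> int:
    """Count objects of the given resource type across all namespaces."""
    out = subprocess.run(
        ["kubectl", "get", resource, "-A", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return len(json.loads(out)["items"])

metrics = urllib.request.urlopen("http://localhost:9996/metrics").read().decode()
print("endpoints_cache_size      =", scrape_gauge(metrics, "endpoints_cache_size"))
print("endpointslices_cache_size =", scrape_gauge(metrics, "endpointslices_cache_size"))
# Whether the gauges count watcher objects or individual addresses is an
# assumption here; compare against both and see which one tracks.
print("Endpoints objects         =", kubectl_count("endpoints"))
print("EndpointSlice objects     =", kubectl_count("endpointslices"))
```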
We have experienced the same issue, and rolling back the proxy with linkerd2 stable-2.12.x worked. When observing the exposed Linkerd metrics we could see the “loadshed” error as outlined above by @jandersen-plaid.
This is easily reproduced by following the steps below.
Prerequisites
Steps to replicate proxy errors “service unavailable”
web deployment
Hey folks - thanks for the reports here! Linkerd maintainers are actively investigating multiple things raised in this thread. Thanks to @nikskiz for the repro steps. We’re using these to investigate the loadshed behavior in the proxy and determine its root cause.
As for the issues around destination controller stability - these are independent of the loadshed behavior. @alpeb will be following up to gather more information to determine what may be at play. More soon!
Similar to other posters, we have run into this issue and reported it elsewhere (see https://linkerd.buoyant.io/t/linkerd-destination-has-stale-data-in-endpoint-cache/171). @bjoernw, we also think it is a stale endpoints cache, but we found that we could recover from incidents of it by restarting the destination controllers (presumably this sends a RESET to the affected proxies, and that clears their caches?).
We could not reproduce it after ~2 weeks of trying, but our suspicion is that what happens is:
1. Some cache (distribute’s BackendCache maybe?) in the proxy fills up with balancer endpoints that are pending (whether through a cache without aggressive enough eviction or some other reason).
2. The cached endpoints stay in a pending state, and the service moves into a loadshed state after going into failfast.
3. Once in loadshed, it starts to consider alternate backends & endpoints that are in the affected cache.
4. The endpoints that are pending are then each considered to try and find one available.
5. The ready endpoints are never hit upon or evicted from the cache???
6. No ready endpoints are found, so the loadshedding // failfast never recovers.
7. Restarting the destination controller triggers a RESET of the endpoint caches in the proxies, which moves the state out of loadshed // failfast.
(A toy sketch of this hypothesized sequence follows the list.)
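To make the hypothesized sequence easier to follow, here is a toy model of it; the class, method and state names are invented for illustration, and this is not the proxy's actual balancer code.

```python
# Toy model of the sequence hypothesized above: NOT Linkerd's implementation,
# just the commenter's conjecture made concrete. A balancer whose cached
# endpoints are all "pending" goes failfast -> loadshed and never recovers
# until the cache itself is reset.
from dataclasses import dataclass, field

@dataclass
class HypotheticalBalancer:
    endpoints: dict = field(default_factory=dict)  # addr -> "ready" | "pending"
    state: str = "active"

    def dispatch(self) -> str:
        if any(s == "ready" for s in self.endpoints.values()):
            self.state = "active"
            return "request routed"
        # No ready endpoint: go failfast, then shed load on subsequent requests.
        self.state = "loadshed" if self.state in ("failfast", "loadshed") else "failfast"
        return "503 service unavailable (" + self.state + ")"

    def reset_cache(self, fresh: dict) -> None:
        """Stand-in for the destination controller restart / RESET of the caches."""
        self.endpoints = dict(fresh)
        self.state = "active"

b = HypotheticalBalancer({"10.0.0.1": "pending", "10.0.0.2": "pending"})
print(b.dispatch())   # 503 (failfast)
print(b.dispatch())   # 503 (loadshed): stuck, nothing ever makes an endpoint ready
b.reset_cache({"10.0.0.3": "ready"})
print(b.dispatch())   # request routed again
```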
Of course, this is all conjecture: without a reproduction it is really hard to prove any of this is happening or to get the necessary log info to track down which cache is failing and in what way. Since we don’t log at debug level by default, all we have are our metrics, which indicate that outbound_http_balancer_endpoints had a large number of pending endpoints during the incident, some correlation that stale IPs were being accessed around the time of the incident (from our CNI) which resolved directly after we restarted the destination controller, and a huge rise in loadshed errors in outbound_http_errors_total.
Correlation between loadshedding behavior and stale IPs (note that stale IPs were being accessed before the major errors start, which indicates that some bad endpoints in the cache were being accessed on a relatively frequent basis, but not considered for routing HTTP requests until something – a rise in latency or a few 504s – triggered failfast // loadshedding behavior):
[graph omitted]
Relationship between loadshedding errors and endpoints:
[graph omitted]
(It is worth noting that the divot in the graph is where we attempted to restart the affected inbound proxies // pods, but were unable to recover from the situation, as we did not restart the nginx pods which were sending the outbound traffic to the affected deployment. At the end, the ready endpoints spike up as we considered that it was an availability issue and provisioned more pods (to no avail), and then finally the errors resolve when we restart the destination controller.)
Hopefully someone can come up with a reproduction, but we are going to downgrade to 2.12.5 for the time being.
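For anyone who wants to look for the same correlation in their own cluster, a rough sketch against the Prometheus HTTP API follows; the Prometheus address and the label names used on these proxy metrics are assumptions and may differ by Linkerd/proxy version.

```python
# Sketch: pull the two series discussed above from Prometheus so loadshed
# errors can be eyeballed against pending balancer endpoints.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"  # e.g. a port-forward to the viz Prometheus

def instant_query(expr: str):
    """Run a PromQL instant query and return the result vector."""
    url = PROMETHEUS + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

# The label names below (error="loadshed", endpoint_state="pending", pod
# grouping) are assumptions; adjust them to what your installation exposes.
loadshed = instant_query(
    'sum by (pod) (rate(outbound_http_errors_total{error="loadshed"}[5m]))'
)
pending = instant_query(
    'sum by (pod) (outbound_http_balancer_endpoints{endpoint_state="pending"})'
)
for series in loadshed + pending:
    print(series["metric"], series["value"])
```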
Since our upgrade to 2.13.4 we’ve been experiencing the same errors in our cluster, which result in transient 503s returned to the clients. It looks like the same or a very similar problem to the one reported in https://linkerd.buoyant.io/t/503-service-unavailable-from-proxy-with-large-number-of-connections/126. It doesn’t affect any service specifically; the whole mesh is affected randomly during peak load. There are no throttling or resource constraint issues, and we’re not using service profiles, Linkerd’s circuit breaking, or dynamic routing.
The log of the affected clients is filled with information, but I don’t see an answer as to why the proxy is marking services as unavailable:
Metrics are not helpful either; according to them everything is working fine, but it isn’t. The proxy is not counting those 503s as failed requests, and they’re not visible in viz or Prometheus. The only real clue comes from transport-level metrics, where we see the same loadshed errors that are visible in the debug logs; example from the same client pod as the logs above:
Sadly I don’t have reproduction instructions, but something must have changed between 2.12 and 2.13 that causes this odd behaviour.
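One way to surface those transport-level loadshed counters for a single client pod is to dump the proxy's metrics and filter them; a sketch assuming the linkerd CLI is installed, with placeholder pod and namespace names.

```python
# Sketch: dump a client pod's proxy metrics and print only the lines that
# mention "loadshed", which is where these 503s show up even though viz and
# Prometheus look clean. Pod name and namespace are placeholders; the
# "diagnostics proxy-metrics" subcommand is assumed to exist in this CLI version.
import subprocess

NAMESPACE = "my-namespace"      # placeholder
POD = "my-client-pod-abc123"    # placeholder: the pod emitting the 503s

metrics = subprocess.run(
    ["linkerd", "diagnostics", "proxy-metrics", "-n", NAMESPACE, f"po/{POD}"],
    check=True, capture_output=True, text=True,
).stdout

for line in metrics.splitlines():
    if "loadshed" in line:
        print(line)
```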
Thanks for the update @tgolly @vkperumal. This is still on our radar.
Have upgraded to 2.13.6 to try to resolve the endpoint failures, and I’m still seeing tons of these “Failed to get node topology zone for node” errors on the destination controller, like @bjoernw was seeing. This is under load.
This is a huge problem. The destination controller is broken on 2.13.4. Occasionally under load (I guess?) it will stop responding to diagnostics commands and also won’t serve up new IPs to proxies, leading to a cluster-wide failure until the controller is restarted. We’re running 300+ meshed pods with 5 pods per control plane component, so maybe this just shows up at this scale?
I have a job that runs every minute and diffs the IPs for a handful of endpoints per destination controller pod and per endpoint slice. During the outage this job was the canary in the coal mine because it stopped logging, most likely because the diagnostics command hung. The destination controllers stopped logging anything at all (they usually complain about “Failed to get node topology zone for node” on a regular basis). I’m going to downgrade to 2.12 as soon as possible.
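A minimal version of that kind of diff job, for a single service, might look like the sketch below; the service name, namespace, and port are placeholders, and the output parsing is deliberately loose rather than tied to an exact column layout.

```python
# Sketch of the "diff the IPs" job described above, for one service. Compares
# the IPs the destination controller returns (via linkerd diagnostics
# endpoints) with the IPs in the service's Endpoints object.
import json
import re
import subprocess

SERVICE, NAMESPACE, PORT = "my-svc", "my-namespace", 8080  # placeholders
AUTHORITY = f"{SERVICE}.{NAMESPACE}.svc.cluster.local:{PORT}"

def run(cmd: list) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# IPs as seen by the destination controller.
diag = run(["linkerd", "diagnostics", "endpoints", AUTHORITY])
destination_ips = set(re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", diag))

# IPs as seen by Kubernetes for the same service.
ep = json.loads(run(["kubectl", "get", "endpoints", SERVICE, "-n", NAMESPACE, "-o", "json"]))
k8s_ips = {a["ip"] for s in ep.get("subsets", []) for a in s.get("addresses", [])}

stale = destination_ips - k8s_ips     # served by Linkerd but gone from Kubernetes
missing = k8s_ips - destination_ips   # in Kubernetes but not served by Linkerd
print("stale:", sorted(stale) or "none")
print("missing:", sorted(missing) or "none")
```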
@jandersen-plaid that’s from the endpoint controller. In the past I had suspected that throttling at the API server level was causing the endpoints API itself to be out of sync with the current set of Pods, hence leading the Linkerd cache to be out of sync, but it seems like the problem lies between endpoints and Linkerd. I’m running a version of that script above in the cluster now, with automatic restarts of the destination controller and extensive logging. Will report back if I find anything interesting.
I’m having this same issue on a cluster with several hundred meshed pods. I’m on 2.13.4. I worry that it’s an out-of-date endpoints cache issue that is causing the whole service to be deemed unavailable. The only way to recover from this issue is to restart all the proxies that are logging this issue.