linkerd2: Outbound HTTP call failing intermittently

What is the issue?

Outbound HTTP calls to an application fail intermittently with a "connection closed" error. The source application from which the HTTP calls fail is a scheduler that makes regular calls to other applications on a schedule. After a failure, the next scheduled run works fine. Please see the linkerd-proxy logs below.

How can it be reproduced?

Linkerd Version: 2.13.4

Logs, error output, etc

[129664.685446s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39860
[129664.691331s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39872
[129664.691557s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39878
[129664.692554s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39894
[129664.695277s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39900
[129784.342535s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42724
[129784.342604s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42738
[129784.342642s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42746
[129784.342682s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42760

output of linkerd check -o short

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2023-06-25T04:48:41Z
    see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2023-06-24T04:48:45Z
    see https://linkerd.io/2.13/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2023-06-24T04:48:46Z
    see https://linkerd.io/2.13/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
    certificate will expire on 2023-06-24T04:48:45Z
    see https://linkerd.io/2.13/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.13.4 but the latest stable version is 2.13.5
    see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.13.4 but the latest stable version is 2.13.5
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-5b59946658-pm2wk (stable-2.13.4)
	* linkerd-identity-54dfb7648b-hhp2r (stable-2.13.4)
	* linkerd-proxy-injector-5fc4795598-4h68n (stable-2.13.4)
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints

linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
    certificate will expire on 2023-06-24T04:22:23Z
    see https://linkerd.io/2.13/checks/#l5d-tap-cert-not-expiring-soon for hints
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-d6665b56-7hs59 (stable-2.13.4)
	* tap-6d45c9cd5c-wj2ft (stable-2.13.4)
	* tap-injector-6c95fb4f97-q945w (stable-2.13.4)
	* web-7c9756b6b-4bcvw (stable-2.13.4)
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cp-version for hints

Status check results are √

Environment

  • Kubernetes Version: 1.23
  • Cluster Environment: EKS
  • OS: Linux
  • Linkerd Version: 2.13.4

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 4
  • Comments: 28 (9 by maintainers)

Most upvoted comments

I’m happy to report that the aggressive load shedding behavior has just been fixed in edge-23.8.1, released yesterday, and will be back-ported into stable-2.13.6. Now let’s focus on the destination cache staleness issue. While investigating this I found it’s possible that the outbound_http_balancer_endpoints metric has a bug, so let’s disregard that one for now. @bjoernw’s script above should help catch any discrepancies. Other things to track are the endpoints_cache_size and endpointslices_cache_size gauges on the Destination containers, which should match exactly the total number of endpoints in the cluster. I’m still trying to find a scenario that replicates this discrepancy. If you have debug logs for Destination under those circumstances, that would be helpful as well.
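For reference, a rough way to compare those gauges against the cluster state might look like the following (a sketch, assuming the gauge names above are exposed through the control-plane metrics and that your CLI version has linkerd diagnostics controller-metrics; the jq expression is only an approximate cluster-side count):

# pull the destination cache gauges from the control plane
linkerd diagnostics controller-metrics | grep -E 'endpoints_cache_size|endpointslices_cache_size'

# rough count of endpoints across all EndpointSlices for comparison
kubectl get endpointslices -A -o json | jq '[.items[] | (.endpoints // []) | length] | add'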

We have experienced the same issue, and rolling the proxy back to linkerd2 stable-2.12.x worked. When observing the exposed linkerd metrics we could see the “loadshed” error outlined above by @jandersen-plaid.
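If you want to try the same mitigation without rolling back the whole control plane, the data-plane proxy version can be pinned per workload (a sketch, assuming the config.linkerd.io/proxy-version annotation is honored by your proxy injector; NAMESPACE, DEPLOYMENT, and the exact 2.12.x tag are placeholders):

# set the proxy-version annotation on the pod template so the injector uses it on the next rollout
kubectl -n NAMESPACE patch deploy DEPLOYMENT --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-version":"stable-2.12.5"}}}}}'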

This is easily reproduced by following the steps below

Prerequisites

  • kind, kubectl, the linkerd CLI, and k6 installed locally

Steps to replicate proxy errors “service unavailable”

  1. Create the k6 test
cat <<EOF >> k6s-test.js
// import necessary modules
import { check } from "k6";
import http from "k6/http";

// define configuration
export const options = {
    // define thresholds
    thresholds: {
      http_req_failed: ['rate<0.01'], // http errors should be less than 1%
      http_req_duration: ['p(99)<1000'], // 99% of requests should be below 1s
    },
    // define scenarios
    scenarios: {
      // arbitrary name of scenario
      average_load: {
        executor: "ramping-vus",
        stages: [
          { duration: "10s", target: 500 },
        ],
      },
    }
  };

export default function () {
  // define URL and request body
  const url = "http://localhost:8080/api/vote?choice=:rabbit:";
  const payload = JSON.stringify({
    choice: ":rabbit:"
  });
  const params = {
    headers: {
      "Content-Type": "application/json",
    },
  };

  // send a post request and save response as a variable
  const res = http.post(url, payload, params);

  // check that response is 200
  check(res, {
    "response code was 200": (res) => res.status == 200,
  });
}
EOF
  2. Install a k8s cluster using kind
kind create cluster
  3. Install the Linkerd2 CRDs
linkerd install --crds | kubectl apply -f -
  4. Install Linkerd2
linkerd install | kubectl apply -f -
  5. Install viz
linkerd viz install | kubectl apply -f -
  6. Validate the installation
linkerd check
  7. Install emojivoto
linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -
  8. Check emojivoto status
kubectl -n emojivoto rollout status deploy/web
  9. In a separate terminal, port-forward to the emojivoto web app
kubectl port-forward deployment/web 8080 -n emojivoto
  10. In a third terminal, follow the proxy logs on the emojivoto web deployment (see the filtering sketch after these steps)
kubectl logs -f -n emojivoto deployment/web -c linkerd-proxy
  11. Run the k6 test and observe the results
k6 run k6s-test.js
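While the test runs, the proxy log stream from step 10 can be filtered for the failure modes discussed in this thread (a sketch, assuming the log phrasing shown in the reports above):

# watch for failfast/loadshed/service-unavailable messages on the emojivoto web proxy
kubectl logs -f -n emojivoto deployment/web -c linkerd-proxy | grep -Ei 'loadshed|failfast|service unavailable'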

Hey folks - thanks for the reports here! Linkerd maintainers are actively investigating multiple things raised in this thread. Thanks to @nikskiz for the repro steps. We’re using these to investigate the loadshed behavior in the proxy and determine its root cause.

For the issues around destination controller stability - this is independent from the loadshed behavior. @alpeb will be following up to gather more information to determine what may be at play. More soon!

I worry that it’s an out-of-date endpoints cache issue

Similar to other posters, we have run into this issue and reported it elsewhere (see https://linkerd.buoyant.io/t/linkerd-destination-has-stale-data-in-endpoint-cache/171). @bjoernw, we also think it is a stale endpoints cache, but we found that we could recover from incidents of it by restarting the destination controllers (presumably this sends a RESET to the affected proxies, which clears their caches?).
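For anyone else hitting this, the recovery described above is just a rollout restart of the destination controller (a minimal sketch, assuming the default linkerd namespace and the linkerd-destination deployment name shown in the check output above):

# restart the destination controller; affected proxies should re-resolve their endpoints afterwards
kubectl -n linkerd rollout restart deploy/linkerd-destination
kubectl -n linkerd rollout status deploy/linkerd-destination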

We could not reproduce it after ~2 weeks of trying, but our suspicion is that what happens is:

  1. An endpoint or discovery cache (distribute’s BackendCache, maybe?) in the proxy fills up with balancer endpoints that are pending (whether through a cache without aggressive enough eviction or for some other reason).
  2. Some failure causes the proxy to find all endpoints in a pending state, and the service moves into a loadshed state after going into failfast.
  3. When the proxy goes into loadshed it starts to consider alternate backends and endpoints that are in the affected cache.
  4. Those balancer endpoints that are pending are then each considered in an attempt to find an available one.
  5. Somehow the appropriate (ready) endpoints are never hit upon, or have been evicted from the cache.
  6. This causes the affected proxies to never be able to route requests to ready endpoints, so the loadshedding/failfast never recovers.
  7. Removing all affected proxies forces the endpoints to be refreshed, or restarting the destination control plane forces a RESET of the endpoint caches in the proxies, which moves the state out of loadshed/failfast.

Of course, this is all conjecture: without a reproduction it is really hard to prove any of this is happening, or to get the necessary log info to track down which cache is failing and in what way. Since we don’t log at debug level by default, all we have are our metrics, which indicate that outbound_http_balancer_endpoints had a large number of pending endpoints during the incident; some correlation (from our CNI) that stale IPs were being accessed around the time of the incident, which resolved directly after we restarted the destination controller; and a huge rise in loadshed errors from outbound_http_errors_total.
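For anyone trying to spot the same pattern, those two metric families can be dumped straight from a suspect pod’s proxy (a sketch, assuming linkerd diagnostics proxy-metrics is available in your CLI version; NS and POD are placeholders):

# dump balancer endpoint states and HTTP error counters from one affected pod's proxy
linkerd diagnostics proxy-metrics -n NS po/POD | grep -E 'outbound_http_balancer_endpoints|outbound_http_errors_total'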

Correlation between loadshedding behavior and stale IPs (note that stale IPs were being accessed before the major errors started, which indicates that some bad endpoints in the cache were being accessed on a relatively frequent basis, but not considered for routing HTTP requests until something – a rise in latency or a few 504s – triggered the failfast/loadshedding behavior): [image]

Relationship between loadshedding errors and endpoints: [image]

(It is worth noting that the divot in the graph is where we attempted to restart the affected inbound proxies/pods, but we were unable to recover from the situation as we did not restart the nginx pods which were sending the outbound traffic to the affected deployment. At the end, the ready endpoints spike up as we thought it was an availability issue and provisioned more pods (to no avail), and then finally the errors resolve when we restarted the destination controller.)

Hopefully someone can come up with a reproduction, but we are going to downgrade to 2.12.5 for the time being.

Since our upgrade to 2.13.4 we’ve been experiencing the same errors in our cluster, which result in transient 503s returned to the clients. It looks like the same or a very similar problem to the one reported in https://linkerd.buoyant.io/t/503-service-unavailable-from-proxy-with-large-number-of-connections/126. It doesn’t affect any service specifically; the whole mesh is affected randomly during peak load. There’s no throttling or resource-constraint issue, and we’re not using service profiles, Linkerd’s circuit breaking, or dynamic routing.

[image]

The logs of the affected clients are filled with information, but I don’t see an answer as to why the proxy is marking services as unavailable:

2023-06-28T10:52:48Z	[ 94126.531685s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36056}:proxy{addr=10.100.109.178:80}:http: linkerd_stack::failfast: Service has become unavailable
2023-06-28T10:52:48Z	[ 94126.531675s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z	[ 94126.531668s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "3081", "content-type": "application/json", "traceparent": "00-083c196cf720aab5452406dfa155fda3-19bc1b8cdb26cebd-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z	[ 94126.531665s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z	[ 94126.531657s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17df00, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 43, rx_fields: "..." } } } })
2023-06-28T10:52:48Z	[ 94126.531641s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z	[ 94126.531634s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "22912", "content-type": "application/json", "traceparent": "00-23856363099ba7654f627b8ab7e2905a-9328a2c69cd37c13-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z	[ 94126.531630s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z	[ 94126.531620s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17e400, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 23, rx_fields: "..." } } } })
2023-06-28T10:52:48Z	[ 94126.531598s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z	[ 94126.531564s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "94320", "content-type": "application/json", "traceparent": "00-a638ca610eaefb86a771ebaaa411f072-9b465174c9c12082-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z	[ 94126.531559s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z	[ 94126.531541s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17da00, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 24, rx_fields: "..." } } } })
2023-06-28T10:52:48Z	[ 94126.531466s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}: linkerd_app_core::serve: Connection closed
2023-06-28T10:52:48Z	[ 94126.531444s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
2023-06-28T10:52:48Z	[ 94126.531357s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:41748}: linkerd_app_core::serve: Connection closed
2023-06-28T10:52:48Z	[ 94126.531314s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:41748}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::server: The stack is tearing down the connection
2023-06-28T10:52:48Z	[ 94126.531265s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35784}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error
2023-06-28T10:52:48Z	[ 94126.531237s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35776}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: sending EOS to classify
2023-06-28T10:52:48Z	[ 94126.531232s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35776}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: dropping ResponseBody
2023-06-28T10:52:48Z	[ 94126.531214s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error
2023-06-28T10:52:48Z	[ 94126.531173s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35720}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: sending EOS to classify
2023-06-28T10:52:48Z	[ 94126.531165s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35720}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: dropping ResponseBody
2023-06-28T10:52:48Z	[ 94126.531148s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35766}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error
2023-06-28T10:52:48Z	[ 94126.531121s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_stack::failfast: Service has become unavailable
2023-06-28T10:52:48Z	[ 94126.531109s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z	[ 94126.531105s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "52431", "content-type": "application/json", "traceparent": "00-8c14a06804f62ea8bb6f352ee3f4f0e4-4acb6b9ebed60214-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z	[ 94126.531102s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z	[ 94126.530991s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35766}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17e400, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 22, rx_fields: "..." } } } })
2023-06-28T10:52:48Z	[ 94126.530792s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_app_core::errors::respond: Handling error on HTTP connection status=503 Service Unavailable version=HTTP/1.1 close=true
2023-06-28T10:52:48Z	[ 94126.530791s] DEBUG ThreadId(02) outbound: linkerd_app_core::serve: Connection closed
2023-06-28T10:52:48Z	[ 94126.530785s]  INFO ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http:rescue{client.addr=172.16.85.145:52548}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.100.109.178:80: service unavailable error.sources=[service unavailable]
2023-06-28T10:52:48Z	[ 94126.530775s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_stack::loadshed: Service shedding load
2023-06-28T10:52:48Z	[ 94126.530767s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_stack::loadshed: Service has become unavailable

Metrics are not helpful either: according to them everything is working fine, but it isn’t. The proxy is not counting those 503s as failed requests, so they’re not visible in viz or Prometheus. The only real clue comes from the transport-level metrics, where we see the same loadshed errors that are visible in the debug logs. Example from the same client pod as the logs:

[image]

Sadly I don’t have reproduction instructions, but something must have changed between 2.12 and 2.13 that causes this odd behaviour.

Thanks for the update @tgolly @vkperumal. This is still on our radar.

We have upgraded to 2.13.6 to try to resolve the endpoint failures, and I’m still seeing tons of these “Failed to get node topology zone for node” errors on the destination controller, like @bjoernw was seeing. This is under load.

2023-08-23T16:15:28+10:00 time="2023-08-23T06:15:28Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:49300" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:16:00+10:00 time="2023-08-23T06:16:00Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:60264" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:16:33+10:00 time="2023-08-23T06:16:33Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:40158" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:17:05+10:00 time="2023-08-23T06:17:05Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:53556" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:17:38+10:00 time="2023-08-23T06:17:38Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:36490" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:18:10+10:00 time="2023-08-23T06:18:10Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:60976" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:18:43+10:00 time="2023-08-23T06:18:43Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:34732" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:19:16+10:00 time="2023-08-23T06:19:16Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:38158" service="service-xxx.ns.svc.cluster.local:8080"

This is a huge problem. The destination controller is broken on 2.13.4. Occasionally under load (I guess?) it will stop responding to diagnostics commands and also won’t serve up new IPs to proxies, leading to a cluster-wide failure until the controller is restarted. We’re running 300+ meshed pods with 5 pods per control-plane component, so maybe this only shows up at this scale?

I have a job that runs every minute and diffs the IPs for a handful of endpoints per destination controller pod and per endpoint slice. During the outage this job was the canary in the coal mine, because it stopped logging, most likely because the diagnostics command hung. The destination controllers stopped logging anything at all (they usually complain about “Failed to get node topology zone for node” on a regular basis).
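For reference, a minimal version of that diff would be something like the following (a sketch, assuming the linkerd diagnostics endpoints command, the emojivoto web service as a hypothetical example authority, and that the IP is the second column of its output):

SVC=web.emojivoto.svc.cluster.local:80   # hypothetical example authority

# IPs the destination controller is currently handing out to proxies
linkerd diagnostics endpoints "$SVC" | awk 'NR>1 {print $2}' | sort > destination-ips.txt

# IPs Kubernetes reports for the same service via EndpointSlices
kubectl -n emojivoto get endpointslices -l kubernetes.io/service-name=web \
  -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\n"}{end}' | sort > k8s-ips.txt

diff destination-ips.txt k8s-ips.txt && echo "caches agree" || echo "possible stale endpoints"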

I’m going to downgrade to 2.12 as soon as possible.

@jandersen-plaid that’s from the endpoint controller. In the past I suspected that throttling at the API server level was causing the endpoints API itself to be out of sync with the current set of Pods, hence leading the Linkerd cache to be out of sync, but it seems like the problem lies between the endpoints and Linkerd. I’m running a version of the script above in the cluster now, with automatic restarts of the destination controller and extensive logging. Will report back if I find anything interesting.

I’m having this same issue on a cluster with several hundred meshed pods. I’m on 2.13.4. I worry that it’s an out-of-date endpoints cache issue that is causing the whole service to be deemed unavailable. The only way to recover from this issue is to restart all the proxies that are logging this issue.