linkerd2: "HTTP Logical service in fail-fast" logs in linkerd-proxy
Bug Report
In our system we have 400+ lightweight pods that make gRPC requests to 6 powerful pods (running on preemptible nodes) at 3-4 RPS each. (All 6 pods have an Nvidia GPU.) After 1-2 minutes of a successful deployment and automatic Linkerd proxy injection with
kubectl -n the_namespace get deploy/the_deployment -o yaml | linkerd inject - | kubectl apply -f -
we are seeing hundreds of Failed to proxy request: HTTP Logical service in fail-fast errors. I checked the source code of the linkerd-proxy repository and found the place where this message is printed, but I couldn't understand the error or the message. What is an "HTTP Logical service"? Below you can find a snippet of the logs.
WARN ThreadId(01) inbound:accept{peer.addr=10.37.125.7:40084 target.addr=10.36.216.4:50053}: linkerd_timeout::failfast: HTTP Logical entering failfast after 1s
WARN ThreadId(01) inbound:accept{peer.addr=10.37.126.8:46412 target.addr=10.36.216.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
WARN ThreadId(01) inbound:accept{peer.addr=10.37.61.7:37474 target.addr=10.36.216.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
WARN ThreadId(01) inbound:accept{peer.addr=10.37.72.4:52626 target.addr=10.36.216.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
... (tens of them in same second, hundreds in minute)
I waited 10-15 minutes, but the system didn't recover, so I reverted and uninjected the proxy from the pods.
In the linkerd-proxy source code I think there was a comment saying "the inner service is not ready", but our readiness/liveness probes were passing. I thought this might be about the dispatch timeout and wanted to increase it, but it isn't configurable (config.rs). I also thought it might be related to the high load, so I installed with HA mode; nothing changed. I increased both the CPU and memory request/limits of the proxy, but again nothing changed.
After all these actions we are still getting fail-fast errors. What is the fail-fast condition? What is "HTTP Logical"? 🧐 How can we resolve this?
linkerd check output:
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
√ multiple replicas of control plane pods
Status check results are √
Environment
- Kubernetes: 1.16.15-gke.6000
- Linkerd Version: stable-2.9.2 with HA
- Linkerd Proxy Version: edge-21.1.3
Additional context
Let me know if you need further information.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 29 (15 by maintainers)
@adleong I use stable-2.10.0 and I see the same logs on all linkerd-proxy containers
Our issue was caused by https://github.com/linkerd/linkerd2/issues/5684. Might be worth checking the latency of linkerd-destination when these errors happen to see whether the root cause is the same.
@olix0r @hawkw We have updated max_concurrent_streams from 10 to 100. No more fail-fast errors. 🎉 Thanks for the tips and help!
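For reference, here is a minimal sketch of raising that limit, assuming a Go gRPC server and grpc-go's MaxConcurrentStreams server option; the port and service registration are illustrative, not taken from the commenter's setup:

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":50053") // illustrative port
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	// With a low stream limit (e.g. 10), a proxy that multiplexes many client
	// requests onto a single HTTP/2 connection can exhaust the quota; queued
	// requests then back up and the linkerd proxy may enter fail-fast.
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(100))

	// pb.RegisterYourServiceServer(srv, &yourServer{}) // hypothetical registration

	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}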
Coming here from https://github.com/linkerd/linkerd2/issues/5183. We were experiencing fail-fast errors on the server-side proxy when max-concurrent-streams was set. Now we are seeing the Failed to proxy request: HTTP Logical service in fail-fast errors on another service which definitely doesn't have this configuration. ^^
A bit of context on the service in question:
- Liveness probes and Prometheus metrics calls fail with connection refused at startup because the server is not actually listening yet. These are eventually served by a normal HTTP server listening on 8082.
- The gRPC service is served on two different ports: 8081 (no TLS) and 8084 (TLS using self-signed certs); see the sketch after this comment. The reason for this is the need for certificates when connecting a gcloud ingress to a gRPC service. The second port (8084) is not used at all at the moment, but it is exposed in the deployment.
I used the configuration @olix0r suggested here. The proxy image we are using is ghcr.io/olix0r/l2-proxy:main.b8c2aa71. I put the logs from the startup of the service and after it fully booted up in this gist. One log message caught my eye:
linkerd_stack::switch_ready: Switching to secondary service after=500ms
What does that mean? I couldn't find anything in the repo.
We currently have a version of this service running without any issues; it only has one port serving gRPC (8081, no TLS) and runs the proxy image ghcr.io/linkerd/proxy:stable-2.9.0.
Hope this helps.
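As an illustration of the two-port layout described in the comment above, here is a minimal sketch, assuming a Go gRPC server (grpc-go): the same hypothetical service exposed on a plaintext listener on 8081 and a TLS listener on 8084 with a self-signed certificate. The certificate paths and service names are assumptions.

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// serveOn blocks serving srv on the given address, logging fatal errors.
func serveOn(srv *grpc.Server, addr string) {
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		log.Fatalf("listen %s: %v", addr, err)
	}
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve %s: %v", addr, err)
	}
}

func main() {
	// Plaintext gRPC for in-cluster (meshed) traffic on 8081.
	plain := grpc.NewServer()
	// pb.RegisterYourServiceServer(plain, &yourServer{}) // hypothetical

	// TLS gRPC for the gcloud ingress on 8084, using a self-signed certificate.
	creds, err := credentials.NewServerTLSFromFile("tls.crt", "tls.key") // assumed paths
	if err != nil {
		log.Fatalf("load TLS credentials: %v", err)
	}
	secure := grpc.NewServer(grpc.Creds(creds))
	// pb.RegisterYourServiceServer(secure, &yourServer{}) // hypothetical

	go serveOn(plain, ":8081")
	serveOn(secure, ":8084")
}

Because transport credentials are a server option in grpc-go, the plaintext and TLS ports need two separate grpc.Server instances sharing the same service implementation.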