linkerd2: "HTTP Logical service in fail-fast" logs in linkerd-proxy
Bug Report
In our system we have 400+ lightweight pods that make gRPC requests to 6 powerful pods (running on preemptible nodes) at 3-4 RPS each. (All 6 pods have an Nvidia GPU.) After 1-2 minutes of a successful deployment and automatic Linkerd proxy injection with
kubectl -n the_namespace get deploy/the_deployment -o yaml | linkerd inject - | kubectl apply -f -
we are seeing hundreds of Failed to proxy request: HTTP Logical service in fail-fast errors. I checked the source code of the linkerd-proxy repository and found the place where this message is printed, but I couldn't understand the error or the message. What is an "HTTP Logical service"? Below you can find a snippet of the logs.
WARN ThreadId(01) inbound:accept{peer.addr=10.37.125.7:40084 target.addr=10.36.216.4:50053}: linkerd_timeout::failfast: HTTP Logical entering failfast after 1s
WARN ThreadId(01) inbound:accept{peer.addr=10.37.126.8:46412 target.addr=10.36.216.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
WARN ThreadId(01) inbound:accept{peer.addr=10.37.61.7:37474 target.addr=10.36.216.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
WARN ThreadId(01) inbound:accept{peer.addr=10.37.72.4:52626 target.addr=10.36.216.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
... (tens of them in same second, hundreds in minute)
I waited 10-15 minutes, but the system didn't recover, so I reverted and uninjected the proxy from the pods.
In the linkerd-proxy source code I think there was a comment saying "the inner service is not ready", but our readiness/liveness probes were passing. I thought this might be about the dispatch timeout and wanted to increase it, but it isn't configurable (config.rs). I also thought it might be related to the high load, so I installed with HA mode; nothing changed. I increased both the CPU and memory request/limits of the proxy, but again nothing changed.
After all these actions we are still getting fail-fast errors. What is the fail-fast condition? What is "HTTP Logical"? 🧐 How can we resolve this?
linkerd check output:
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
√ multiple replicas of control plane pods
Status check results are √
Environment
- Kubernetes: 1.16.15-gke.6000
- Linkerd Version: stable-2.9.2 with HA
- Linkerd Proxy Version: edge-21.1.3
Additional context
Let me know if you need further information.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 29 (15 by maintainers)
@adleong I use stable-2.10.0 and I see the same logs on all linkerd-proxy containers
Our issue was caused by https://github.com/linkerd/linkerd2/issues/5684. Might be worth checking the latency of linkerd-destination when these errors happen to see whether the root cause is the same.
@olix0r @hawkw We have updated max_concurrent_streams from 10 to 100. No more fail-fast errors. 🎉 Thanks for the tips and help!
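For reference, here is a minimal sketch of raising that limit, assuming a Go gRPC server and grpc-go's MaxConcurrentStreams server option; the port and service registration are illustrative, not taken from the commenter's setup:

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":50053") // illustrative port
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	// With a low stream limit (e.g. 10), a proxy that multiplexes many client
	// requests onto a single HTTP/2 connection can exhaust the quota; queued
	// requests then back up and the linkerd proxy may enter fail-fast.
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(100))

	// pb.RegisterYourServiceServer(srv, &yourServer{}) // hypothetical registration

	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}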
Coming here from https://github.com/linkerd/linkerd2/issues/5183. We were experiencing fail-fast errors on the server-side proxy when max-concurrent-streams was set. Now we are seeing the Failed to proxy request: HTTP Logical service in fail-fast errors on another service which definitely doesn't have this configuration. ^^
A bit of context on the service in question:
- Liveness probes and Prometheus metrics calls fail with connection refused at startup because the server is not actually listening yet. These are eventually served by a normal HTTP server listening on 8082.
- The gRPC service is served on two different ports: 8081 (no TLS) and 8084 (TLS using self-signed certs); see the sketch after this comment. The reason for this is the need for certificates when connecting a gcloud ingress to a gRPC service. The second port (8084) is not used at all at the moment, but it is exposed in the deployment.
I used the configuration @olix0r suggested here. The proxy image we are using is ghcr.io/olix0r/l2-proxy:main.b8c2aa71. I put the logs from the startup of the service and after it fully booted up in this gist. One log message caught my eye:
linkerd_stack::switch_ready: Switching to secondary service after=500ms
What does that mean? I couldn't find anything in the repo.
We currently have a version of this service running without any issues; it only has one port serving gRPC (8081, no TLS) and runs the proxy image ghcr.io/linkerd/proxy:stable-2.9.0.
Hope this helps.
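As an illustration of the two-port layout described in the comment above, here is a minimal sketch, assuming a Go gRPC server (grpc-go): the same hypothetical service exposed on a plaintext listener on 8081 and a TLS listener on 8084 with a self-signed certificate. The certificate paths and service names are assumptions.

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// serveOn blocks serving srv on the given address, logging fatal errors.
func serveOn(srv *grpc.Server, addr string) {
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		log.Fatalf("listen %s: %v", addr, err)
	}
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve %s: %v", addr, err)
	}
}

func main() {
	// Plaintext gRPC for in-cluster (meshed) traffic on 8081.
	plain := grpc.NewServer()
	// pb.RegisterYourServiceServer(plain, &yourServer{}) // hypothetical

	// TLS gRPC for the gcloud ingress on 8084, using a self-signed certificate.
	creds, err := credentials.NewServerTLSFromFile("tls.crt", "tls.key") // assumed paths
	if err != nil {
		log.Fatalf("load TLS credentials: %v", err)
	}
	secure := grpc.NewServer(grpc.Creds(creds))
	// pb.RegisterYourServiceServer(secure, &yourServer{}) // hypothetical

	go serveOn(plain, ":8081")
	serveOn(secure, ":8084")
}

Because transport credentials are a server option in grpc-go, the plaintext and TLS ports need two separate grpc.Server instances sharing the same service implementation.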