linkerd2: gRPC .NET client throws "proxy max-concurrency exhausted" error
Bug Report
For the past few days I have been receiving this error: Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail=“proxy max-concurrency exhausted”)
What is the issue?
A .NET client fails to connect to a Python gRPC server on Kubernetes, with Linkerd as the service mesh.
How can it be reproduced?
I can’t reproduce this myself; it has only happened in the production environment over the last few days.
Logs, error output, etc
linkerd check output
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match
linkerd-addons
--------------
√ 'linkerd-config-addons' config map exists
linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running
Status check results are √
Environment
- Kubernetes Version: 1.15.12
- Cluster Environment: AKS
- Host OS: Linux
- Linkerd version: 2.8.1 (stable)
Commits related to this issue
- Ensure services in failfast can become ready When a Service is in failfast, the inner service is only polled as new requests are processed. This means it's theoretically possible for certain service ... — committed to linkerd/linkerd2-proxy by olix0r 3 years ago
- Ensure services in failfast can become ready (#858) When a Service is in failfast, the inner service is only polled as new requests are processed. This means it's theoretically possible for certain... — committed to linkerd/linkerd2-proxy by olix0r 3 years ago
While debugging this today, we noticed something surprising: this error message is completely misleading. The underlying error doesn’t actually indicate that the service is at capacity (so we’ve been trying to debug a very different situation). Instead, this error simply signals that the proxy is in failfast. A service may enter failfast whenever it is unable to process requests for some amount of time (3s on the outbound side, 1s on the inbound side). A load balancer, for instance, may have no endpoints; or, when no balancer is present, an individual endpoint may be down or otherwise not processing requests. The failfast mechanism is primarily intended to prevent the proxy from buffering requests indefinitely. If the service has been stuck in an unavailable state, we start serving error responses immediately, rather than waiting for a full timeout before the request fails.
Recently (since 2.9.x), we’ve updated the proxy to include more descriptive failfast error messages, to indicate which layer of the proxy is in failfast. I’ve opened a PR to replace “max concurrency exhausted” with this more descriptive error message (https://github.com/linkerd/linkerd2-proxy/pull/847); and we’ll revisit the data we have in light of this revelation.
This is a high priority for us to address before stable-2.10.
I’m still occasionally seeing max-concurrency errors on 2.9.1. It seems to happen when k8s removes pods which are serving active requests.
We’ve merged linkerd/linkerd2-proxy#847 and linkerd/linkerd2-proxy#848 to main to help improve failfast diagnostics. I’ve built and pushed a container image for the proxy that includes these changes. It can be used with control planes from 2.9.x or recent edge releases.
https://linkerd.io/2/features/protocol-detection/ – we ship with a standard set of skip ports that covers many common cases, but clearly we don’t catch all of them.
Correct, tracing doesn’t really work with arbitrary TCP protocols.
As for mTLS, the opaque ports feature I mentioned is intended to allow the proxy to transport arbitrary protocols over mTLS. This currently only works with resources that are running in your cluster (so if you’re using RDS from your cloud provider, that won’t work here). It could potentially work for your Cassandra use case, though.
Linkerd adds the most value for HTTP. However, in 2.9 we started supporting more features (mTLS, traffic split, etc) for non-HTTP traffic. In 2.10, we’ll introduce the new opaque ports feature to broaden this support to non-detectable protocols; and we’ll also start supporting non-HTTP traffic in multicluster configurations.
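For readers finding this later: the opaque ports feature described above shipped in 2.10 and is configured with an annotation on the destination workload, Service, or namespace. A minimal sketch, assuming an in-cluster Cassandra Service (the name and port are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: cassandra
  annotations:
    # Mark 9042 as opaque: the proxy skips protocol detection on this port
    # but still carries the connection over mTLS between meshed pods.
    config.linkerd.io/opaque-ports: "9042"
spec:
  selector:
    app: cassandra
  ports:
    - name: cql
      port: 9042
      targetPort: 9042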
I found some time to play with marlin. I updated its linkerd-proxy to skip outbound port 9042 for Cassandra, and also updated the linkerd-proxy on Cassandra to skip inbound on 9042. Now marlin is starting successfully.
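In case it helps others hitting the same thing, the change described above corresponds to Linkerd’s skip-port annotations on each workload’s pod template. A sketch, assuming marlin is a Deployment and Cassandra is a StatefulSet (names and images are illustrative); note that skipped ports bypass the proxy entirely, so traffic on them gets no mTLS or metrics:

# marlin: bypass the proxy for outbound CQL traffic to Cassandra (port 9042).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: marlin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: marlin
  template:
    metadata:
      labels:
        app: marlin
      annotations:
        linkerd.io/inject: enabled
        config.linkerd.io/skip-outbound-ports: "9042"
    spec:
      containers:
        - name: marlin
          image: registry.example.com/marlin:latest  # placeholder image
---
# cassandra: bypass the proxy for inbound connections on the same port.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
      annotations:
        linkerd.io/inject: enabled
        config.linkerd.io/skip-inbound-ports: "9042"
    spec:
      containers:
        - name: cassandra
          image: cassandra:3.11
          ports:
            - containerPort: 9042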
Is there any documentation with some general guidance on which ports should skip the proxy?
Initially, I was thinking it made sense to have the proxy handle all traffic to gain the benefits of auto mTLS and span-id injection for tracing. But now I’m realizing that a lot of traffic (like databases) already has built-in TLS solutions and doesn’t use http/http2 protocols, so it isn’t compatible with span-id injection.
Should linkerd-proxy connections be limited to http/http2 traffic?
I’ve verified that egret doesn’t start listening on any ports until after verifying the database connection. Instructing the linkerd-proxy to skip outbound port 5432 allows the service to start up successfully. I might not have a chance to look at marlin until tomorrow.
For us, at least part of the issue was actually in our code, where we had initialised our grpc server with a setting where n was 1. 🤦 So, no wonder really that after a while the linkerd proxy can’t establish connections anymore. Removing this server option solved the issue at hand, and the proxy has been working smoothly ever since. We saw another service where this was happening yesterday which didn’t have this grpc server setting, though. We will continue to investigate this next week.
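The comment above does not name the exact option, but one setting with precisely this effect is gRPC’s HTTP/2 concurrent-stream limit: with it set to 1, only a single request can be in flight per connection, so under load the linkerd-proxy’s requests queue up until failfast kicks in. A minimal Python sketch of that kind of misconfiguration, using the Python server from the original report; the specific option shown here is an assumption, not taken from the comment:

from concurrent import futures

import grpc

# Hypothetical reproduction of the misconfiguration described above: capping
# concurrent HTTP/2 streams at 1 means the proxy in front of this server can
# only run one request at a time per connection; everything else queues and
# may eventually be failed by the proxy. Removing the option restores the
# default (effectively unbounded) stream concurrency.
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=10),
    options=[("grpc.max_concurrent_streams", 1)],  # the problematic limit
)
# Service registration omitted in this sketch, e.g.:
# my_service_pb2_grpc.add_MyServiceServicer_to_server(MyServicer(), server)
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()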