linkerd2: gRPC .NET client throws "proxy max-concurrency exhausted" error

Bug Report

For the past few days I have been receiving the following error: Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="proxy max-concurrency exhausted")

What is the issue?

A .NET client fails to connect to a Python gRPC server on Kubernetes when using Linkerd as the service mesh.

How can it be reproduced?

I can't reproduce this myself; it has only been happening in the production environment over the last few days.

Logs, error output, etc


linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-addons
--------------
√ 'linkerd-config-addons' config map exists

linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running

Status check results are √

Environment

  • Kubernetes Version: 1.15.12
  • Cluster Environment: AKS
  • Host OS: Linux
  • Linkerd version: 2.8.1 (stable)

Possible solution

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 7
  • Comments: 45 (18 by maintainers)

Commits related to this issue

Most upvoted comments

While debugging this today, we noticed something surprising: this error message is completely misleading. The underlying error doesn't actually indicate that the service is at capacity (so we've been trying to debug a very different situation). Instead, this error simply signals that the proxy is in failfast. A service may enter failfast whenever it is unable to process requests for some amount of time (outbound: 3s, inbound: 1s). A load balancer, for instance, may have no endpoints; or, when no balancer is present, an individual endpoint may be down or otherwise not processing requests. The failfast mechanism is primarily intended to prevent the proxy from buffering requests indefinitely. If the service has been stuck in an unavailable state, we start serving error responses immediately, rather than waiting a full timeout before each request fails.

Recently (since 2.9.x), we’ve updated the proxy to include more descriptive failfast error messages, to indicate which layer of the proxy is in failfast. I’ve opened a PR to replace “max concurrency exhausted” with this more descriptive error message (https://github.com/linkerd/linkerd2-proxy/pull/847); and we’ll revisit the data we have in light of this revelation.

This is a high priority for us to address before stable-2.10.

I’m still occasionally seeing max-concurrency errors on 2.9.1. It seems to happen when k8s removes pods which are serving active requests.

We’ve merged linkerd/linkerd2-proxy#847 and linkerd/linkerd2-proxy#848 to main to help improve failfast diagnostics. I’ve built and pushed a container image for the proxy that includes these changes. It can be used with control planes from 2.9.x or recent edge releases.

spec:
  template:
    metadata:
      annotations:
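        # pin this workload to the diagnostics proxy build referenced above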
        config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
        config.linkerd.io/proxy-version: main.79c95af8

Is there any documentation with some general guidance on which ports should skip the proxy?

https://linkerd.io/2/features/protocol-detection/ – we ship with a standard set of skip ports to cover many common cases, but clearly we don't catch all of them.

Initially, I was thinking it made sense to have the proxy handle all traffic to gain the benefits of auto mTLS and span-id injection for tracing. But now I'm realizing that a lot of traffic (like database connections) already has built-in TLS and doesn't use HTTP/HTTP2, so it isn't compatible with span-id injection.

Correct, tracing doesn’t really work with arbitrary TCP protocols.

As for mTLS, the opaque ports feature I mentioned is intended to allow the proxy to transport arbitrary protocols over mTLS. This currently only works with resources that are running in your cluster (so if you’re using RDS from your cloud provider, that won’t work here). Though, it could potentially work for your cassandra use case.

Should linkerd-proxy connections be limited to HTTP/HTTP2 traffic?

Linkerd adds the most value for HTTP. However, in 2.9 we started supporting more features (mTLS, traffic split, etc) for non-HTTP traffic. In 2.10, we’ll introduce the new opaque ports feature to broaden this support to non-detectable protocols; and we’ll also start supporting non-HTTP traffic in multicluster configurations.
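
For what it's worth, a minimal sketch of how that feature is expressed as a pod annotation (assuming the config.linkerd.io/opaque-ports annotation form; the port number is only an example):

spec:
  template:
    metadata:
      annotations:
        # skip protocol detection on these ports, but still carry the raw
        # TCP stream through the proxy so it can be transported over mTLS
        config.linkerd.io/opaque-ports: "9042"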

I found some time to play with marlin.

I updated its linkerd-proxy to skip outbound port 9042 for Cassandra, and also updated the linkerd-proxy on Cassandra to skip inbound on 9042. Now marlin is starting successfully.
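
Concretely, a rough sketch of what that skip looks like as pod annotations (deployment manifests abbreviated; these are the standard config.linkerd.io/skip-* annotations):

# on the client (marlin): bypass the proxy for outbound Cassandra traffic
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/skip-outbound-ports: "9042"

# on the Cassandra pods: bypass the proxy for inbound traffic on 9042
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/skip-inbound-ports: "9042"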

Is there any documentation with some general guidance on which ports should skip the proxy?

Initially, I was thinking it made sense to have the proxy handle all traffic to gain the benefits of auto mTLS and span-id injection for tracing. But now I'm realizing that a lot of traffic (like database connections) already has built-in TLS and doesn't use HTTP/HTTP2, so it isn't compatible with span-id injection.

Should linkerd-proxy connections be limited to HTTP/HTTP2 traffic?

I’ve verified that egret doesn’t start listening on any ports until after verifying the database connection.

Instructing the linkerd-proxy to skip outbound port 5432 allows the service to startup successfully.

I might not have a chance to look at marlin until tomorrow.

For us, at least part of the issue was actually in our code, where we initialised our gRPC server with the setting

// MaxConcurrentStreams returns a ServerOption that will apply a limit on the number
// of concurrent streams to each ServerTransport.
func MaxConcurrentStreams(n uint32) ServerOption {
	return newFuncServerOption(func(o *serverOptions) {
		o.maxConcurrentStreams = n
	})
}

where n was 1. 🤦 So no wonder, really, that after a while the linkerd proxy couldn't establish connections anymore. Removing this server option solved the issue at hand, and the proxy has been working smoothly ever since.
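
For illustration, a minimal sketch (package and service names are made up) of the problematic server setup described above, and the fix of simply dropping the option:

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	// Problematic: limits every HTTP/2 connection (including the one opened
	// by linkerd-proxy) to a single concurrent stream, so additional RPCs
	// queue up behind the one already in flight.
	// srv := grpc.NewServer(grpc.MaxConcurrentStreams(1))

	// The fix described above: drop the option and rely on the default limit.
	srv := grpc.NewServer()

	// Register the actual services here before serving, e.g.
	// pb.RegisterFooServer(srv, &fooServer{})

	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}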

We saw another service where this was happening yesterday that didn't have this gRPC server setting, though. We will continue to investigate this next week.