linkerd2: linkerd-proxy connection refused/closed/reset errors.

Bug Report

What is the issue?

I am trying to find the root cause of an increase in connection drops/failures in our Linkerd-injected apps, which happens rarely. I noticed a lot of connection-related errors in the proxy logs on all our services, for example:

  • Connection closed error=Service in fail-fast
  • Connection reset by peer (os error 104)
  • Connection refused (os error 111)
  • Transport endpoint is not connected (os error 107)

I am wondering whether Linkerd has anything to do with the increased failures in our apps. I'm stuck finding the root cause, so I would really appreciate it if someone could explain more about these connection errors.

How can it be reproduced?

I am not able to reproduce the issue. The increased connection failures happen rarely, but there are a lot of TCP connection closed/refused errors in the proxy logs, as seen below.

Logs, error output, etc

"Dec 2, 2020 @ 06:45:47.043","[ 80905.816738s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.90.101:39120 target.addr=10.244.57.132:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast"
"Dec 2, 2020 @ 06:45:47.042","[ 80905.241029s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.90.101:34980 target.addr=10.244.54.131:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast"
"Dec 2, 2020 @ 06:45:51.858","[    10.069981s]  WARN ThreadId(01) inbound:accept{peer.addr=10.244.33.105:37338 target.addr=10.244.33.169:80}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)"
"Dec 2, 2020 @ 06:45:51.858","[    11.203328s]  WARN ThreadId(01) inbound:accept{peer.addr=10.244.33.1:46326 target.addr=10.244.33.169:80}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)"
"Dec 2, 2020 @ 06:45:51.858","[    22.103038s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.33.169:49660 target.addr=10.244.59.53:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast"
[  4784.452349s]  INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:44614 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[  5089.706325s]  INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:44636 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[  5124.234815s]  INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:46202 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)

[  5318.467170s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:60420 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[  5329.525274s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:59506 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[  5331.837249s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:58566 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
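
For anyone trying to squeeze more context out of these messages, one option is to raise the proxy's log level via the config.linkerd.io/proxy-log-level annotation and follow the sidecar's log directly. A minimal sketch, assuming a hypothetical workload my-app in namespace my-namespace (patching the pod template triggers a rollout, and the new log level applies on re-injection):

# Placeholders: my-namespace / my-app; "warn,linkerd=debug" adds detail from the linkerd crates
kubectl -n my-namespace patch deployment my-app --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"warn,linkerd=debug"}}}}}'

# Follow the proxy sidecar on the rolled-out pods
kubectl -n my-namespace logs -f deploy/my-app -c linkerd-proxy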

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: v1.18.8
  • Cluster Environment: Self hosted, Deployed via kubeadm
  • Host OS: Ubuntu 18.04.2 LTS
  • Linkerd version: stable-2.9.0 (deployed using helm chart)

Possible solution

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 8
  • Comments: 37 (16 by maintainers)

Most upvoted comments

Today’s edge should basically be a release candidate for stable-2.10, so even if you’re unable to completely upgrade to an edge release, it would be helpful to know whether these issues persist with recent proxies.

I’ve reproduced the steps from this comment and it still produces the same errors. 😦

$ linkerd version
Client version: edge-21.2.3
Server version: edge-21.2.3

Is your PostgreSQL service headless?

It’s RDS; it doesn’t live in Kubernetes. Meanwhile, I added RDS and some other ports to the skip list. A Linkerd upgrade requires cluster downtime, so we cannot do this too often.
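
For reference, a minimal sketch of how a port can be excluded from the outbound proxy with the config.linkerd.io/skip-outbound-ports annotation, assuming a hypothetical workload my-app in my-namespace (traffic on skipped ports bypasses the proxy entirely, so it also loses mTLS and proxy metrics):

# Placeholders: my-namespace / my-app; 5432 is the PostgreSQL/RDS port being bypassed
kubectl -n my-namespace patch deployment my-app --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/skip-outbound-ports":"5432"}}}}}'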

I am also seeing a good amount of these logs. I upgraded our staging environment to stable-2.9.1 in hopes of resolution, but to no avail.

Here is a sample of logs in one of our linkerd-proxy containers:

[515070.841395s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:43732 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[515185.917709s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:45990 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[516061.897349s]  INFO ThreadId(01) outbound:accept{peer.addr=10.0.130.174:36508 target.addr=10.0.128.58:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[516582.706765s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:45688 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[517068.129177s]  INFO ThreadId(01) outbound:accept{peer.addr=10.0.130.174:56128 target.addr=10.0.128.58:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[517373.616350s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:32954 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[517488.739828s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:35686 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Broken pipe (os error 32)
[517787.344259s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.119:60352 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[517974.381304s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:45394 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[518461.897379s]  INFO ThreadId(01) outbound:accept{peer.addr=10.0.130.174:33934 target.addr=10.0.128.58:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[518703.636162s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.119:50810 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[519482.176746s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:46966 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)

Our PostgreSQL instance is through RDS. The error that prompted me to start looking into this was a broken pipe error in our own services, though I’m wondering if the broken pipe log from the proxy is related. I have not been able to reproduce the error in our own services so that I can see it in action.

Hoping that my info that this is still occurring on stable-2.9.1 helps diagnose this.

Thank you @adleong 👍🏿 I went through the logs, and it seems like we are seeing these messages for non-headless service targets as well.

[  5351.451279s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:48372 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)

The target.addr (PostgreSQL) is a non-headless service.

Name:              my-svc-name
Namespace:         my-svc-namespace
Labels:            application=spilo
                   spilo-role=master

Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.97.182.190
Port:              postgresql  5432/TCP
TargetPort:        5432/TCP
Endpoints:         10.244.24.108:5432
Session Affinity:  None
Events:            <none>
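
For what it’s worth, a quick way to confirm whether a service is headless is to check its clusterIP (a headless service reports None); a small sketch using the names from the output above:

# Prints 10.97.182.190 here, i.e. not headless; a headless service would print None
kubectl -n my-svc-namespace get svc my-svc-name -o jsonpath='{.spec.clusterIP}'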

I will plan for an update once we have the stable-2.9.1 patch release available and will keep you posted on this. Thanks! 🙂

@christianhuening mentioned to me that this may be related to node-level conntrack limits… it would be interesting to see whether there are any dmesg logs indicating that the conntrack table is full when this is observed.
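
A rough sketch of what could be checked on an affected node to rule conntrack exhaustion in or out (assumes shell access to the node and, for the last command, the conntrack CLI being installed):

# Kernel messages logged when the conntrack table overflows
dmesg | grep -i 'table full'

# Current entry count vs. the configured limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Per-CPU insert_failed / drop counters
conntrack -S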

@olix0r We have noticed occurrences of 502 errors from Linkerd after we upgraded to 2.9.4.

The frequency of 502s was very low, but consistent enough (1-2 times every hour). Compared to that, we almost NEVER had 502s in 2.8.1.

The 502 errors did not seem to correlate with the connection closed errors in the logs. Although occurrences were very low, with so many connection closed messages we decided to roll back to 2.8.1.

@adleong fwiw, stable-2.9.3 has substantially reduced these errors and the signal-to-noise ratio is vastly improved, but we are still seeing Connection closed error=connection error: Transport endpoint is not connected (os error 107). There doesn’t appear to be any pattern (no specific ports or services). Unfortunately, upgrading to an edge release isn’t something we can do.

Thanks, @ipaleka. We’ll try to take a look at this before the next stable release.

Hi @olix0r !

I suppose I’m hitting the same error using Zalando’s postgres-operator (the spilo label above suggests that). If you have kind installed, or have a cluster somewhere that can be used for testing in the default namespace, here are the steps to reproduce the same error:

echo """# three node (two workers) cluster config
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
""" > kind-example-config.yaml

kind create cluster --config kind-example-config.yaml
linkerd install | kubectl apply -f -
kubectl wait --namespace linkerd --for=condition=ready pod --selector=linkerd.io/control-plane-component=controller --timeout=480s
kubectl get namespace default -o yaml | linkerd inject - | kubectl apply -f -

git clone https://github.com/zalando/postgres-operator.git
cd postgres-operator

kubectl create -f manifests/configmap.yaml
kubectl create -f manifests/operator-service-account-rbac.yaml
kubectl create -f manifests/postgres-operator.yaml
kubectl wait --for=condition=ready pod -l name=postgres-operator --timeout=480s
kubectl create -f manifests/minimal-postgres-manifest.yaml
kubectl wait --for=condition=ready pod -l statefulset.kubernetes.io/pod-name=acid-minimal-cluster-0 --timeout=480s

kubectl logs acid-minimal-cluster-0 -c postgres

And the output contains plenty of lines like this:

2021-01-13 21:33:15,559 ERROR: Request to server https://172.18.0.4:6443 failed: MaxRetryError("HTTPSConnectionPool(host='172.18.0.4', port=6443): Max retries exceeded with url: /api/v1/namespaces/default/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dacid-minimal-cluster (Caused by ProtocolError('Connection aborted.', OSError(0, 'Error')))",)

The same happens with the latest edge too.

It won’t happen with stable-2.8.1, but I suppose that’s because of the mTLS functionality: injecting the namespace in that version won’t automatically affect newly created pods.

Edit: forgot to add the repository cloning command…
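
To check whether the sidecar is the side aborting those API-server connections, it may help to look at the proxy container’s log on the same pod; a sketch reusing the pod name from the repro above:

kubectl logs acid-minimal-cluster-0 -c linkerd-proxy | grep -iE 'closed|reset|refused'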

@olix0r With regard to my comment, the only times this seems to have manifested as an application error are with os error 32.

[517488.739828s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:35686 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Broken pipe (os error 32)

Whenever those pop up, they align with the application errors (also broken pipe).
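
In case it helps with correlation, a hedged sketch of pulling just the broken-pipe lines from the proxy next to the application container’s own log over the same window (pod and container names are placeholders):

# my-app-pod / my-app are placeholders for the affected pod and its app container
kubectl logs my-app-pod -c linkerd-proxy --since=1h | grep 'Broken pipe'
kubectl logs my-app-pod -c my-app --since=1h | grep -i 'broken pipe'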