linkerd2: linkerd-proxy connection refused/closed/reset errors.
Bug Report
What is the issue?
I am trying to find the root cause of an increase in connection drops/failures on our Linkerd-injected apps. The failures happen only rarely, but I have noticed a lot of connection-related errors in the proxy logs of all our services, such as:
- Connection closed error=Service in fail-fast
- Connection reset by peer (os error 104)
- Connection refused (os error 111)
- Transport endpoint is not connected (os error 107)
I am wondering whether Linkerd has anything to do with the increased failures in our apps. I am stuck finding the root cause, so I would really appreciate it if someone could explain more about these connection errors.
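For anyone looking at similar symptoms, here is a minimal sketch of commands that can put these proxy log errors into context. This assumes the stable-2.9 CLI; my-namespace and my-app are placeholder names, not from this report.

# Placeholder names: replace my-namespace / my-app with the injected workload.
kubectl -n my-namespace logs deploy/my-app -c linkerd-proxy --tail=200   # proxy logs around the failures
linkerd stat deploy -n my-namespace                                      # per-deployment success rate and RPS
linkerd tap deploy/my-app -n my-namespace                                # live request stream through the proxy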
How can it be reproduced?
I am not able to reproduce the issue. The increased connection failures happen only rarely, but there are a lot of TCP connection closed/refused errors in the proxy logs, as seen below.
Logs, error output, etc
"Dec 2, 2020 @ 06:45:47.043","[ 80905.816738s] INFO ThreadId(01) outbound:accept{peer.addr=10.244.90.101:39120 target.addr=10.244.57.132:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast"
"Dec 2, 2020 @ 06:45:47.042","[ 80905.241029s] INFO ThreadId(01) outbound:accept{peer.addr=10.244.90.101:34980 target.addr=10.244.54.131:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast"
"Dec 2, 2020 @ 06:45:51.858","[ 10.069981s] WARN ThreadId(01) inbound:accept{peer.addr=10.244.33.105:37338 target.addr=10.244.33.169:80}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)"
"Dec 2, 2020 @ 06:45:51.858","[ 11.203328s] WARN ThreadId(01) inbound:accept{peer.addr=10.244.33.1:46326 target.addr=10.244.33.169:80}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)"
"Dec 2, 2020 @ 06:45:51.858","[ 22.103038s] INFO ThreadId(01) outbound:accept{peer.addr=10.244.33.169:49660 target.addr=10.244.59.53:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast"
[ 4784.452349s] INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:44614 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[ 5089.706325s] INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:44636 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[ 5124.234815s] INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:46202 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[ 5318.467170s] INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:60420 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[ 5329.525274s] INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:59506 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[ 5331.837249s] INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:58566 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
linkerd check output
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match
Status check results are √
Environment
- Kubernetes Version: v1.18.8
- Cluster Environment: Self hosted, Deployed via kubeadm
- Host OS: Ubuntu 18.04.2 LTS
- Linkerd version: stable-2.9.0 (deployed using helm chart)
Possible solution
Additional context
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 8
- Comments: 37 (16 by maintainers)
Commits related to this issue
- server: Log connection closed messages at DEBUG Per linkerd/linkerd2#5331, it seems that "Connection closed" log messages, which in many cases are completely innocuous, cause concern when people see t... — committed to linkerd/linkerd2-proxy by olix0r 3 years ago
- server: Log connection closed messages at DEBUG (#931) Per linkerd/linkerd2#5331, it seems that "Connection closed" log messages, which in many cases are completely innocuous, cause concern when peo... — committed to linkerd/linkerd2-proxy by olix0r 3 years ago
I’ve reproduced the steps from this comment and it still produces the same errors. 😦
It’s RDS; it doesn’t live in Kubernetes. In the meantime I have added the RDS port and some other ports to the skipped ports. A Linkerd upgrade requires cluster downtime for us, so we cannot do that too often.
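For reference, a hedged sketch of one way to skip such a port via the standard config.linkerd.io/skip-outbound-ports proxy-injector annotation. The workload and namespace names are placeholders; 5432 is the PostgreSQL/RDS port from the logs above.

# Placeholder names; annotating the pod template means outbound traffic to
# these ports bypasses the proxy after the next rollout of the workload.
kubectl -n my-namespace patch deployment my-app -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/skip-outbound-ports":"5432"}}}}}'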
I am also seeing a good amount of these logs. I upgraded our staging environment to stable-2.9.1 in hopes of a resolution, but to no avail. Here is a sample of the logs in one of our linkerd-proxy containers:
Our PostgreSQL instance is through RDS. The error that prompted me to start looking into this was a broken pipe error in our own services, though I’m wondering if the broken pipe log from the proxy is related. I have not been able to reproduce the error in our own services so that I can see it in action. Hoping that my info that this is still occurring on stable-2.9.1 helps diagnose this.
Thank you @adleong 👍🏿 I went through the logs and it seems like we are having these messages for non-headless service targets as well. The target.addr (PostgreSQL) is a non-headless service. I will plan for an update once we have the stable-2.9.1 patch release available and will keep you posted on this. Thanks! 🙂
@christianhuening mentioned to me that this may be related to node-level conntrack limits… it would be interesting to see if there are any dmesg logs indicating the conntrack table is full when this is observed.
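A hedged sketch of how the conntrack theory could be checked on the node that hosts an affected pod (standard Linux interfaces, nothing Linkerd-specific):

# Run on the node: compare the current conntrack entries against the limit,
# and look for kernel messages about the table being full.
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
dmesg -T | grep -i "nf_conntrack: table full"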
@olix0r We have noticed 502 errors from Linkerd after we upgraded to 2.9.4. The frequency of 502s was very low, but consistent (1-2 times every hour). By comparison, we almost never had 502s on 2.8.1. The 502 errors did not seem to correlate with the "connection closed" log errors, but even though occurrences were very low, with so many "connection closed" messages we decided to roll back to 2.8.1.
@adleong fwiw, stable-2.9.3 has substantially reduced these errors and the signal-to-noise ratio is vastly improved, but we are still seeing Connection closed error=connection error: Transport endpoint is not connected (os error 107). There doesn’t appear to be any pattern (no specific ports or services). Unfortunately, upgrading to an edge release isn’t something we can do.
Thanks, @ipaleka. We’ll try to take a look at this before the next stable release.
Hi @olix0r !
I suppose I’ve got the same error using Zalando’s postgres-operator (the spilo mentioned above suggests that). If you have kind installed, or have a cluster somewhere that you use for testing in the default namespace, here are the steps to reproduce the same error:
And the output contains plenty of lines such as:
2021-01-13 21:33:15,559 ERROR: Request to server https://172.18.0.4:6443 failed: MaxRetryError("HTTPSConnectionPool(host='172.18.0.4', port=6443): Max retries exceeded with url: /api/v1/namespaces/default/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dacid-minimal-cluster (Caused by ProtocolError('Connection aborted.', OSError(0, 'Error')))",)
The same happens with the latest edge too.
It won’t happen with stable-2.8.1, but I suppose that’s because of the mTLS functionality: injecting the namespace in that version won’t automatically affect newly created pods.
Edit: forgot to add the repository cloning command…
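For context, the namespace-level auto-injection mentioned above is just an annotation on the namespace; a minimal sketch (using the default namespace from the repro steps):

# Pods created in the annotated namespace are auto-injected by the
# proxy-injector; existing pods need a restart to pick it up.
kubectl annotate namespace default linkerd.io/inject=enabled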
@olix0r With regard to my comment, the only times this seems to have manifested as an application error are with os error 32. Whenever those pop up, they align with the application errors (also broken pipe).