linkerd2: 502 Bad Gateway from proxy on retry after timeout

Bug Report

What is the issue?

I call an external HTTP/1.1 service from a linkerd2-injected container. The service takes a long time to handle each request, and the client side (which uses a 300 s timeout, per the logs below) frequently times out and retries.

The behavior I’m seeing is:

  1. Request made
  2. Request times out
  3. About 1 second later, my container retries the request, and
  4. linkerd proxy immediately returns 502 Bad Gateway to the container

How can it be reproduced?

I don’t have a full reproduction; however, in my environment I can reproduce it reliably by blocking the external service from responding, which guarantees that the client-side timeout fires.
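
For reference, the client side follows roughly this pattern. This is a minimal sketch using only kotlinx.coroutines; longRunningCall is a hypothetical stand-in for the real Ktor POST of the ~10 MB PDF to /api/p, and the 300 s timeout and ~1 s retry delay match what the logs below show.

import kotlinx.coroutines.TimeoutCancellationException
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withTimeout

// Hypothetical stand-in for the real Ktor POST to /api/p; here it just sleeps
// past the timeout so every attempt times out, as in the logs below.
suspend fun longRunningCall(): String {
    delay(400_000)
    return "ok"
}

fun main() = runBlocking {
    repeat(3) { attempt ->  // retry count is arbitrary for the sketch
        try {
            // 300 s client-side timeout, matching the TimeoutCancellationException in the logs.
            val result = withTimeout(300_000) { longRunningCall() }
            println("Long-running call attempt $attempt succeeded: $result")
            return@runBlocking
        } catch (e: TimeoutCancellationException) {
            println("Long-running call attempt $attempt failed, retrying...")
            delay(1_000)  // ~1 s later the retry goes out; the proxy answers it with 502 immediately
        }
    }
}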

Logs, error output, etc

app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815158241Z DBUG [210207.178386s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound dst=Some(DstAddr { dst_logical: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), dst_concrete: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), direction: Out, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815356042Z DBUG [210207.178566s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound ep=Some(Endpoint { dst_logical: None, dst_concrete: None, addr: V4(10.254.2.5:7071), identity: None(NoPeerName(NotProvidedByServiceDiscovery)), metadata: Metadata { weight: 10000, labels: {}, protocol_hint: Unknown, identity: None }, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815472742Z DBUG [210207.178643s] 10.254.2.5:7071 linkerd2_proxy::proxy::http::client client request: method=POST uri=http://x-ops-fevm.azure.x.com:7071/api/p version=HTTP/1.1 headers={"host": "x-ops-fevm.azure.x.com:7071", "accept": "application/json", "accept-charset": "UTF-8", "user-agent": "Ktor client", "content-type": "application/pdf", "content-length": "9984137", "l5d-dst-canonical": "x-ops-fevm.azure.x.com:7071"}
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815498442Z DBUG [210207.178721s] linkerd2_proxy::transport::connect connecting to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.817578849Z DBUG [210207.180791s] linkerd2_proxy::transport::connect connection established to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.817596449Z DBUG [210207.180878s] linkerd2_proxy::transport::metrics client connection open
[... bunch of linkerd2_proxy::transport::tls::listen ...]
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:18:01.392623774Z DBUG [210314.755744s] linkerd2_proxy::proxy::server http1 server error: Error(BodyWrite, Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })
[... bunch of linkerd2_proxy::transport::tls::listen ...]
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:19:15.452792Z DBUG [210388.815977s] linkerd2_proxy::proxy::tcp forward duplex complete: Connection reset by peer (os error 104)
[... bunch of linkerd2_proxy::transport::tls::listen ...]
app-55db9ff5bf-sthkq app 2019-10-17T16:21:13.8160006Z 2019-10-17 16:21:13,815 WARN  --- [ty-call-handler#2156] MyApp          : Long-running call attempt 0 failed, retrying... Timed out waiting for 300000 ms [[EXCEPTION: kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 300000 ms|       ... suppressed 5 lines| at java.lang.Thread.run(Thread.java:834) [?:?]|]]
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.821686419Z DBUG [210508.184840s] linkerd2_proxy::transport::tls::listen accepted connection from 10.5.1.125:35288 to Some(V4(10.254.2.5:7071)); skipping TLS (loopback)
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.822822123Z DBUG [210508.186064s] linkerd2_proxy::app::outbound outbound addr=Ok(Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }))
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.822892923Z DBUG [210508.186148s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound dst=Some(DstAddr { dst_logical: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), dst_concrete: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), direction: Out, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.823080324Z DBUG [210508.186321s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound ep=Some(Endpoint { dst_logical: None, dst_concrete: None, addr: V4(10.254.2.5:7071), identity: None(NoPeerName(NotProvidedByServiceDiscovery)), metadata: Metadata { weight: 10000, labels: {}, protocol_hint: Unknown, identity: None }, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.823131324Z DBUG [210508.186397s] 10.254.2.5:7071 linkerd2_proxy::proxy::http::client client request: method=POST uri=http://x-ops-fevm.azure.x.com:7071/api/p version=HTTP/1.1 headers={"host": "x-ops-fevm.azure.x.com:7071", "accept": "application/json", "accept-charset": "UTF-8", "user-agent": "Ktor client", "content-type": "application/pdf", "content-length": "9984137", "l5d-dst-canonical": "x-ops-fevm.azure.x.com:7071"}
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.823202624Z DBUG [210508.186450s] linkerd2_proxy::transport::connect connecting to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.82501623Z DBUG [210508.188244s] linkerd2_proxy::transport::connect connection established to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.82503323Z DBUG [210508.188303s] linkerd2_proxy::transport::metrics client connection open
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.825265131Z DBUG [210508.188420s] proxy={client=out dst=10.254.2.5:7071 proto=Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false }} linkerd2_proxy::proxy::http::glue http body error: error reading a body from connection: end of file before message length reached
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.825554932Z ERR! [210508.188673s] linkerd2_proxy::app::errors unexpected error: error from user's Payload stream: error reading a body from connection: end of file before message length reached
app-55db9ff5bf-sthkq app 2019-10-17T16:21:14.827497238Z 2019-10-17 16:21:14,826 WARN  --- [ty-call-handler#2156] MyApp          : Long-running call attempt 1 failed, retrying... Unable to do long running call, status=[502 Bad Gateway]

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: 1.12.7
  • Cluster Environment: AKS (via AKS-engine)
  • Host OS: Ubuntu 16.04.6 LTS
  • Linkerd version:
$ linkerd version
Client version: stable-2.6.0
Server version: stable-2.6.0

Possible solution

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 19 (13 by maintainers)

Most upvoted comments

(I investigated the Node.js http/socket library that elasticsearch-js uses internally. However, it seems the library does not have half-close behavior.)

I don’t think it’s only the client that will/can cause a half-closed connection. I’m not an ES expert, but there are a number of these ES half-closed-connection issues out there: https://github.com/elastic/elasticsearch-php/issues/225, https://discuss.elastic.co/t/logstash-output-elasticsearch-connection-reset-while-trying-to-send-bulk-request-to-elasticsearch/173941, https://discuss.elastic.co/t/transportclient-disconnecting-and-not-reconnecting-automatically/13791, https://stackoverflow.com/q/49844295/1144203. They sound very similar to what you are seeing, right? And they all point to different factors that can contribute to half-closed connections.
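
For context on the terminology: a half-close is when one peer shuts down the write half of a TCP connection (sending a FIN) while keeping the read half open to receive the response. A minimal Kotlin sketch of a client doing this with a plain java.net.Socket; the host, port, and request are placeholders, not taken from the issue.

import java.net.Socket

fun main() {
    // Placeholder host/port; any HTTP/1.1 server will do for the illustration.
    Socket("example.com", 80).use { sock ->
        val request = "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
        sock.getOutputStream().write(request.toByteArray())
        // Half-close: signal "no more data from me" with a FIN, but keep reading.
        sock.shutdownOutput()
        val response = sock.getInputStream().readBytes()
        println(response.toString(Charsets.ISO_8859_1).lineSequence().first()) // e.g. "HTTP/1.1 200 OK"
    }
}

If the other side (or an intermediary) treats that FIN as a full close and tears the connection down while a proxy still holds it in its connection pool, the next request reusing that pooled connection fails immediately, which would be consistent with the instant 502 on retry described above.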