linkerd2: Intermittent 502 Bad Gateway issue when service is meshed

Bug Report

What is the issue?

Without the linkerd proxy, there is no trace of 502 Bad Gateway at either the ingress level or the app level. With the linkerd proxy enabled on the nginx ingress and/or the app, intermittent 502 Bad Gateway errors appear in the system.

I see two types of errors from the proxy:

## for this one, the requests actually made it through to the upstream
linkerd2_app_core::errors: Failed to proxy request: connection closed before message completed

and

## for this one the requests never made it to the upstream
linkerd2_app_core::errors: Failed to proxy request: connection error: Connection reset by peer (os error 104)

both lead to (on ingress)

"GET /v1/users?ids=xxxxxxx,xxxxx HTTP/1.1" 502 0 "-" "python-requests/2.23.0" 1708 0.097 [xxxxx-app-http] [] 10.4.27.106:80 0 0.100 502 7543fb48ce22d5c6145b97daadde93d9

or (on an app)

## Here I suspect a connection timeout that could be tuned to give the proxy more time to establish the connection. I managed to confirm that by giving more resources to the datadog agent daemonset

Failed to send traces to Datadog Agent at http://10.4.87.60:8126: HTTP error status 502, reason Bad Gateway, message content-length: 0
date: Wed, 12 Aug 2020 17:58:22 GMT

I have ruled out scale-up and scale-down events of the ingress and the app. I have also ruled out graceful termination, for which I gave the proxy a sleep of 30s on the app and 40s on the ingress.
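For reference, a minimal sketch of the app-side termination settings described above (the annotation value is the 30s from the description; the grace period value is illustrative and only needs to exceed the proxy sleep):

    # App deployment pod template (sketch; only the relevant fields shown)
    template:
      metadata:
        annotations:
          linkerd.io/inject: enabled
          # proxy lingers 30s on the app (40s on the ingress) before exiting
          config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "30"
      spec:
        # illustrative; must be longer than the proxy sleep above
        terminationGracePeriodSeconds: 60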

How can it be reproduced?

Logs, error output, etc

Here are some linkerd-debug logs, where 10.3.45.50 is the ingress controller pod and 10.3.23.252 is the upstream server pod:

1829200 2893.254017273    127.0.0.1 → 127.0.0.1    TCP 68 443 → 57590 [ACK] Seq=3277 Ack=1894 Win=175744 Len=0 TSval=378310645 TSecr=378310644
1829201 2893.254172427   10.3.45.50 → 127.0.0.1    HTTP 1701 GET /v1/users/xxxxxxxx?include=active_contract HTTP/1.1 
1829202 2893.254471803   10.3.45.50 → 10.3.23.252  TCP 76 47414 → 80 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=2090260178 TSecr=0 WS=128
1829203 2893.254837416  10.3.23.252 → 10.3.45.50   TCP 56 80 → 47414 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
1829204 2893.254974304  10.3.23.252 → 10.3.45.50   HTTP 152 HTTP/1.1 502 Bad Gateway 

I got similar traces with tcpdump via the ksniff kubectl plugin.

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2020-08-13T08:39:27Z
    see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-addons
--------------
√ 'linkerd-config-addons' config map exists

Environment

  • Kubernetes Version: 1.16, 1.17
  • Cluster Environment: EKS
  • Host OS: Amazon Linux 2 (eks optimized ami)
  • Linkerd version: 2.8.1
  • nginx ingress version: 0.34.1
  • upstream app: python 3.7 with hypercorn server

nginx ingress service annotations

    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
    service.beta.kubernetes.io/aws-load-balancer-type: elb

nginx ingress deployment annotation

    config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "40"
    linkerd.io/inject: enabled

Possible solution

Tune all timeouts, both per app/service via proxy annotations and globally via the chart's global configuration (a sketch follows the list below):

  • keep-alive
  • connection
  • read timeout (I didn't see this one in linkerd, but nginx has something similar)
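A minimal sketch of the per-workload part, assuming the connect-timeout annotations from the Linkerd proxy configuration docs are available in the installed release (the values below are examples only):

    # Pod template annotations (sketch; annotation availability depends on
    # the Linkerd version, values are illustrative)
    template:
      metadata:
        annotations:
          linkerd.io/inject: enabled
          config.linkerd.io/proxy-outbound-connect-timeout: "3000ms"
          config.linkerd.io/proxy-inbound-connect-timeout: "1000ms"

The global equivalent would be the corresponding proxy timeout values in the Helm chart, if the installed chart version exposes them.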

Additional context

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 27 (12 by maintainers)

Most upvoted comments

@mbelang

any reason you tagged the issue for 2.10 release?

I want to make sure that we take a deeper look at issues like this before we cut another stable release.

@olix0r I know that retries would not solve the issue, but they would at least mitigate it. I will plan an upgrade to 2.9.1 and see how it goes from there.

I did try to put the proxy in debug mode, but I didn't manage to get more information than what I posted here. Maybe I missed it, but it is a fairly rare event that is very hard to catch, and I do not want to debug it in the production cluster, though I suspect that the elasticity of the production cluster could have an impact on the issue.
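For context, putting the proxy in debug mode boils down to a single annotation on the workload (assuming the standard config.linkerd.io/proxy-log-level annotation; debug output is very verbose, so it should be reverted afterwards):

    # Sketch: raise the proxy log level on a single workload
    template:
      metadata:
        annotations:
          config.linkerd.io/proxy-log-level: debug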

@mbelang reports that this problem goes away when all pods are meshed; so this points strongly to the HTTP/1.1 client.

@mbelang and I had a chance to talk through this issue in Slack this morning. I think we have a good enough handle on it to put together a repro setup:

  • nginx ingress, injected with proxy
  • app with python HTTP server, uninjected

Then, we should try putting consistent load on the ingress. Ideally, we’d test this all on EKS with the latest AWS CNI, as it seems plausible that it’s a bad interaction at the network layer.

If we can reproduce this with this kind of setup, then I think it should be pretty straightforward to diagnose/fix. If we can’t, we can start digging into more details about how this repro setup differs from @mbelang’s actual system.
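A rough sketch of the two workloads in that repro setup follows; every name, label, and image below is a placeholder rather than something from the actual system:

    # Hypothetical repro: meshed nginx ingress in front of an unmeshed
    # Python/hypercorn app. All names and images are placeholders.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ingress-nginx
    spec:
      replicas: 1
      selector:
        matchLabels: {app: ingress-nginx}
      template:
        metadata:
          labels: {app: ingress-nginx}
          annotations:
            linkerd.io/inject: enabled          # ingress is meshed
        spec:
          containers:
          - name: controller
            image: ingress-nginx-controller:v0.34.1   # placeholder image ref
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: python-app
    spec:
      replicas: 1
      selector:
        matchLabels: {app: python-app}
      template:
        metadata:
          labels: {app: python-app}
          annotations:
            linkerd.io/inject: disabled         # app deliberately left unmeshed
        spec:
          containers:
          - name: app
            image: python-hypercorn-app:latest  # placeholder: python 3.7 + hypercorn
            ports:
            - containerPort: 80

Steady load through the ingress could then come from a simple python-requests loop, matching the client that shows up in the access logs above.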

but I do have problems with a meshed app trying to reach out to datadog agent.

To help further narrow down the repro steps, do these 502s happen only when the datadog agent is the target service?