linkerd2: Intermittent 502 Bad Gateway issue when service is meshed
Bug Report
What is the issue?
Without the linkerd proxy, there is no trace of 502 Bad Gateway at the ingress level or the app level. With the linkerd proxy enabled on the nginx ingress and/or the app, intermittent 502 Bad Gateway responses appear in the system.
I see two types of errors from the proxy:
## for this one the requests actually made it through to the upstream
linkerd2_app_core::errors: Failed to proxy request: connection closed before message completed
and
## for this one the requests never made it to the upstream
linkerd2_app_core::errors: Failed to proxy request: connection error: Connection reset by peer (os error 104)
Both lead to (on the ingress):
"GET /v1/users?ids=xxxxxxx,xxxxx HTTP/1.1" 502 0 "-" "python-requests/2.23.0" 1708 0.097 [xxxxx-app-http] [] 10.4.27.106:80 0 0.100 502 7543fb48ce22d5c6145b97daadde93d9
or (on an app):
## Here I suspect a connection timeout that could be tuned to give the proxy more time to establish the connection. I managed to prove that by giving more resources to the Datadog agent DaemonSet
Failed to send traces to Datadog Agent at http://10.4.87.60:8126: HTTP error status 502, reason Bad Gateway, message content-length: 0
date: Wed, 12 Aug 2020 17:58:22 GMT
I have ruled out scale-up and scale-down events of the ingress and the app. I also ruled out graceful termination, for which I set the proxy sleep period to 30s on the app and 40s on the ingress.
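For reference, a rough sketch of the kind of termination settings meant here, applied to a Deployment's pod template; the grace period and the app container's preStop sleep are illustrative assumptions, not copied from the actual manifests:

spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
        # proxy keeps running for a while after SIGTERM so in-flight requests can drain
        config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "30"
    spec:
      # assumed value; must be longer than the proxy sleep above
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          lifecycle:
            preStop:
              exec:
                # assumed: hold the app open while endpoints are removed from load balancers
                command: ["sleep", "10"]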
How can it be reproduced?
Logs, error output, etc
Here are some linkerd-debug logs, where 10.3.45.50 is the ingress controller pod and 10.3.23.252 is the upstream server pod:
1829200 2893.254017273 127.0.0.1 → 127.0.0.1 TCP 68 443 → 57590 [ACK] Seq=3277 Ack=1894 Win=175744 Len=0 TSval=378310645 TSecr=378310644
1829201 2893.254172427 10.3.45.50 → 127.0.0.1 HTTP 1701 GET /v1/users/xxxxxxxx?include=active_contract HTTP/1.1
1829202 2893.254471803 10.3.45.50 → 10.3.23.252 TCP 76 47414 → 80 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=2090260178 TSecr=0 WS=128
1829203 2893.254837416 10.3.23.252 → 10.3.45.50 TCP 56 80 → 47414 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
1829204 2893.254974304 10.3.23.252 → 10.3.45.50 HTTP 152 HTTP/1.1 502 Bad Gateway
I’ve got similar traces with tcpdump via the ksniff kubectl plugin.
linkerd check output
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2020-08-13T08:39:27Z
see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match
linkerd-addons
--------------
√ 'linkerd-config-addons' config map exists
Environment
- Kubernetes Version: 1.16, 1.17
- Cluster Environment: EKS
- Host OS: Amazon Linux 2 (eks optimized ami)
- Linkerd version: 2.8.1
- nginx ingress version: 0.34.1
- upstream app: python 3.7 with hypercorn server
nginx ingress service annotations
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
service.beta.kubernetes.io/aws-load-balancer-type: elb
nginx ingress deployment annotations
config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "40"
linkerd.io/inject: enabled
Possible solution
Tune all timeouts per app/service with proxy annotations, and also globally via the chart's global configuration (see the sketch after this list):
- keep-alive
- connection
- read timeout (I didn’t see this one in Linkerd, but it would be similar to nginx's)
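A minimal sketch of what the per-workload part could look like as pod-template annotations; the connect-timeout annotation name and its availability depend on the Linkerd version, so treat it as an assumption to verify against the docs:

metadata:
  annotations:
    linkerd.io/inject: enabled
    # assumed annotation; gives the proxy more time to establish upstream connections
    config.linkerd.io/proxy-outbound-connect-timeout: 1000ms

Keep-alive and read timeouts would presumably need equivalent knobs set globally through the Helm chart values, if and where the proxy exposes them.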
Additional context
About this issue
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 27 (12 by maintainers)
@mbelang
I want to make sure that we take a deeper look at issues like this before we cut another stable release.
@olix0r I know that retries would not solve the issue, but they would at least mitigate it. I will plan an upgrade to 2.9.1 and see how it goes from there.
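For reference, a minimal sketch of how route retries could be enabled as a mitigation via a ServiceProfile; the service name, namespace, route, and budget values are assumptions for illustration:

apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # assumed FQDN of the upstream service
  name: users-app.default.svc.cluster.local
  namespace: default
spec:
  routes:
    - name: GET /v1/users
      condition:
        method: GET
        pathRegex: /v1/users.*
      # the GET is idempotent, so the proxy may retry it on failure
      isRetryable: true
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s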
I did try to put the proxy in debug mode, but I didn’t manage to get more information than what I posted here. Maybe I missed it, but it is a fairly rare event that is very hard to catch, and I do not want to debug it in the production cluster, though I suspect that the elasticity of the production cluster could have an impact on the issue.
@mbelang reports that this problem goes away when all pods are meshed; so this points strongly to the HTTP/1.1 client.
@mbelang and I had a chance to talk through this issue in Slack this morning. I think we have a good enough handle on it to put together a repro setup:
Then, we should try putting consistent load on the ingress. Ideally, we’d test this all on EKS with the latest AWS CNI, as it seems plausible that it’s a bad interaction at the network layer.
If we can reproduce this with this kind of setup, then I think it should be pretty straightforward to diagnose/fix. If we can’t, we can start digging into more details about how this repro setup differs from @mbelang’s actual system.
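As a starting point for the "consistent load on the ingress" step, a sketch of a throwaway load-generator Deployment; the image, target host/path, and request rate are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-load   # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ingress-load
  template:
    metadata:
      labels:
        app: ingress-load
    spec:
      containers:
        - name: load
          image: curlimages/curl:7.72.0   # assumed image
          command: ["/bin/sh", "-c"]
          args:
            # loop GETs against the ingress and print only the status codes,
            # so intermittent 502s stand out; adjust host, path, and rate as needed
            - while true; do curl -s -o /dev/null -w '%{http_code}\n' http://INGRESS_HOST/v1/users; sleep 0.1; done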
To help further narrow down the repro steps, do these 502s happen only when the datadog agent is the target service?