linkerd2: load balancer discovery error: discovery task failed

Bug Report

What is the issue?

At some point linkerd-proxy loses the ability to reach a service (possibly because the service genuinely disappeared for a while), and afterwards it cannot connect to that service again until the proxy is restarted.

How can it be reproduced?

Not sure.

Logs, error output, etc

linkerd-proxy logs:

WARN [188674.700387s] outbound:accept{peer.addr=10.0.6.112:59298}:source{target.addr=10.0.17.202:80}:logical{addr=service-name:80}:making:profile:balance{addr=service-name.default.svc.cluster.local:80}: linkerd2_proxy_discover::buffer dropping resolution due to watchdog timeout timeout=60s
ERR! [262801.414146s] outbound:accept{peer.addr=10.0.6.112:59282}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")
ERR! [262857.738884s] outbound:accept{peer.addr=10.0.6.112:33972}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")
ERR! [262959.610891s] outbound:accept{peer.addr=10.0.6.112:39410}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")

linkerd check output

Status check results are √

Environment

  • Kubernetes Version: 1.15.5
  • Cluster Environment: AKS
  • Host OS: Ubuntu 16.04.6 LTS
  • Linkerd version: edge-20.1.1

Possible solution

Additional context

The service that the proxy can't reach is a Service without a selector, and it is not meshed. A sketch of such a Service is shown below.
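
For context, a Service without a selector is one whose Endpoints are maintained by hand (or by an external controller) rather than generated by Kubernetes from a pod selector, so the set of addresses the destination controller resolves can change independently of any pods. This is only a minimal sketch: the name and namespace are taken from the logs above, and the backend IP is made up for illustration.

apiVersion: v1
kind: Service
metadata:
  name: service-name          # placeholder matching the name in the logs above
  namespace: default
spec:
  # No selector, so Kubernetes will not manage Endpoints for this Service.
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
apiVersion: v1
kind: Endpoints
metadata:
  name: service-name          # must match the Service name
  namespace: default
subsets:
  - addresses:
      - ip: 10.0.20.15        # example backend address, maintained manually or by an external controller
    ports:
      - port: 80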

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 10
  • Comments: 30 (17 by maintainers)

Most upvoted comments

We’re also having this issue. It happens intermittently for some pods. When you access the other service directly by its ClusterIP, you get a response; however, if you use the service’s name, you get a 502 Bad Gateway error.

The Linkerd proxy sidecar shows the following logs:

[ 59763.318176661s]  WARN outbound:accept{peer.addr=100.110.0.6:57546}:source{target.addr=100.66.25.52:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 60010.715741422s]  WARN outbound:accept{peer.addr=100.110.0.6:60360}:source{target.addr=100.66.25.52:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed

We experienced these same symptoms when the default memory limit that shipped with Helm was too low for the control plane components on our clusters of > 1k pods (smaller clusters were fine and didn’t show this problem). Increasing the limit from 250Mi to 500Mi on both the destination and controller deployments seems to have helped; a sketch of the change follows.
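
As a rough sketch of what that change looks like on the deployment side — the deployment and container name ("destination") are assumptions and may differ between Linkerd versions; only the 250Mi → 500Mi limit change described above is taken from the comment:

# Fragment of the destination Deployment spec after the bump (sketch only).
spec:
  template:
    spec:
      containers:
        - name: destination
          resources:
            limits:
              memory: 500Mi   # raised from the default 250Mi limit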

In our test env we also saw this in stable-2.7.0:

[ 81965.930575036s]  WARN outbound:accept{peer.addr=10.0.0.157:50678}:source{target.addr=172.20.192.163:8100}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 81977.951660481s]  WARN outbound:accept{peer.addr=10.0.0.157:58822}:source{target.addr=172.20.192.163:8100}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed

We are not using cert-manager integration.

Rolled back to 2.6.1.