linkerd2: load balancer discovery error: discovery task failed
Bug Report
What is the issue?
At some point linkerd-proxy can't reach a service (perhaps it really did disappear), and after that it can't connect to it until the proxy is restarted.
How can it be reproduced?
Not sure.
Logs, error output, etc
linkerd-proxy logs:
WARN [188674.700387s] outbound:accept{peer.addr=10.0.6.112:59298}:source{target.addr=10.0.17.202:80}:logical{addr=service-name:80}:making:profile:balance{addr=service-name.default.svc.cluster.local:80}: linkerd2_proxy_discover::buffer dropping resolution due to watchdog timeout timeout=60s
ERR! [262801.414146s] outbound:accept{peer.addr=10.0.6.112:59282}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")
ERR! [262857.738884s] outbound:accept{peer.addr=10.0.6.112:33972}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")
ERR! [262959.610891s] outbound:accept{peer.addr=10.0.6.112:39410}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")
linkerd check output:
Status check results are √
Environment
- Kubernetes Version: 1.15.5
- Cluster Environment: AKS
- Host OS: Ubuntu 16.04.6 LTS
- Linkerd version: edge-20.1.1
Possible solution
Additional context
The service that the proxy can't reach is a Service without a selector, and it is not meshed.
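For reference, this is roughly what such a Service looks like (a hypothetical sketch only; the name, namespace, port, and backend IP are placeholders, not taken from the affected cluster). Because there is no selector, Kubernetes does not manage the Endpoints object itself, so the destination controller only ever sees endpoints that are written by hand:

```yaml
# Hypothetical Service without a selector.
apiVersion: v1
kind: Service
metadata:
  name: service-name
  namespace: default
spec:
  ports:
    - port: 80
      targetPort: 80
---
# Manually managed Endpoints for the Service above.
# Kubernetes will not create or update this automatically.
apiVersion: v1
kind: Endpoints
metadata:
  name: service-name        # must match the Service name
  namespace: default
subsets:
  - addresses:
      - ip: 10.0.17.250     # placeholder backend IP
    ports:
      - port: 80
```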
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 10
- Comments: 30 (17 by maintainers)
We’re also having this issue. It happens intermittently for some pods. When you access the other service directly by its Cluster IP, you get a response; however, if you use the service’s name, you get a 502 Bad Gateway error.
The Linkerd proxy sidecar shows the following logs:
We experienced these same symptoms when the default memory limit that shipped with Helm was too low for the control-plane components on our clusters of >1k pods (smaller clusters were fine and didn’t demonstrate this problem). Increasing the limit from 250Mi to 500Mi on both the destination and controller deployments seems to have helped.
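For anyone trying the same mitigation, the change amounts to raising the memory limit in the deployment's pod spec, roughly as sketched below. The deployment and container names are assumptions based on a stable-2.7-era install and should be checked against your own cluster (e.g. via kubectl -n linkerd get deploy); if the install is managed by Helm, the equivalent resource settings in the chart values are the better place to make the change so it survives upgrades.

```yaml
# Sketch of a strategic-merge patch that raises the memory limit on the
# destination controller. Container name "destination" is an assumption;
# verify it before applying (kubectl -n linkerd edit deploy linkerd-destination
# works as well). Repeat the same change for the controller deployment.
spec:
  template:
    spec:
      containers:
        - name: destination
          resources:
            limits:
              memory: 500Mi   # raised from the default 250Mi
```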
In our test env we also saw this in stable-2.7.0. We are not using the cert-manager integration. Rolled back to 2.6.1.