linkerd2: Linkerd 2.5.0: linkerd2_proxy::app::errors unexpected error: error trying to connect: No route to host (os error 113) (address: 10.10.3.181:8080)

Bug Report

What is the issue?

We have an injected pod that has been running for days, connecting to a partially injected (1 of 3 pods) deployment, and its proxy eventually throws the error mentioned below.

How can it be reproduced?

Run a pod for days and let it talk to a deployment that is regularly restarted, so that new pods with new IP addresses keep replacing the old ones.
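A minimal sketch of the kind of churn described above, assuming a hypothetical backend deployment named my-backend in namespace my-ns (placeholder names, not from this report), with a long-lived injected client continuously sending traffic to its Service:

    # Repeatedly roll the backend deployment so its pods (and pod IPs) keep changing,
    # while the long-lived, injected client keeps talking to the backend Service.
    while true; do
      kubectl -n my-ns rollout restart deployment my-backend
      kubectl -n my-ns rollout status deployment my-backend
      sleep 600   # pause between restarts to mimic regular redeploys
    done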

Logs, error output, etc

linkerd-proxy ERR! [589538.085822s] linkerd2_proxy::app::errors unexpected error: error trying to connect: No route to host (os error 113) (address: 10.10.3.181:8080)

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: 1.15.2
  • Cluster Environment: custom
  • Host OS: CoreOS 2191.5.0
  • Linkerd version: 2.5.0

Possible solution

Additional context

To me, not knowing all the details, it looks as though the proxy is not “refreshing” the endpoints for the service and eventually just runs out of valid IP addresses. For us it would be fine if the proxy simply exited and let the pod be restarted.
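One way to check that hypothesis is to compare what the Linkerd control plane resolves for the service with what Kubernetes itself lists. A sketch, assuming a hypothetical service my-backend on port 8080 in namespace my-ns (placeholder names); depending on the CLI version, the authority passed to linkerd endpoints may or may not need to include the port:

    # What the Linkerd destination service currently resolves for the authority
    linkerd endpoints my-backend.my-ns.svc.cluster.local:8080

    # What Kubernetes itself lists for the same Service
    kubectl -n my-ns get endpoints my-backend -o wide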

Also: linkerd is pretty awesome, thanks for all your effort you put into it!

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 57 (32 by maintainers)

Most upvoted comments

Newer versions of Linkerd (e.g., edge-20.3.4) have been updated to handle service discovery differently. If you’re still experiencing these issues, I recommend annotating your workload with config.linkerd.io/proxy-version: edge-20.3.4. If you test this, please report back! These changes will be released soon in stable-2.7.1.
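For reference, one way to apply that annotation is to patch the workload's pod template so the proxy injector picks up the override on the next rollout. A sketch, assuming a hypothetical deployment my-app in namespace my-ns (placeholder names, not from the thread):

    # Add the proxy-version override to the pod template; this triggers a rollout,
    # and the injector re-injects the pods with the requested proxy version.
    kubectl -n my-ns patch deployment my-app -p \
      '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-version":"edge-20.3.4"}}}}}'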

We are experiencing a similar issue, but we don’t have any No route to host in our logs.

We run an nginx-ingress (not meshed) that forwards to a Kong 1.2.0 API gateway (meshed). I have noticed 503 errors from some of the API gateway pods when curling our other APIs from those pods (simple REST APIs running Scala applications). The other APIs reach each other successfully (and are reached from other API gateway pods). When I get the 503 errors, the proxy of the API gateway pod logs the following:

WARN [1030275.972407s] linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline

It might have been a fluke, but when I added Host: example.com to the curl request that was originally getting 503s, the request went through and I got 200s (and the proxy didn’t log the request-aborted line). I haven’t been able to test this any further, as the issue hasn’t happened since.
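For context, the difference was essentially the following (a sketch with a hypothetical upstream my-api in namespace my-ns; the real hostnames are not in the thread):

    # Originally returned 503, and the proxy logged the dispatch-deadline warning
    curl -sv http://my-api.my-ns.svc.cluster.local:8080/

    # With an explicit Host header, the same request returned 200
    curl -sv -H 'Host: example.com' http://my-api.my-ns.svc.cluster.local:8080/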

Some info about environment:

Running on AWS EKS with Kubernetes version 1.14. The workers have version v1.14.7-eks-1861c5 and are launched with kubelet args:

    --runtime-cgroups=/kube.slice
    --eviction-hard="memory.available<500Mi,nodefs.available<10%,nodefs.inodesFree<5%"
    --system-reserved="cpu=500m,memory=1Gi"
    --kube-reserved="cpu=1000m,memory=1Gi"
    --kube-reserved-cgroup=/kube.slice
    --system-reserved-cgroup=/system.slice
    --node-labels="kubernetes.io/lifecycle=spot,node-role.kubernetes.io/spot-worker=true,custom/worker-ready=true"

Linkerd 2.5.0 was installed with linkerd install | kubectl apply -f - and upgraded to 2.6.0 with linkerd upgrade --ha | kubectl apply --prune -l linkerd.io/control-plane-ns=linkerd -f -. If my memory serves me correctly, our test environment was simply installed with 2.6.0 (linkerd install --ha | kubectl apply -f -) and it happened there as well.

It doesn’t feel like it is related to traffic, since our test environment sees relatively low traffic.

The output of linkerd endpoints seems to align with kubectl get endpoints for the services.

I ran the script @cpretzer linked while the issue was happening, and everything looks okay!