linkerd2: Problems with HA on lost node
Bug Report
What is the issue?
It seems that HA doesn’t work properly when a k8s node is lost (out of memory, high load, kernel panic, etc.). The issue affects the “destination” and “proxy injector” components, but could apply to others.
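For context, this is a standard HA install of the control plane; a minimal sketch of the setup, assuming the stock --ha flag (which gives three replicas of each component, matching the pod listing further down):

# HA install of the control plane, then a sanity check
linkerd install --ha | kubectl apply -f -
linkerd check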
How can it be reproduced?
1 - Check which node a replica of the “proxy injector” component is running on and shut that node down (a simple poweroff is enough). The pod will wait as long as “pod-eviction-timeout” states (default: 5m) before it gets tagged as unknown and rescheduled.
2 - During this time, the pod is marked “Ready: False” by k8s and the endpoints of the service get updated, but if you delete a meshed pod, it will not be rescheduled (because the default of linkerd-proxy-injector-webhook-config is “failurePolicy: Fail”). If you switch this value to “Ignore”, the pod will be scheduled but unmeshed. See the commands below.
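For reference, these are roughly the commands behind step 1 and the failure-policy switch; the label selector and webhook name are taken from my 2.6 install, so adjust them if yours differ:

# 1 - Find which node each proxy-injector replica is running on, then power that node off
kubectl -n linkerd get pods -l linkerd.io/control-plane-component=proxy-injector -o wide

# 2 - While the node is down, check the webhook's current failure policy (Fail by default)
kubectl get mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config \
  -o jsonpath='{.webhooks[0].failurePolicy}'

# Optionally switch it to Ignore: new pods then get scheduled, but they come up unmeshed
kubectl patch mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config \
  --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'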
Logs, error output, etc
[linkerd-identity-6758697995-m2w2q linkerd-proxy] WARN [ 2477.529728s] linkerd-dst.linkerd.svc.k8s-predev.bee-labs.net:8086 linkerd2_reconnect::service Service failed error=channel closed

the-deleted-replica-74558f4547-rdzbl       0/1   Running   0   37s   10.233.113.238   sucv-k8s-worker04.predev.bee-labs.net   <none>
the-not-deleted-replica-74558f4547-sljr7   2/2   Running   0   13m   10.233.94.198    sucv-k8s-worker05.predev.bee-labs.net   <none>
linkerd check output
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
× can initialize the client
error upgrading connection: error dialing backend: dial tcp 10.0.4.142:10250: connect: no route to host
see https://linkerd.io/checks/#l5d-existence-client for hints
Status check results are ×
It is usually all green, but this happens because the node it is trying to reach is the one I shut down. After I boot it back up:
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
× [prometheus] control plane can talk to Prometheus
Error calling Prometheus from the control plane: server_error: server error: 503
see https://linkerd.io/checks/#l5d-api-control-api for hints
Status check results are ×
But all pods are up:
NAME READY STATUS RESTARTS AGE
linkerd-controller-76cfbf4bd4-dqhqn 3/3 Running 0 30m
linkerd-controller-76cfbf4bd4-rm6bg 3/3 Running 0 10m
linkerd-controller-76cfbf4bd4-sv8tj 3/3 Running 0 56m
linkerd-destination-5b986b988d-dpn4d 2/2 Running 0 56m
linkerd-destination-5b986b988d-krlrl 2/2 Running 0 30m
linkerd-destination-5b986b988d-sp8lk 2/2 Running 0 10m
linkerd-grafana-7b8d5d7b7d-4c9rz 2/2 Running 0 30m
linkerd-identity-6758697995-5vvtt 2/2 Running 0 10m
linkerd-identity-6758697995-m2w2q 2/2 Running 0 56m
linkerd-identity-6758697995-w2qlm 2/2 Running 0 30m
linkerd-prometheus-6c6b987b49-f5sp9 2/2 Running 0 10m
linkerd-proxy-injector-78d78864bb-24wvq 2/2 Running 0 30m
linkerd-proxy-injector-78d78864bb-bhmv2 2/2 Running 0 56m
linkerd-proxy-injector-78d78864bb-xgx5t 2/2 Running 0 10m
linkerd-sp-validator-74b497d499-4pwnd 2/2 Running 0 10m
linkerd-sp-validator-74b497d499-d9n4k 2/2 Running 0 30m
linkerd-sp-validator-74b497d499-rm9g4 2/2 Running 0 56m
linkerd-tap-5bffd9c666-bsmfc 2/2 Running 0 56m
linkerd-tap-5bffd9c666-fqfvh 2/2 Running 0 30m
linkerd-tap-5bffd9c666-r958k 2/2 Running 0 10m
linkerd-web-56578968f-zclt7 2/2 Running 0 10m
Environment
- Kubernetes Version: 1.12.3
- Cluster Environment: kubespray on vmware VMs
- Host OS: ubuntu 16.04
- Linkerd version: 2.6
Possible solution
Additional context
- We have problems when nodes crash and are trying to sort that out
- We also have problems with “destination”, but it is harder to reproduce than the proxy-injector issue, so I just hope the problems have a common source =)
- We do not have the problem if we scale the replicas down to 0 or if we delete pods, only when a node is lost (rough comparison below).
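A rough way to compare the two cases (namespace and pod names below are placeholders):

# Graceful paths, where the problem does not show up: scale the injector down/up or delete its pods
kubectl -n linkerd scale deploy/linkerd-proxy-injector --replicas=0
kubectl -n linkerd scale deploy/linkerd-proxy-injector --replicas=3
kubectl -n linkerd delete pod -l linkerd.io/control-plane-component=proxy-injector

# Lost-node path, where it reproduces: power off the node hosting a replica, then,
# before pod-eviction-timeout (default 5m) expires, delete a meshed pod and watch for a replacement
kubectl -n <meshed-namespace> delete pod <meshed-pod>
kubectl -n <meshed-namespace> get pods -w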
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 20 (12 by maintainers)
The proxy injector issue here seems to be due to the k8s bug #80313 I mentioned. I’ve been able to try on 1.17 and couldn’t reproduce the problem. I still have an issue, however, and will open a separate issue for it specifically.