linkerd2: Problems with HA on lost node
Bug Report
What is the issue?
It seems that HA doesn’t work properly when a k8s node is lost (out of memory, high load, kernel panic, etc.). The issue affects the “destination” and “proxy injector” components, but could apply to others.
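For context, this is a standard HA install of the control plane; a minimal sketch of the setup, assuming the stock --ha flag (which gives three replicas of each component, matching the pod listing further down):

# HA install of the control plane, then a sanity check
linkerd install --ha | kubectl apply -f -
linkerd check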
How can it be reproduced?
1 - Check which node a replica of the “proxy injector” component is running on and shut that node down (a simple poweroff is enough). The pod will wait as long as “pod-eviction-timeout” states (default: 5m) before it gets tagged as unknown and rescheduled.
2 - During this time, the pod is marked “Ready: False” by k8s and the endpoints of the service get updated, but if you delete a meshed pod, it will not be rescheduled (because the default of linkerd-proxy-injector-webhook-config is “failurePolicy: Fail”). If you switch this value to “Ignore”, the pod will be scheduled but unmeshed. See the commands below.
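For reference, these are roughly the commands behind step 1 and the failure-policy switch; the label selector and webhook name are taken from my 2.6 install, so adjust them if yours differ:

# 1 - Find which node each proxy-injector replica is running on, then power that node off
kubectl -n linkerd get pods -l linkerd.io/control-plane-component=proxy-injector -o wide

# 2 - While the node is down, check the webhook's current failure policy (Fail by default)
kubectl get mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config \
  -o jsonpath='{.webhooks[0].failurePolicy}'

# Optionally switch it to Ignore: new pods then get scheduled, but they come up unmeshed
kubectl patch mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config \
  --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'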
Logs, error output, etc
[linkerd-identity-6758697995-m2w2q linkerd-proxy] WARN [ 2477.529728s] linkerd-dst.linkerd.svc.k8s-predev.bee-labs.net:8086 linkerd2_reconnect::service Service failed error=channel closed

the-deleted-replica-74558f4547-rdzbl       0/1   Running   0   37s   10.233.113.238   sucv-k8s-worker04.predev.bee-labs.net   <none>
the-not-deleted-replica-74558f4547-sljr7   2/2   Running   0   13m   10.233.94.198    sucv-k8s-worker05.predev.bee-labs.net   <none>
linkerd check output
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
× can initialize the client
error upgrading connection: error dialing backend: dial tcp 10.0.4.142:10250: connect: no route to host
see https://linkerd.io/checks/#l5d-existence-client for hints
Status check results are ×
It is usually all green, but this happens because the node it is trying to reach is the one I shut down. After I boot it back up:
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
× [prometheus] control plane can talk to Prometheus
Error calling Prometheus from the control plane: server_error: server error: 503
see https://linkerd.io/checks/#l5d-api-control-api for hints
Status check results are ×
But all pods are up:
NAME READY STATUS RESTARTS AGE
linkerd-controller-76cfbf4bd4-dqhqn 3/3 Running 0 30m
linkerd-controller-76cfbf4bd4-rm6bg 3/3 Running 0 10m
linkerd-controller-76cfbf4bd4-sv8tj 3/3 Running 0 56m
linkerd-destination-5b986b988d-dpn4d 2/2 Running 0 56m
linkerd-destination-5b986b988d-krlrl 2/2 Running 0 30m
linkerd-destination-5b986b988d-sp8lk 2/2 Running 0 10m
linkerd-grafana-7b8d5d7b7d-4c9rz 2/2 Running 0 30m
linkerd-identity-6758697995-5vvtt 2/2 Running 0 10m
linkerd-identity-6758697995-m2w2q 2/2 Running 0 56m
linkerd-identity-6758697995-w2qlm 2/2 Running 0 30m
linkerd-prometheus-6c6b987b49-f5sp9 2/2 Running 0 10m
linkerd-proxy-injector-78d78864bb-24wvq 2/2 Running 0 30m
linkerd-proxy-injector-78d78864bb-bhmv2 2/2 Running 0 56m
linkerd-proxy-injector-78d78864bb-xgx5t 2/2 Running 0 10m
linkerd-sp-validator-74b497d499-4pwnd 2/2 Running 0 10m
linkerd-sp-validator-74b497d499-d9n4k 2/2 Running 0 30m
linkerd-sp-validator-74b497d499-rm9g4 2/2 Running 0 56m
linkerd-tap-5bffd9c666-bsmfc 2/2 Running 0 56m
linkerd-tap-5bffd9c666-fqfvh 2/2 Running 0 30m
linkerd-tap-5bffd9c666-r958k 2/2 Running 0 10m
linkerd-web-56578968f-zclt7 2/2 Running 0 10m
Environment
- Kubernetes Version: 1.12.3
- Cluster Environment: kubespray on vmware VMs
- Host OS: ubuntu 16.04
- Linkerd version: 2.6
Possible solution
Additional context
- We have problems when nodes crash and are trying to sort that out
- We also have problems with “destination”, but it is harder to reproduce than the proxy-injector issue, so I just hope the problems have a common source =)
- We do not have the problem if we scale the replicas down to 0 or if we delete pods, only when a node is lost (rough comparison below).
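A rough way to compare the two cases (namespace and pod names below are placeholders):

# Graceful paths, where the problem does not show up: scale the injector down/up or delete its pods
kubectl -n linkerd scale deploy/linkerd-proxy-injector --replicas=0
kubectl -n linkerd scale deploy/linkerd-proxy-injector --replicas=3
kubectl -n linkerd delete pod -l linkerd.io/control-plane-component=proxy-injector

# Lost-node path, where it reproduces: power off the node hosting a replica, then,
# before pod-eviction-timeout (default 5m) expires, delete a meshed pod and watch for a replacement
kubectl -n <meshed-namespace> delete pod <meshed-pod>
kubectl -n <meshed-namespace> get pods -w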
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 20 (12 by maintainers)
The proxy injector issue here seems to be due to the k8s bug #80313 I mentioned. I’ve been able to try on 1.17 and couldn’t reproduce the problem. I still have an issue, however, and will open a separate issue for it specifically.