linkerd2: Linkerd fails during node outage

Bug Report

What is the issue?

During a node outage on a node that was running some of the Linkerd components (“linkerd-destination” and “linkerd-identity”), it looks like Linkerd keeps sending traffic to pods on the failed node. Linkerd is installed in HA mode in my cluster.

How can it be reproduced?

  1. Install Linkerd in HA mode.
  2. Mesh two apps, where app A talks to app B (or vice versa).
  3. Stop the node where one of the Linkerd components (“linkerd-destination”) was running, to simulate the node outage that happened to us and led us to this bug. During this time the pod waits for 5 minutes, per the pod eviction timeout, and is rescheduled onto a new node after that. (A sketch of how to simulate this on AKS follows this list.)
  4. Randomly, one or two of the app pods will fail to make calls to the other app.
  5. This issue happens only when nodes are shut down ungracefully.
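
A minimal sketch of step 3 on AKS, assuming the node pool is an Azure VM scale set managed with the Azure CLI; the resource group, scale set, instance id, namespace, and workload names (`app-a`, `app-b`) are placeholders, not taken from this report, and the watch loop assumes `curl` is available in the app image.

```
# Placeholders: substitute your own resource group, scale set, and instance id.
# Deallocating the VM without draining it first simulates an ungraceful node
# outage; pods on that node are only rescheduled after the ~5 minute
# pod eviction timeout expires.
az vmss deallocate \
  --resource-group MC_my-rg_my-aks_westus2 \
  --name aks-nodepool1-00000000-vmss \
  --instance-ids 0

# While the node is down, watch whether calls from app A to app B start failing:
kubectl -n my-apps exec deploy/app-a -- \
  sh -c 'while true; do curl -s -o /dev/null -w "%{http_code}\n" http://app-b:80; sleep 1; done'
```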

linkerd-proxy logs

```
2020-06-18T20:48:30.798501272Z [ 14263.555872710s]  WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out
2020-06-18T20:48:31.317647919Z [ 14264.75155158s]  WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
2020-06-18T20:48:31.317687819Z [ 14264.75275658s]  WARN outbound:accept{peer.addr=172.21.14.116:54590}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
2020-06-18T20:48:32.322513181Z [ 14265.80191420s]  WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
```

```
2020-06-18T22:36:56.281636047Z [   110.160896474s]  WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:56.785530039Z [   110.664675566s]  WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:57.288315626Z [   111.167593054s]  WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:57.79501223Z [   111.674226057s]  WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
```
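
For anyone reproducing this, the logs above come from the outbound `linkerd-proxy` sidecar of the calling workload. Below is a hedged sketch of how to collect them and glance at the proxy's connection metrics; the namespace and deployment names are placeholders, while the `linkerd-proxy` container name and the admin port 4191 are the defaults for injected pods.

```
# Placeholders: adjust the namespace and deployment to the calling workload.
# Tail the sidecar proxy's logs:
kubectl -n commandproxy logs deploy/my-client -c linkerd-proxy --since=15m

# The proxy's admin endpoint (port 4191) exposes Prometheus metrics; the
# tcp_open_connections series can indicate whether connections to the dead
# endpoint are still being held open:
kubectl -n commandproxy port-forward deploy/my-client 4191:4191 &
curl -s http://localhost:4191/metrics | grep tcp_open_connections
```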

linkerd check output

kubernetes-api
--------------
√ can initialize the client                                                         
√ can query the Kubernetes API                                                      
                                                                                    
kubernetes-version                                                                  
------------------                                                                  
√ is running the minimum Kubernetes API version                                     
√ is running the minimum kubectl version                                            
                                                                                    
linkerd-existence                                                                   
-----------------                                                                   
√ 'linkerd-config' config map exists                                                
√ heartbeat ServiceAccount exist                                                    
√ control plane replica sets are ready                                              
√ no unschedulable pods                                                             
√ controller pod is running                                                         
√ can initialize the client                                                         
√ can query the control plane API                                                   
                                                                                    
linkerd-config                                                                      
--------------                                                                      
√ control plane Namespace exists                                                    
√ control plane ClusterRoles exist                                                  
√ control plane ClusterRoleBindings exist                                           
√ control plane ServiceAccounts exist                                               
√ control plane CustomResourceDefinitions exist                                     
√ control plane MutatingWebhookConfigurations exist                                 
√ control plane ValidatingWebhookConfigurations exist                               
√ control plane PodSecurityPolicies exist                                           
                                                                                    
linkerd-identity                                                                    
----------------                                                                    
√ certificate config is valid                                                       
√ trust roots are using supported crypto algorithm                                  
√ trust roots are within their validity period                                      
√ trust roots are valid for at least 60 days                                        
√ issuer cert is using supported crypto algorithm                                   
√ issuer cert is within its validity period                                         
√ issuer cert is valid for at least 60 days                                         
√ issuer cert is issued by the trust root                                           
                                                                                    
linkerd-api                                                                         
-----------                                                                         
√ control plane pods are ready                                                      
√ control plane self-check                                                          
√ [kubernetes] control plane can talk to Kubernetes                                 
√ [prometheus] control plane can talk to Prometheus                                 
√ tap api service is running                                                        
                                                                                    
linkerd-version                                                                     
---------------                                                                     
√ can determine the latest version                                                  
‼ cli is up-to-date                                                                 
    is running version 2.7.1 but the latest stable version is 2.8.1                 
    see https://linkerd.io/checks/#l5d-version-cli for hints                        
                                                                                    
control-plane-version                                                               
---------------------                                                               
‼ control plane is up-to-date                                                       
    is running version 2.7.1 but the latest stable version is 2.8.1                 
    see https://linkerd.io/checks/#l5d-version-control for hints                    
√ control plane and cli versions match                                              
                                                                                    
linkerd-ha-checks                                                                   
-----------------                                                                   
√ pod injection disabled on kube-system                                             
                                                                                    
Status check results are √ 
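
Since `linkerd check` above only covers the control plane, the data-plane proxies in the affected namespace can be checked as well; the namespace below is a placeholder.

```
# Check the injected proxies in a given namespace:
linkerd check --proxy --namespace commandproxy
```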

Environment

  • Kubernetes Version: 1.16.9
  • Cluster Environment: AKS
  • Linkerd version: stable-2.7.1

Possible solution

Additional context

There is a similar open GitHub issue; I am not 100 percent sure they are the same.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 23 (19 by maintainers)

Most upvoted comments

All of the proxy changes have merged, but we hit some issues while testing the required control plane changes (discussed in https://github.com/linkerd/linkerd2/issues/4912). I expect these changes to make it into next week’s edge release.

@jawhites We’re hoping to merge this in time for this week’s edge release.

Okay @ericsuhong, @Abrishges, in order to test this:

  1. Grab my control plane branch: https://github.com/linkerd/linkerd2/tree/zd/headless
  2. Run `bin/build-cli-bin`
  3. Run `bin/linkerd install --ha --proxy-version=cplb | kubectl apply -f -` (add whatever options you need for your use case)

All the images you need are pushed, so you will be good to go. Now you can manually fail selected nodes, etc., and verify that this fixes your problem. I myself had a server and a client application running and failed some of the nodes that had linkerd-identity running on them. Before this fix, I would observe what you are describing; the changes fix this problem for me. As always, the more logs the better 😃
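
For reference, a hedged sketch of that verification using standard kubectl queries; the `linkerd-proxy` container name and the `linkerd.io/control-plane-component` label are the ones Linkerd normally applies, and the `cplb` tag comes from the `--proxy-version` flag above.

```
# Confirm the control-plane pods picked up the branch's proxy image tag:
kubectl -n linkerd get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="linkerd-proxy")].image}{"\n"}{end}'

# Find which node linkerd-identity is running on, then fail that node
# (e.g. by deallocating its VM, as in the repro sketch earlier) and watch
# whether meshed traffic keeps flowing:
kubectl -n linkerd get pods -l linkerd.io/control-plane-component=identity -o wide
```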