metallb: Failover time very high in layer2 mode
Is this a bug report or a feature request?:
Both, probably.
What happened:
I tested layer 2 mode and simulated node failure by shutting down the node that the load balancer IP was on. It then took approx. 5 minutes for the IP to be switched to the other node in the cluster (1 master, 2 nodes). After much experimentation I came to the conclusion that the node going down and becoming “NotReady” did not trigger the switch of the IP address. The 5 minute delay seems to be caused by Kubernetes’ default pod eviction timeout, which is 5 minutes, i.e. it takes 5 minutes for pods on an unavailable node to be deleted. The default node monitor grace period is 40 seconds, btw. So with the default configuration it currently takes almost 6 minutes for an IP address to be switched.
I made things a lot better by decreasing both settings in /etc/kubernetes/manifests/kube-controller-manager.yaml like this:
- --pod-eviction-timeout=20s
- --node-monitor-grace-period=20s
This makes MetalLB switch the IP in case of node failure in the sub-minute range.
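For reference, this is roughly where those flags end up in the kubeadm static pod manifest (surrounding fields abbreviated; the 20s values are just what I tested with, not a recommendation):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (abbreviated)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    # ... existing flags ...
    - --pod-eviction-timeout=20s        # default: 5m0s
    - --node-monitor-grace-period=20s   # default: 40s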
What you expected to happen:
To be honest, I would expect the whole process to take at most maybe 5 seconds.
How to reproduce it (as minimally and precisely as possible):
Create a Kubernetes 1.11.1 cluster with kubeadm (single master, two nodes). Calico networking.
kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.7.2/manifests/metallb.yaml
kubectl apply -f metallb-cfg.yml
kubectl apply -f tutorial-2.yaml
➜ metallb-test cat metallb-cfg.yml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.115.195.206-10.115.195.208
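tutorial-2.yaml is the nginx example manifest from the MetalLB tutorial; I won't paste it verbatim, but a minimal equivalent (names and image tag are illustrative) looks roughly like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.15
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer   # MetalLB assigns an address from the layer2 pool above
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80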
Then
watch curl --connect-timeout 1 http://10.115.195.206
to see if the nginx app is reachable.
Then
kubectl logs -f --namespace metallb-system speaker-xxxxxxxxx
to see which node currently has the IP address assigned. Then ssh into that machine and run “poweroff”.
Measure how long it takes until the “watch curl” is successful again.
Anything else we need to know?:
Environment:
- MetalLB version: v0.7.2
- Kubernetes version: v1.11.1
- BGP router type/version: N/A
- OS (e.g. from /etc/os-release): CentOS 7
- Kernel (e.g. uname -a): Linux cp-k8s-ghdev02-node-01.ewslab.eos.lcl 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
About this issue
- State: closed
- Created 6 years ago
- Reactions: 4
- Comments: 44 (20 by maintainers)
Commits related to this issue
- [WIP] Use hashicorp/memberlist to speedup dead node detection By default MemberList is disabled This fixes #298 TODO: - fixup MemberList logs {"caller":"announcer.go:112","event":"createNDPResponde... — committed to champtar/metallb by champtar 4 years ago
- Use hashicorp/memberlist to speedup dead node detection By default MemberList is disabled, so new behaviour is opt-in on upgrade This fixes #298 Signed-off-by: Etienne Champetier <echampetier@anevi... — committed to champtar/metallb by champtar 4 years ago
- Use hashicorp/memberlist to speedup dead node detection By default MemberList is disabled, so new behaviour is opt-in on upgrade This fixes #298 Signed-off-by: Etienne Champetier <echampetier@anevi... — committed to metallb/metallb by champtar 4 years ago
This is definitely a bug. In MetalLB 0.6, failover time was at max 10 seconds, because we had explicit leadership elections and so we could detect node failure much faster. In 0.7, for scaling, I switched to trusting k8s to tell us the state of the cluster. Unfortunately, I assumed that it was way better than it actually is 😦
So, failover time for layer2 is definitely unacceptable in 0.7. Node failures should recover in seconds, not minutes. Now, how can we fix that?..
The obvious choice is to reintroduce leader election. But, to make 0.7's features work, the leader election now has to be per-service. That means k8s control plane traffic that grows linearly with the number of LB services, more CPU consumption to manage the leadership state, and probably a less even distribution of services across nodes, because the outcome of each per-service leadership race is harder to control.
For reference, if we set the leader election timeout to 10s, that means we should ping the object every ~5s to renew our lease, so that’s 0.2qps of k8s control plane traffic for every LoadBalancer service in the cluster. That can get huge pretty quickly 😕
Another option would be to maintain a separate "healthy speakers" object of some kind, where each speaker periodically records its liveness. Each speaker could then still make stateless leadership decisions, just filtering the list of candidate pods based on which speakers are alive. This is still leader election, but now the qps cost is O(nodes) instead of O(services). I don't think this works either, though: many clusters have way more nodes than LB services.
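To make that concrete, here is a rough sketch of what such a shared liveness object could look like (purely hypothetical, not an existing MetalLB resource): each speaker periodically writes its own entry, and peers with stale timestamps are treated as dead.

# hypothetical "healthy speakers" object, one per cluster
apiVersion: v1
kind: ConfigMap
metadata:
  name: speaker-liveness
  namespace: metallb-system
data:
  node-01: "2018-08-12T17:30:05Z"   # last heartbeat from the speaker on node-01
  node-02: "2018-08-12T17:30:07Z"
  node-03: "2018-08-12T17:28:11Z"   # stale, so excluded from leadership decisions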
Either way, we need to resolve this, because 0.7 has made the failover time of layer2 mode unacceptable.
Dropping by to keep this thread alive. Any progress on this?
I have great interest in using metallb in our bare-metal k8s, but the failover time is the only requirement keeping us on keepalived.
Would it make sense to implement a highly available controller with leader election independent of the k8s API? In my head, making metallb depend on k8s to check for service readiness does not amount to a robust load balancing solution. The assumption there is that there is no need to expose a public IP if the service does not have any ready endpoints.
In my experience, that is mostly an edge case, where your entire deployment of pods suddenly becomes unavailable. For the everyday load balancing requirement, the service will always have a number of pods available while some nodes/endpoints come and go randomly.
As I understand it, metallb should work independently and rely mostly on node availability. Service endpoints should be a secondary concern. The idea of a virtual IP floating around cluster nodes comes from the need to keep this IP highly available and resilient to node failure.
@jenciso but having to add tolerations to all of your deployments can be very tedious, unless you have Istio deployed, which seems to support manipulating tolerations since https://github.com/istio/istio/pull/13044. In any case, it is a little bit impractical for small customers.
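For context, this is roughly the kind of toleration that would have to be patched into every Deployment's pod template so that pods get evicted from an unreachable node faster than the default 5 minutes (the 10s value is illustrative, and this relies on taint-based evictions being enabled):

spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 10   # evict this pod 10s after its node is tainted unreachable
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 10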
@danderson any plans on getting this fixed for ARP/L2? Having to switch to BGP is not an option for us right now, mostly because our Calico CNI already peers with the upstream router. I tried an experiment which consists of peering MetalLB with Calico and Calico with the upstream routers, but this is not fully supported yet.
Oh, and one more idea: change the architecture of MetalLB, and make the controller responsible for selecting an announcing node for each service, instead of making the speakers do it. The controller has a more global view of the world, and it could healthcheck the speakers explicitly to help it make decisions. As a bonus, it would make it much easier to have a status page on the controller that shows the full state of the cluster… Possibly. Need to think about that, it’s a huge change to MetalLB.
Ok, I reran the test. Here are the results.
I started the script, set up fetching of the speaker logs, and then shut down the node that the load balancer IP pointed to.
Results of the HTTP test script
➜ metallb-test python -c "print 1534092607.76 - 1534092297.97"
309.789999962
ca. 300 seconds => 5 minutes
I also have the speaker logs, but the timestamps are off compared to the test script (UTC vs. local time).
The node was marked unhealthy within the expected time, but failover still took 5 minutes.
Hello,
I don't think 5s is currently possible. The documentation at https://metallb.universe.tf/concepts/layer2/ notes that failover after an unplanned node outage currently takes about 10 seconds.
So I think 10s is currently the minimum amount of time that could be expected for the failover to happen. I was able to achieve 10~11s by also tuning the controller-manager's --node-monitor-grace-period.
As failover time was critical for me, I ended up using easy-keepalived from @juliohm1978. I was already familiar with keepalived, as we use it internally for a bunch of other services (non-k8s managed/related services).
@andrelop
I developed easy-keepalived as a prototype a few months back.
My team adopted it as a baseline. Internally, we evolved it into a python implementation and a more recent keepalived version. The only drawback to consider in terms of security is that, to provide public external IPs to users outside a Kubernetes cluster, you will need to run the deployment using the host network.
The idea behind easy-keepalived is that you can use a number of nodes in your cluster to keep a group of virtual IPs highly available to the outside world. It uses a simplified yaml file to configure a keepalived cluster for full failover and load balancing. We use that with nginx-ingress-controller, also running bound to the host network.
Feel free to fork it and use it as a starting point.
@johnarok
That is the experience I had with metallb. In my case, even a failover of 5 seconds would not be acceptable. Our current keepalived setup provides that in less than a second.
The failover delay is the main reason we haven’t switched to metallb.
@burnechr In my test case I used a single-replica nginx deployment that was exposed through MetalLB. Does that answer your question?
MetalLB 0.7 uses endpoint readiness, which is based on pod readiness. The Kubernetes node lifecycle controller marks pods as no longer ready when their node becomes NotReady, so waiting for the pod eviction timeout (default 5 minutes) should not be necessary. I am not sure why that did not work here. @ghaering maybe you can try increasing the logging verbosity of the kube-controller-manager to see how it reacts to the node becoming NotReady.
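For anyone digging into this, bumping the verbosity is just another flag in the same static pod manifest that was edited earlier; the value 4 is an arbitrary choice:

# /etc/kubernetes/manifests/kube-controller-manager.yaml (fragment)
spec:
  containers:
  - command:
    - kube-controller-manager
    # ... existing flags ...
    - --v=4   # higher log verbosity; node lifecycle controller decisions show up in the logs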
I think the MetalLB behavior makes sense; it matches how kube-proxy works. MetalLB doing faster IP failover does not help much if the service endpoints are partly dead but still in the load-balancing pool.
If 40 seconds is too high, it's possible to tweak the node monitor periods. That increases the load on the API server and etcd, but it's feasible. There is ongoing work to make node heartbeats cheaper.