metallb: Make ARP mode failover faster with virtual MACs

Is this a bug report or a feature request?:

Bug

What happened:

While writing the documentation for ARP mode, I realized that the way we’ve implemented the failover probably doesn’t work in all cases. The problem is that we are not using a single MAC address, so the flooding we use to try and update egress port maps on switches is useless.

As a reminder, our current behavior when a node is elected master is:

  1. Respond to ARP who-has requests for service-ip with the node’s MAC address
  2. Periodically flood ARP responses for service-ip with the node’s MAC address

The purpose of (1) is so that end hosts can resolve IP->MAC, to transmit the packets with the correct dst-mac+dst-ip. The purpose of (2) is to teach learning switches the MAC->egress-port mapping.

In a non-failover scenario, this works fine, and in fact in that scenario (2) is not needed because (1) indirectly teaches the switches in the path as well as the end-host.

In a failover scenario from node A to node B, this all falls apart. For new clients things work okay: they send an ARP request, receive a response that maps IP->node-B-MAC, and everything works fine. Existing clients have the old mapping of IP->node-A-MAC cached, so they keep transmitting to node A’s MAC address.

In theory, (2) is supposed to fix the second case. When node B becomes master, it floods a bunch of ARP responses that advertise a mapping of IP->node-B-MAC. End hosts ignore these (because they didn’t request that information). Switches listen to these messages, but their internal maps are MAC->egress-port, not IP->MAC or IP->egress-port. So they happily record that node-B-MAC lives on port 42… and this does nothing for the traffic that is still addressed to node-A-MAC.
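To make the failure concrete, here is a minimal Go sketch of a learning switch that only keeps a MAC->egress-port table. It is a toy model, not MetalLB code: the `learningSwitch` type, the MAC addresses, and port 41 for node A are all made up for illustration.

```go
package main

import "fmt"

// learningSwitch models the only state a learning switch keeps:
// source MAC -> port the frame arrived on. It knows nothing about IPs.
type learningSwitch struct {
	table map[string]int
}

// learn records that frames from srcMAC arrive on the given port.
func (s *learningSwitch) learn(srcMAC string, port int) {
	s.table[srcMAC] = port
}

// egress returns the port for dstMAC, or -1 if the switch would have to flood.
func (s *learningSwitch) egress(dstMAC string) int {
	if port, ok := s.table[dstMAC]; ok {
		return port
	}
	return -1
}

// egressAfterFailover replays a failover from node A to node B where each
// node announces the service IP with its own MAC, and returns the port that
// traffic from an existing client ends up on.
func egressAfterFailover(macA string, portA int, macB string, portB int) int {
	sw := &learningSwitch{table: map[string]int{}}

	// Before failover: node A answers ARP, so the switch learns A's MAC
	// and the client caches service-ip -> macA.
	sw.learn(macA, portA)
	cachedMAC := macA

	// Failover: node B floods ARP responses sourced from its own MAC.
	// The switch learns a new entry for macB; the macA entry is untouched,
	// and the client ignores the unsolicited responses.
	sw.learn(macB, portB)

	// The existing client keeps sending to its cached MAC.
	return sw.egress(cachedMAC)
}

func main() {
	port := egressAfterFailover("02:00:00:00:00:0a", 41, "02:00:00:00:00:0b", 42)
	fmt.Println("existing client traffic still leaves on port", port)
}
```

In this model the flood adds a correct but irrelevant entry for node B, and the stale client traffic keeps leaving on node A’s port.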

I made a mistake when I sketched out the design for ARP mode. Specifically, I forgot that VRRP defines a “virtual router MAC address”, and uses that MAC instead of the node’s MAC address for virtual router services.

So, when using VRRP, clients receive a mapping of IP->VRRP-MAC instead of IP->node-X-MAC, and switches learn a mapping of VRRP-MAC->port-42. When a failover happens, the new master floods ARP responses for VRRP-MAC, and this correctly updates the switch mapping to VRRP-MAC->port-50, and so all traffic, past and future, gets rerouted to the new master.
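As a rough sketch of why the shared MAC helps: the VRRP RFCs define the IPv4 virtual router MAC as 00-00-5E-00-01-{VRID}. The Go snippet below derives that address and replays the port 42 to port 50 failover against a bare MAC->port map. The VRID value 51 and the function names are assumptions for illustration, not anything MetalLB implements.

```go
package main

import (
	"fmt"
	"net"
)

// vrrpVirtualMAC returns the IPv4 virtual router MAC for a VRID,
// i.e. 00-00-5E-00-01-{VRID}.
func vrrpVirtualMAC(vrid uint8) net.HardwareAddr {
	return net.HardwareAddr{0x00, 0x00, 0x5e, 0x00, 0x01, vrid}
}

// egressAfterVMACFailover models a switch table (MAC -> port) across a
// failover where both masters use the same virtual MAC. It returns the port
// that traffic from a client with a cached IP -> virtual MAC entry ends up on.
func egressAfterVMACFailover(vrid uint8, oldPort, newPort int) int {
	vmac := vrrpVirtualMAC(vrid).String()
	table := map[string]int{}

	// The old master sends frames sourced from the virtual MAC, and the
	// client caches service-ip -> vmac.
	table[vmac] = oldPort
	cachedMAC := vmac

	// The new master floods frames sourced from the same virtual MAC, so the
	// switch overwrites the existing entry instead of adding an unrelated one.
	table[vmac] = newPort

	// The client's cache is unchanged, yet its traffic now follows the new port.
	return table[cachedMAC]
}

func main() {
	fmt.Println("virtual MAC:", vrrpVirtualMAC(51))
	fmt.Println("egress port after failover:", egressAfterVMACFailover(51, 42, 50))
}
```

Note that the switch learns from the Ethernet source address of the flooded frames, so the new master has to actually transmit from the virtual MAC, not just mention it in the ARP payload.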

What does this mean for MetalLB?

  1. The flooding behavior I insisted we implement is basically useless. It’s flooding irrelevant information to the switches, so we can just delete it 😦
  2. ARP mode failover right now is far from ideal. If you’re doing a graceful drain of a node, you should wait >60s after MetalLB leader failover before powering off the node. This should be long enough to let most clients refresh their ARP cache: Linux uses a 60s TTL, Windows uses 15-45s. I can’t find reliable documentation about macOS, but one place suggests it uses a 20 minute TTL, which is pretty ridiculous.
  3. We should investigate implementing virtual MAC address support. This is probably going to be quite complex, because we have to modify the configuration of the network interface (otherwise the NIC will drop packets going to the virtual MAC), so this could have some pretty bad side-effects like split-brain if we get it wrong 😦

For 0.3, it’s probably okay to just document the limitations of ARP mode for now, unless we find a really easy way to implement virtual MACs without much complication. The current ARP mode is still very useful in general, it just has some poor behaviors during (hopefully rare) failovers. @miekg, wdyt?

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 29 (24 by maintainers)

Most upvoted comments

We use keepalived VRRP, and failover has always been instantaneous in production and testing. We have used it for years.

Now, it appears the metallb code is sending gratuitous ARPs, but sending them as replies. This post seems to point out that kernels do not respect replies, only requests.
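For reference, the two flavors differ only in the opcode field of the ARP payload. Below is a small Go sketch that builds the 28-byte ARP body for either form. It is not the MetalLB code: the function names and the example MAC and IP are made up, and leaving the target hardware address zeroed is an assumption (implementations vary on what they put there).

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"net"
)

const (
	opRequest uint16 = 1 // ARP who-has
	opReply   uint16 = 2 // ARP is-at
)

// gratuitousARP builds the 28-byte ARP payload (no Ethernet header) that
// announces ip at mac. Sender and target protocol addresses are both ip,
// which is what makes the packet "gratuitous".
func gratuitousARP(op uint16, mac net.HardwareAddr, ip net.IP) ([]byte, error) {
	ip4 := ip.To4()
	if ip4 == nil || len(mac) != 6 {
		return nil, errors.New("need an IPv4 address and a 6-byte MAC")
	}
	b := make([]byte, 28)
	binary.BigEndian.PutUint16(b[0:2], 1)      // hardware type: Ethernet
	binary.BigEndian.PutUint16(b[2:4], 0x0800) // protocol type: IPv4
	b[4] = 6                                   // hardware address length
	b[5] = 4                                   // protocol address length
	binary.BigEndian.PutUint16(b[6:8], op)     // request or reply
	copy(b[8:14], mac)                         // sender hardware address
	copy(b[14:18], ip4)                        // sender protocol address
	// b[18:24], the target hardware address, is left zeroed (assumption).
	copy(b[24:28], ip4) // target protocol address, same as sender
	return b, nil
}

// exampleFrame builds a payload for a hypothetical node MAC and service IP.
func exampleFrame(op uint16) []byte {
	mac, _ := net.ParseMAC("02:00:00:00:00:0b")
	b, err := gratuitousARP(op, mac, net.ParseIP("192.0.2.10"))
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	fmt.Printf("request: % x\n", exampleFrame(opRequest))
	fmt.Printf("reply:   % x\n", exampleFrame(opReply))
}
```

If the kernel behavior described in the linked post holds, a speaker could send both forms on failover to cover clients that only honor one of them.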

https://ihrachyshka.com/2017/05/25/the-failure-part-3-diggin-the-kernel/

We do the same, but Linux boxes don’t accept gratuitous ARPs (for instance).

I don’t think that’s true. Keepalived can failover its VIP instantly, including for Linux clients.

Bingo! It uses macvlan.

Interestingly, by default keepalived does not use virtual MACs with VRRP, it does exactly what ARP speaker currently does, and just accepts that the failover will not be instant. You have to explicitly activate virtual MAC mode in the config. So maybe the current, simple behavior is actually fine for many cases. Should we just make virtual MAC an optional feature?

When you activate use_vmac, keepalived just does exactly what we said here: create macvlan, set virtual MAC addr, set interface up. Keepalived doesn’t mess with NOARP, and just lets the kernel deal with ARP (except for the flooding on failover, that’s manual). I still think there’s probably a benefit in our case if we use NOARP, imho.
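Those three steps, plus the optional NOARP tweak, could be sketched as a dry-run plan of iproute2 commands. This Go snippet only builds and prints the command lines, it does not run them. The interface names, the example MAC, and `mode private` are assumptions for illustration, not something taken from keepalived or MetalLB.

```go
package main

import (
	"fmt"
	"strings"
)

// vmacPlan returns the iproute2 commands for the steps described above:
// create a macvlan on parent, give it the virtual MAC, optionally disable
// kernel ARP on it, and bring it up. Nothing is executed here.
func vmacPlan(parent, name, vmac string, noARP bool) [][]string {
	plan := [][]string{
		// 1. Create the macvlan interface (the mode is an assumption).
		{"ip", "link", "add", "link", parent, "name", name, "type", "macvlan", "mode", "private"},
		// 2. Set the virtual MAC address.
		{"ip", "link", "set", "dev", name, "address", vmac},
	}
	if noARP {
		// Optional: set NOARP so the kernel does not answer ARP on this interface.
		plan = append(plan, []string{"ip", "link", "set", "dev", name, "arp", "off"})
	}
	// 3. Bring the interface up.
	plan = append(plan, []string{"ip", "link", "set", "dev", name, "up"})
	return plan
}

func main() {
	for _, cmd := range vmacPlan("eth0", "vmac0", "00:00:5e:00:01:33", true) {
		fmt.Println(strings.Join(cmd, " "))
	}
}
```

Keeping the plan as data makes it easy to log or review before anything touches the node’s network configuration, which matters given the side-effect risk of getting this wrong.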

Other concern with macvlan: depending on the network addon in use on the cluster, it’s possible the network interfaces are weird, and we cannot guarantee that we can correctly create macvlan interfaces. I don’t know what we can do about that, except maybe say “we’ve tested with calico and flannel, and that covers 99% of all k8s clusters”…

I wonder what they mean by “vmac”, that’s not a Linux network stack term that I know. It’s possible they literally just mean “virtual MAC address”, but I wonder how they “create” that in the kernel.

I’m setting up keepalived in a toy container now to see what it looks like when VRRP is running.