nebula: Unable to achieve 10 Gbit/s throughput on Hetzner server
I’m benchmarking Nebula on storage servers from the dedicated server provider Hetzner, where 10 Gbit/s links are cheap.
Unless you ask them to connect your servers through a dedicated switch, the MTU cannot be changed, so jumbo frames are not possible.
In this setup, I have not been able to achieve more than 2 Gbit/s with iperf3 over Nebula, no matter how I tune `read_buffer`/`write_buffer`/`batch`/`routines`.
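For reference, these are the knobs I have been adjusting; a sketch of the relevant part of my `config.yml`, with illustrative values only (the option names are Nebula's, the numbers are just examples I tried):

```yml
# Excerpt of a Nebula config.yml -- values are examples, not a recommendation.
listen:
  host: 0.0.0.0
  port: 4242
  # Max number of packets to read from / write to the kernel per syscall.
  batch: 64
  # UDP socket buffer sizes; the kernel caps these at
  # net.core.rmem_max / net.core.wmem_max unless those are raised too.
  read_buffer: 10485760
  write_buffer: 10485760

# Number of worker routines reading from the tun device and UDP socket.
routines: 2
```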
In https://theorangeone.net/posts/nebula-intro/ it was said:

> Slack have seen Nebula networks fully saturate 5 / 10 gigabit links without sweat

and on https://i.reddit.com/r/networking/comments/iksyuu/overlay_network_mesh_options_nebula_wireguard/:

> Slack regularly does many gigabits per second over nebula on individual hosts.

but that’s evidently not the case for me.
Did all those setups use jumbo frames?
Is there anything that can be done to achieve 10 Gbit/s throughput without jumbo frames?
The content of this comment is the most telling for me: https://github.com/slackhq/nebula/issues/637#issuecomment-1086643211
> When you are testing your underlay network with multiple flows directly (5 in that run) you see a maximum throughput of about 9.5 Gbit/s; a single flow gets about 4 Gbit/s. When you run with nebula you see nearly the same throughput as the single-flow underlay network test, at 3.5 Gbit/s.
>
> Nebula will (currently) only be 1 flow on the underlay network between two hosts. The throughput limitation is likely to be anything between and/or including the two NICs in the network, since it looks like you have already ruled out CPU on the host directly.
>
> The folks at Slack have run into similar situations with AWS and this PR may be of interest to you: https://github.com/slackhq/nebula/pull/768
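For anyone reproducing the comparison described above, a sketch of that kind of test (the IPs are placeholders):

```sh
# On the server under test:
iperf3 -s

# Underlay: one flow, then 5 parallel flows (-P 5).
iperf3 -c 10.0.0.2
iperf3 -c 10.0.0.2 -P 5

# Overlay: even with -P 5, all streams ride the same nebula tunnel,
# which is a single UDP flow (one 4-tuple) on the underlay.
iperf3 -c 192.168.100.2 -P 5
```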
https://github.com/slackhq/nebula/issues/637#issuecomment-1086671441
> I do not see the output for `ss -numpile` but I do see the output for the system-wide drop counters. It looks like you are doing a number of performance tests using UDP on the overlay and it is very possible the `nettcp` or `iperf3` UDP buffers are overflowing while `nebula` buffers are not. `ss -numpile` will output the kernel `skmem` struct per socket for all sockets on the system. I usually do `sudo ss -numpile | grep -A1 nebula` to ensure I am only looking at nebula sockets when tuning (`-A1` is assuming you are configured to run with a single routine).
>
> Overall, the “many gigabits per second” relates to exactly what @nbrownus mentions above. This cited number is in aggregate.
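To make that concrete, here is roughly what checking the nebula socket might look like (the field layout is `ss`'s `skmem` struct; the output line and numbers below are invented for illustration):

```sh
$ sudo ss -numpile | grep -A1 nebula
UNCONN 0 0  0.0.0.0:4242  0.0.0.0:*  users:(("nebula",pid=1234,fd=7)) ...
    skmem:(r0,rb10485760,t0,tb10485760,f0,w0,o0,bl0,d57)
# rb / tb  receive / send buffer limits (should match read_buffer/write_buffer)
# r / t    bytes currently queued in the receive / send buffer
# d        packets dropped because the receive buffer was full; if this
#          counter grows during a test, the nebula socket itself is overflowing
```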
At Slack, we didn’t encounter workloads that have single-path host-to-host tunnels trying to do 10 Gbit/s but with a small-ish MTU. Nebula allows you to configure MTUs for different network segments, and Slack uses this internally across production. I do understand that in your case Hetzner does not allow a higher MTU, which contributes to this bottleneck.
More broadly, Nebula’s default division of work is per-tunnel. If you have 4+ hosts talking to a single host over Nebula and you turn on multi-routine processing, Nebula will quickly match the maximum line rate of a single 10 Gbit/s interface.
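For readers whose underlay does allow larger frames on some segments, a sketch of how per-segment MTUs are expressed in config (the `tun.routes` MTU override follows Nebula's example config; the subnets and values here are placeholders):

```yml
tun:
  # Default MTU for the overlay device; safe for most underlay paths.
  mtu: 1300
  # Route-based MTU overrides: use a larger MTU toward overlay subnets
  # whose underlay path supports jumbo frames.
  routes:
    - mtu: 8800
      route: 10.144.0.0/16

# Enable multi-routine packet processing.
routines: 2
```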
In the case of Ceph, are you often sending many Gbit/s between individual hosts?
We are certainly open to enhancing this if more people ask for a bump when using individual tunnels with small MTUs. We will also be sharing our research here in a future blog post for people to validate, which will include tips for optimizing performance.