str0m: Slow ICE connection flow compared to libwebrtc
It’s not exactly an issue, but I want to start a discussion to find ways we can speed up the connection when there are multiple ICE candidates (i.e. `ice_lite` disabled). This primarily benefits the p2p use case, but as previously mentioned I’d like to improve the ICE agent in str0m, and we can start here, because this is critical for our app.
Context: In a p2p application we loop over network interfaces, add each one as a `host` candidate, and then start adding `srflx` and `relay` candidates. str0m connects instantly if the `host` candidate works. But over a network, each added candidate that doesn’t work seems to add delay to the connection. The delay is very noticeable when the ICE agent needs to go through 4-5 candidate pairs to connect.
In my unscientific tests, I manually started a call via 2 libwebrtc peers and 2 str0m peers with the same STUN and TURN servers configured. str0m took 5x the time libwebrtc took to connect.
What do you think the issue could be? Are we going over candidates sequentially?
For anybody following along, the issue turned out to be a combination of:
- `poll_timeout` after changing the agent’s state (i.e. adding a candidate)
- `handle_timeout`: https://github.com/algesten/str0m/pull/477

With both of these fixed, I am getting similar results as in #476: str0m needs about 350ms from changing the state to `Checking` until the first pair is nominated. This is to a server in the US from Australia, so with better latency I’d assume it is even less.
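For readers following along, both items above sit in str0m’s sans-IO timeout handling. A minimal sketch of the application-side drive loop, as I understand the `poll_output`/`handle_input` API (details may differ), shows where they bite: the deadline returned by the library has to be re-queried after anything that changes the agent’s state, and a timeout input has to actually be fed back in when it expires.

```rust
use std::time::Instant;
use str0m::{Input, Output, Rtc, RtcError};

// If the loop keeps sleeping on a stale deadline (instead of re-polling
// after e.g. adding a candidate), or never feeds Input::Timeout back in,
// ICE checks stall much like described in this issue.
fn drive(rtc: &mut Rtc) -> Result<(), RtcError> {
    loop {
        match rtc.poll_output()? {
            Output::Timeout(_deadline) => {
                // Sleep until `_deadline` OR until new input arrives
                // (a packet, a new candidate, ...), then report the time:
                rtc.handle_input(Input::Timeout(Instant::now()))?;
            }
            Output::Transmit(_t) => {
                // Write the datagram to the right socket here.
            }
            Output::Event(_e) => {
                // Surface the event (e.g. IceConnectionStateChange) to the app.
            }
        }
    }
}
```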
This would mean both sides effectively have the same IP address? Could that be generalised to “same IP” regardless of the type of candidate?
I’m probably missing something, but… our standard use case for an SFU is a server with a public IP and clients behind NAT, firewalls, etc. Wouldn’t host <> relay be the most likely pair then? It’s quite different to peer-to-peer.
Or taking a step back, why would removing any pairs be an advantage? Less noise?
Sure. Let’s discuss possible strategies on Zulip.
A few more thoughts:
They are strictly use-once, or you’re opening up a security hole. Hm. I see it’s `Clone`. That’s no good. I’ll fix that now.

Nice finds!
Let’s double check this against libWebRTC. I don’t think there’s a problem lowering it, but that also means more STUN packets being sent in a short time.
This could potentially be the certificate generation, `DtlsCert::new()`. This is why @xnorpx made `RtcConfig::set_dtls_cert()`, so that new certificates can be made ahead of time, or in another thread at the same time as starting the STUN negotiation.
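As an illustration of the suggestion above (a sketch only; `DtlsCert::new()` and `set_dtls_cert()` are the names given in this thread, and the import path is an assumption that may differ by version):

```rust
use str0m::config::DtlsCert; // import path is an assumption
use str0m::Rtc;

// Pre-generate the DTLS certificate (at startup, or on a background thread)
// so building the Rtc doesn't pay for key generation on the connection's
// critical path. Elsewhere, ahead of time: `let cert = DtlsCert::new();`
fn build_rtc_with_pregenerated_cert(cert: DtlsCert) -> Rtc {
    Rtc::builder()
        .set_dtls_cert(cert)
        .build()
}
```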
@algesten excited to work on debugging this, thanks for the info!
Thanks for raising this.
I don’t think there’s anything deliberately slowing things down. I think all pairs are tested at the same time.
To explain what’s happening: as you know, you add ICE candidates, both local and remote.
Local candidates are combined with remote candidates into pairs. Pairs are considered differently good; a direct host-host connection is better than going through TURN servers.
Once a pair is formed, we start making STUN requests with that pair as sender/receiver.
If a STUN request goes through and we receive an answer, the pair is a success. We nominate the pair as the active one.
The highest-priority successful pair “wins”.
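To make “differently good” concrete: ICE assigns each candidate a priority and combines the two sides into a pair priority (RFC 8445 §6.1.2.3), which is why host<->host outranks anything involving a relay. Shown here for reference only; str0m’s own computation lives around `pair.rs` and may differ in detail.

```rust
/// RFC 8445 candidate-pair priority:
/// 2^32 * MIN(G, D) + 2 * MAX(G, D) + (G > D ? 1 : 0),
/// where G is the controlling agent's candidate priority and D is the
/// controlled agent's. Higher values are checked/nominated in preference.
fn pair_priority(g: u32, d: u32) -> u64 {
    let (g, d) = (g as u64, d as u64);
    (1u64 << 32) * g.min(d) + 2 * g.max(d) + u64::from(g > d)
}
```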
The easiest way to understand why this takes time is to turn on TRACE or add printlns. `pair.rs` combines a pair of candidates; that’s a good starting point to add printlns and understand why this takes time. Link me any code that doesn’t make sense and I’ll explain what it does.
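If it helps anyone reproducing this, a minimal way to get TRACE output, assuming str0m logs through the `tracing` ecosystem (if your setup uses the `log` facade instead, use `env_logger` with `RUST_LOG=trace`):

```rust
// Call once at startup before creating the Rtc, then watch the ICE agent
// step through candidate pairs in the log output.
fn init_trace_logging() {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::TRACE)
        .init();
}
```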