str0m: Slow ICE connection flow compared to libwebrtc

It’s not exactly an issue, but I want to start a discussion to find ways we can speed up the connection when there are multiple ICE candidates (i.e. ice_lite disabled). Although this benefits primarily the p2p use-case, but as previously mentioned I’d like to improve the ICE agent in str0m and we can start here. Because this is critical for our app.

Context: In a p2p application we’d loop over network interfaces, add each one as host and then start adding srflx and relay candidates. st0rm connects instantly if the host candidate works. But when over a network it seems like each added candidate that doesn’t work adds delay to the connection. This delay is very noticeable when ICE agent needs to go over 4-5 candidate pairs to connect.

In my unscientific tests, I manually started a call via 2 libwebrtc peers and 2 str0m peers with the same STUN and TURN servers configured. str0m took 5x the time libwebrtc took to connect.

What do you think can be the issue? Are we going over candidates sequentially?

About this issue

  • Original URL
  • State: open
  • Created 7 months ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

Relevant: #476

For anybody following along, the issue turned out to be a combination of:

  • Not refreshing our local timer from poll_timeout after changing the agent’s state (i.e. adding candidate)
  • str0m having a minimum advance of 50ms in handle_timeout: https://github.com/algesten/str0m/pull/477

With both of these fixed, I am getting similar results as in #476: str0m needs about 350ms from the changing the state to Checking until the first pair is nominated. This is to a server in the US from Australia so with better latency, I’d assume it is even less.

Candidate pairs on the same relay should be prioritzed over pairs on different relays to reduce latency (1 less hop). If we find a candidate pair on the same relay, we may even want to stop testing the others.

This would mean both sides effectively have the same IP address? Could that be generalised to “same IP” regardless of type of candidate?

I think testing host <> relay candidates is very unlikely to be useful. We may even want to drop those candidate pairs altogether? TURN servers are deployed to the public internet, attempt to reach them via our host candidate is essentially the same as trying to reach them from our server-reflexive candidate. Unless I am missing something, host candidates are mostly useful for hole-punching & direct LAN connections so we should de-prioritize or even omit pairs of host <> relay.

I’m probably missing something, but… our standard use case for an SFU, is a server with a public IP and clients behind NAT, firewalls etc. Wouldn’t host <> relay be the most likely then? It’s quite different to peer-peer.

Or taking a step back, why would removing any pairs be an advantage? Less noise?

It would be useful if str0m would eventually stop testing certain pairs once we have connectivity. Something like: Okay, we hole-punched a connection, lets keep another pair based on relay candidates active but stop testing the others? Or in the case of a relayed connection, keep another one as a backup (different IP) in case this TURN server becomes unavailable.

Sure. Let’s discuss possible strategies on Zulip.

A few more thoughts:

  • Candidate pairs on the same relay should be prioritzed over pairs on different relays to reduce latency (1 less hop). If we find a candidate pair on the same relay, we may even want to stop testing the others.
  • I think testing host <> relay candidates is very unlikely to be useful. We may even want to drop those candidate pairs altogether? TURN servers are deployed to the public internet, attempt to reach them via our host candidate is essentially the same as trying to reach them from our server-reflexive candidate. Unless I am missing something, host candidates are mostly useful for hole-punching & direct LAN connections so we should de-prioritize or even omit pairs of host <> relay.
  • It would be useful if str0m would eventually stop testing certain pairs once we have connectivity. Something like: Okay, we hole-punched a connection, lets keep another pair based on relay candidates active but stop testing the others? Or in the case of a relayed connection, keep another one as a backup (different IP) in case this TURN server becomes unavailable.

Thanks! One question: is it safe to use one certificate for multiple connections or I should make a pool?

They are strictly use once, or you’re opening up a security hole. Hm. I see it’s Clone. That’s no good. I’ll fix that now.

Nice finds!

  1. I started by reducing the MAX_RTO from 3000 to 1500, it helped as I was seeing success in 3-4 attempts.

Let’s double check this against libWebRTC. I don’t think there’s a problem lowering it, but that also means more STUN packets being sent in a short time.

  1. I noticed the process after a nomination success -> connected state (after DTLS is established and all) is sometimes >%30 of the connection time (measured since first nominated pair).

This could potentially be the certificate generation DtlsCert::new(). This is why @xnorpx made RtcConfig::set_dtls_cert() so that new certificates can be made ahead of time, or in another thread at the same time as starting the STUN negotiation.

@algesten excited to work on debugging this, thanks for the info!

Thanks for raising this.

I don’t think there’s anything deliberately slowing things down. I think all pairs are tested at the same time.

To explain what’s happening. As you know, you add ice candidates, both local and remote.

Local candidates are combined with remote candidates into pairs. Pairs are considers differently good, a direct host-host connection is better than going through turn servers.

Once a pair is formed, we start making STUN requests with that pair as sender/receiver.

If a STUN requests goes through and we receive an answer, the pair is a success. We nominate the pair as the active one.

The best prio successful pair “wins”.

The easiest way to understand why this takes time is to turn on TRACE or add println.

pair.rs combines a pair of candidates. That’s a good starting point to println and understand why this takes time.

Link me any code that doesn’t make sense and i’ll explain what it does.