livekit: Client is missing published tracks in certain conditions
Describe the bug
@bekriebel reported this in Slack. When users join the room at roughly the same time from different regions (where latency is a bigger concern), clients sometimes report TrackSubscriptionFailure events with logs like the following:
could not find published track PA_8qxvzPbj3G3R TR_QKFBfQyMBW6Q
addSubscribedMediaTrack @ RemoteParticipant.js?f400:71
eval @ RemoteParticipant.js?f400:78
setTimeout (async)
addSubscribedMediaTrack @ RemoteParticipant.js?f400:77
eval @ RemoteParticipant.js?f400:78
setTimeout (async)
addSubscribedMediaTrack @ RemoteParticipant.js?f400:77
eval @ RemoteParticipant.js?f400:78
According to @bekriebel, he’s able to reproduce this when the server instance is located on a node far away from him (Frankfurt to Seattle).
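For reference, the failure surfaces on the client through the room's subscription-failure event. Below is a minimal sketch of how to observe it, assuming the livekit-client `RoomEvent.TrackSubscriptionFailed` event; the exact event name, handler signature, and `connect` call reflect the current JS SDK and may differ slightly in 0.13.x, and `url`/`token` are placeholders:

```ts
import { Room, RoomEvent, RemoteParticipant } from 'livekit-client';

// Placeholders for the LiveKit server URL and access token.
declare const url: string;
declare const token: string;

const room = new Room();

// Sketch: surface subscription failures like the one in the log above.
// The stack trace shows the SDK retrying via setTimeout inside
// addSubscribedMediaTrack before giving up, so by the time this event
// fires the published track was never found through signaling.
room.on(RoomEvent.TrackSubscriptionFailed, (trackSid: string, participant: RemoteParticipant) => {
  console.warn(`could not subscribe to track ${trackSid} from ${participant.identity}`);
});

await room.connect(url, token);
```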
Server
- Version: 0.13.6
Client
- SDK: JS
- Version: 0.13.6
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 31 (31 by maintainers)
@davidzhao I’ve completed my tests. I can’t find any scenario where I can cause the failed track error or the missing participant error 🥳 Nice work!
After stressing it some more, I’ve found that the “user missing a track for all clients” issue appears to be related to this server error:
pc ERROR: 2021/10/31 21:26:22 Incoming unhandled RTP ssrc(3954959657), OnTrack will not be fired. incoming SSRC failed Simulcast probing
When I notice a track is missing, it coincides with that error being spammed into the server log. I’m only ever able to trigger it using my worst-case setup, and it doesn’t happen every time. I can open a separate ticket for this, though - as you said, it’s likely unrelated. Thanks for all of your work on this! Please let me know if there are any other tests you’d like me to run.
I did some more testing with various scenarios. I can reproduce the issue in each of these setups, but there is a noticeable difference in how easy it is to trigger. For each test, I tried combinations of where the room (SFU) was hosted and where the signaling node was hosted; each is either close to me (low latency) or far from me (high latency).
- Same node, close: low occurrence of issue
- SFU close, signaling far: low/medium occurrence of issue
- Same node, far: medium occurrence of issue
- SFU far, signaling close: high occurrence of issue
In all cases, the speed at which I connect the clients also seems to matter. If I connect them one after the other within a second or two of each other, I can reproduce the issue. If I pause between adding each client and give the connections a chance to settle, I haven’t been able to reproduce it, even in the worst-case scenario (far SFU, close signaling server). If I use code to make all clients reconnect at exactly the same time, the issue is exacerbated even further.
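To make the timing dependence concrete, here is a rough sketch of the two connection patterns described above (simultaneous vs. staggered joins), using livekit-client. The `url`/`tokens` inputs, the helper name, and the exact `connect` signature are assumptions for illustration, not part of the original report:

```ts
import { Room } from 'livekit-client';

// Hypothetical test harness: connect N clients either all at once
// (joins race each other) or one at a time with a pause in between.
async function connectClients(url: string, tokens: string[], staggerMs: number): Promise<Room[]> {
  const rooms = tokens.map(() => new Room());

  if (staggerMs === 0) {
    // Worst case: start all connections at the same time so the joins overlap.
    await Promise.all(rooms.map((room, i) => room.connect(url, tokens[i])));
  } else {
    // Settled case: connect one client, wait, then connect the next.
    for (let i = 0; i < rooms.length; i++) {
      await rooms[i].connect(url, tokens[i]);
      await new Promise((resolve) => setTimeout(resolve, staggerMs));
    }
  }
  return rooms;
}

// e.g. connectClients(url, tokens, 0) matches the back-to-back joins that
// reproduce the issue, while connectClients(url, tokens, 2000) matches the
// paused joins that did not.
```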