stunner: help - intermittent failures connecting to workers on `cloudretro` example on AWS + EKS + ALB

I’ve been building the cloudretro example for a while on multiple Kubernetes distributions without issues. Now I’m trying to run this example on an AWS setup (EKS + Fargate + ALB) and I’m getting some intermittent errors:

  • Sometimes I’m able to connect to the workers, and other times the connection times out
  • I suspect it has something to do with the ICE candidates reported by the application. I’ve posted evidence of the differences below

Versions:

  • I’m using both stunner and stunner-gateway-operator versions 0.16.0 (chart and app)
  • EKS 1.27

Based on the information below, what is the culprit here? Is there anything I can tweak server side to facilitate this discovery and ensure a successful connection on first attempt?

Failed attempt

The following errors were captured in the browser using the Developer Tools.

  • When I first reach the coordinator, I usually get the error [rtcp] ice gathering was aborted due to timeout 2000ms.
  • I notice that only one user candidate, 02895ab7-03e2-4f4a-9afe-daa99822e2d5.local, gets reported, and it’s not reachable (and should not be!)

Error console:

keyboard.js?v=5:128 [input] keyboard has been initialized
joystick.js?v=3:275 [input] joystick has been initialized
touch.js?v=3:304 [input] touch input has been initialized
socket.js?v=4:36 [ws] connecting to wss://home.company.com/ws?room_id=&zone=
socket.js?v=4:42 [ws] <- open connection
socket.js?v=4:43 [ws] -> setting ping interval to 2000ms
controller.js?v=8:79 [ping] <-> {http://worker.company.com:9000/echo: 9999}
rtcp.js?v=4:17 [rtcp] <- received coordinator's ICE STUN/TURN config: [{"urls":"turn:udp.company.com:3478","username":"user-1","credential":"fQvzu2pFOBxtW5Al"}]
rtcp.js?v=4:106 [rtcp] ice gathering
rtcp.js?v=4:120 [rtcp] <- iceConnectionState: checking
rtcp.js?v=4:100 [rtcp] user candidate: {"candidate":"candidate:1680066927 1 udp 2113937151 02895ab7-03e2-4f4a-9afe-daa99822e2d5.local 54853 typ host generation 0 ufrag rj2C network-cost 999","sdpMid":"0","sdpMLineIndex":0,"usernameFragment":"rj2C"}
rtcp.js?v=4:108 [rtcp] ice gathering was aborted due to timeout 2000ms
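
The candidate lines in these logs follow the standard ICE candidate attribute layout (foundation, component, transport, priority, address, port, `typ`, optional `raddr`/`rport`). As a rough aid for reading them, here is a minimal Python sketch that splits one candidate string into fields and flags `.local` mDNS hostnames. The names `parse_candidate` and `is_mdns` are made up for illustration and are not part of cloudretro or STUNner.

```python
import re
from typing import Optional

# candidate:<foundation> <component> <transport> <priority> <address> <port>
#   typ <type> [raddr <addr> rport <port>] ...
CANDIDATE_RE = re.compile(
    r"candidate:(?P<foundation>\S+) (?P<component>\d+) (?P<transport>\S+) "
    r"(?P<priority>\d+) (?P<address>\S+) (?P<port>\d+) typ (?P<type>\S+)"
    r"(?: raddr (?P<raddr>\S+) rport (?P<rport>\d+))?"
)


def parse_candidate(line: str) -> Optional[dict]:
    """Return the candidate fields as strings, or None if the line does not match."""
    match = CANDIDATE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Hypothetical helper flag: an address ending in ".local" is an mDNS name
    # that the browser uses in place of the real host IP.
    fields["is_mdns"] = fields["address"].endswith(".local")
    return fields


# The only candidate reported in the failed attempt above.
failed_candidate = (
    "candidate:1680066927 1 udp 2113937151 "
    "02895ab7-03e2-4f4a-9afe-daa99822e2d5.local 54853 typ host "
    "generation 0 ufrag rj2C network-cost 999"
)

info = parse_candidate(failed_candidate)
print(info["type"], info["address"], info["is_mdns"])
```

Applied to the failed attempt, this reports a single `host` candidate with an mDNS address, which matches the observation that no reachable candidate was gathered before the 2000ms timeout.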

Successful attempt

After a couple of retries, we finally have success:

  • Notice that now there are 3 user candidates, and one of them is reachable (the one with the public IP)
rtcp.js?v=4:100 [rtcp] user candidate: {"candidate":"candidate:2537811000 1 udp 2113937151 5774c39d-3ada-44b9-b95f-89a858000ac4.local 54903 typ host generation 0 ufrag sRpm network-cost 999","sdpMid":"0","sdpMLineIndex":0,"usernameFragment":"sRpm"}
rtcp.js?v=4:100 [rtcp] user candidate: {"candidate":"candidate:3567609059 1 udp 1677729535 89.180.168.100 42221 typ srflx raddr 0.0.0.0 rport 0 generation 0 ufrag sRpm network-cost 999","sdpMid":"0","sdpMLineIndex":0,"usernameFragment":"sRpm"}
rtcp.js?v=4:100 [rtcp] user candidate: {"candidate":"candidate:4028349664 1 udp 33562623 10.0.22.42 35233 typ relay raddr 

Full log of the successful connection:

keyboard.js?v=5:128 [input] keyboard has been initialized
joystick.js?v=3:275 [input] joystick has been initialized
touch.js?v=3:304 [input] touch input has been initialized
socket.js?v=4:36 [ws] connecting to wss://home.company.com/ws?room_id=&zone=
socket.js?v=4:42 [ws] <- open connection
socket.js?v=4:43 [ws] -> setting ping interval to 2000ms
controller.js?v=8:79 [ping] <-> {http://worker.company.com:9000/echo: 9999}
rtcp.js?v=4:17 [rtcp] <- received coordinator's ICE STUN/TURN config: [{"urls":"turn:udp.company.com:3478","username":"user-1","credential":"fQvzu2pFOBxtW5Al"}]
rtcp.js?v=4:106 [rtcp] ice gathering
rtcp.js?v=4:120 [rtcp] <- iceConnectionState: checking
rtcp.js?v=4:100 [rtcp] user candidate: {"candidate":"candidate:2537811000 1 udp 2113937151 5774c39d-3ada-44b9-b95f-89a858000ac4.local 54903 typ host generation 0 ufrag sRpm network-cost 999","sdpMid":"0","sdpMLineIndex":0,"usernameFragment":"sRpm"}
rtcp.js?v=4:100 [rtcp] user candidate: {"candidate":"candidate:3567609059 1 udp 1677729535 89.180.168.100 42221 typ srflx raddr 0.0.0.0 rport 0 generation 0 ufrag sRpm network-cost 999","sdpMid":"0","sdpMLineIndex":0,"usernameFragment":"sRpm"}
rtcp.js?v=4:100 [rtcp] user candidate: {"candidate":"candidate:4028349664 1 udp 33562623 10.0.22.42 35233 typ relay raddr 89.180.168.178 rport 42221 generation 0 ufrag sRpm network-cost 999","sdpMid":"0","sdpMLineIndex":0,"usernameFragment":"sRpm"}
rtcp.js?v=4:113 [rtcp] ice gathering completed
rtcp.js?v=4:120 [rtcp] <- iceConnectionState: connected
rtcp.js?v=4:123 [rtcp] connected...
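
To compare the two attempts without reading the raw lines, the candidate types can be tallied directly from the console output. The following is only a sketch in Python, assuming the log format shown above (a `user candidate: ` marker followed by a JSON object); the function name `candidate_types` is hypothetical, and the sample lines are copied from the two logs in this report.

```python
import json
import re
from collections import Counter

MARKER = "user candidate: "


def candidate_types(log_text: str) -> Counter:
    """Count ICE candidate types (host, srflx, relay, ...) in a console log."""
    counts = Counter()
    for line in log_text.splitlines():
        start = line.find(MARKER)
        if start == -1:
            continue  # not a candidate line
        try:
            payload = json.loads(line[start + len(MARKER):])
        except json.JSONDecodeError:
            continue  # skip lines whose JSON is truncated
        found = re.search(r" typ (\S+)", payload.get("candidate", ""))
        if found:
            counts[found.group(1)] += 1
    return counts


failed_log = '''rtcp.js?v=4:106 [rtcp] ice gathering
rtcp.js?v=4:100 [rtcp] user candidate: {"candidate":"candidate:1680066927 1 udp 2113937151 02895ab7-03e2-4f4a-9afe-daa99822e2d5.local 54853 typ host generation 0 ufrag rj2C network-cost 999","sdpMid":"0","sdpMLineIndex":0,"usernameFragment":"rj2C"}
rtcp.js?v=4:108 [rtcp] ice gathering was aborted due to timeout 2000ms'''

success_log = '''rtcp.js?v=4:100 [rtcp] user candidate: {"candidate":"candidate:2537811000 1 udp 2113937151 5774c39d-3ada-44b9-b95f-89a858000ac4.local 54903 typ host generation 0 ufrag sRpm network-cost 999","sdpMid":"0","sdpMLineIndex":0,"usernameFragment":"sRpm"}
rtcp.js?v=4:100 [rtcp] user candidate: {"candidate":"candidate:3567609059 1 udp 1677729535 89.180.168.100 42221 typ srflx raddr 0.0.0.0 rport 0 generation 0 ufrag sRpm network-cost 999","sdpMid":"0","sdpMLineIndex":0,"usernameFragment":"sRpm"}
rtcp.js?v=4:100 [rtcp] user candidate: {"candidate":"candidate:4028349664 1 udp 33562623 10.0.22.42 35233 typ relay raddr 89.180.168.178 rport 42221 generation 0 ufrag sRpm network-cost 999","sdpMid":"0","sdpMLineIndex":0,"usernameFragment":"sRpm"}
rtcp.js?v=4:113 [rtcp] ice gathering completed'''

failed_counts = candidate_types(failed_log)
success_counts = candidate_types(success_log)
print("failed: ", dict(failed_counts))
print("success:", dict(success_counts))
```

On these samples the failed attempt yields only a `host` candidate, while the successful one yields `host`, `srflx`, and `relay`. In other words, the server-reflexive and relay candidates are the ones that did not arrive before the gathering timeout in the failed case.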

I appreciate any help on this matter!

About this issue

  • State: open
  • Created 9 months ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments