geckos.io: Geckos 2.0 HTTP server hangs after a while
Hey, there! I recently upgraded to geckos 2.0. Everything has been working swimmingly, except for one mysterious problem. After about two days of runtime, my geckos server starts timing out all requests. It sits behind an NGINX proxy, so the first error I see is:
2021/11/09 15:44:56 [error] 15785#15785: *5965624 upstream timed out (110: Connection timed out) while reading response header from upstream, request: "POST /geckos/.wrtc/v2/connections HTTP/1.1", upstream: "http://127.0.0.1:3030/.wrtc/v2/connections"
eventually followed by a similar error:
2021/11/09 16:06:34 [error] 15785#15785: *5969357 upstream timed out (110: Connection timed out) while connecting to upstream, request: "POST /geckos/.wrtc/v2/connections HTTP/1.1", upstream: "http://127.0.0.1:3030/.wrtc/v2/connections"
I haven’t found any correlated errors coming from the geckos server itself. It just suddenly starts hanging and won’t reply anymore 😦 CPU and MEM usage seem normal.
Any idea how best to dig into this?
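For context, a minimal geckos.io 2.x server behind a proxy looks roughly like the sketch below (port 3030 matches the upstream in the logs above; the event names and handlers are placeholders, not my actual code):

```ts
// Minimal geckos.io server sketch (placeholders, not the actual app).
// NGINX proxies /geckos/ to this process on 127.0.0.1:3030, which also
// serves the /.wrtc/v2/connections signaling endpoint seen in the logs.
import geckos from '@geckos.io/server'

const io = geckos()

io.listen(3030)

io.onConnection(channel => {
  channel.onDisconnect(() => {
    console.log(`${channel.id} disconnected`)
  })

  // Placeholder event: relay a message to everyone in the same room.
  channel.on('chat message', data => {
    io.room(channel.roomId).emit('chat message', data)
  })
})
```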
@marcwehbi Thank you for the gdb output, it looks like a regression introduced in libdatachannel v0.15.4. I refactored the transports teardown and it looks like it might make the PeerConnection deadlock on close. The timings triggering the deadlock must happen on one machine and not on the other for some reason.
Haven’t seen any deadlocks in the last 4 days! Looks fixed!
@murat-dogan @paullouisageneau @yandeu thank you all! I just deployed geckos v2.1.4. I was seeing deadlock about once per day before, so I’ll wait a couple of days, see what happens, then close this bad boy.
This should be fixed in libdatachannel v0.15.6, this is the PR to update node-datachannel: https://github.com/murat-dogan/node-datachannel/pull/64
@bennlich In the trace, a thread waits for a lock somewhere in rtc::impl::PeerConnection::closeTransports()::{lambda()#1}::operator()() while another waits for the first one to finish in rtc::impl::Processor::join(). There is no debug info, but given the scenario, it appears the lock is related to callback synchronization and is rightfully held by the second thread. The mistake was that callbacks were reset at the wrong place in closeTransports(), creating the deadlock risk.

@murat-dogan Thanks!
I also just released geckos v2.1.4
@paullouisageneau Thanks a lot 👍🏻😊🥳
Just want to share an automated script to install the geckos example on Ubuntu 20.04 (AWS EC2). It will not solve this issue, but maybe it helps anyway 😃
Security Group (Firewall)
Installation Script
The name of the user is ubuntu.

I've just swapped to Twilio's STUN server; I'll see in an hour or so if it freezes again. If it does, I'm just going to destroy the droplet and restore it from the EU image to see if maybe it's some server configuration I overlooked.
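For anyone wanting to try the same swap, here's a rough sketch of how the STUN server can be overridden via the iceServers option on the geckos.io server (the Twilio URL is only an illustration, not taken from this thread; check Twilio's docs for the current one):

```ts
// Sketch: pointing geckos.io at a different STUN server.
// The iceServers option takes standard RTCIceServer entries; the exact
// Twilio URL below is an assumption for illustration.
import geckos from '@geckos.io/server'

const io = geckos({
  iceServers: [{ urls: 'stun:global.stun.twilio.com:3478' }]
})

io.listen(3030)
```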