rosbridge_suite: rosbridge_server hanging randomly under load
Expected Behavior
rosbridge_server does not hang.
Actual Behavior
rosbridge_server hangs under moderate load: several connections at fairly high rates (50 Hz), but relatively little data throughput.
Steps to Reproduce the Problem
- Run rosbridge_server v 0.11.2 on an RPi 3B+
- Connect to it from several web browser clients.
- Wait.
Specifications
- ROS Version (`echo $ROS_DISTRO`): kinetic
- OS Version (`grep DISTRIB_CODENAME /etc/lsb-release`): Raspbian GNU/Linux 9 (stretch)
- Rosbridge Version (`roscat rosbridge_server package.xml | grep '<version>'`): 0.11.2
- Tornado Version (`python -c 'import tornado; print tornado.version'`): 4.4.3
We are having trouble with rosbridge_server (0.11.2) freezing under a fair load (a number of connections on a somewhat flaky network, rather than high throughput). The freeze appears to be random and occurs after some period, typically 10-30 seconds.
I investigated, and it looks like a race condition on `_write_lock`, similar to the one in issue #403, but not solved by the change in #413 (which is present in 0.11.2).
After some testing and reading the issue reports, I noticed that somebody had suggested that the behaviour of `yield` inside a `with lock` section wasn't well defined. So I tried:
    def prewrite_message(self, message, binary):
        # Use a try block because the log decorator doesn't cooperate with @coroutine.
        try:
            with self._write_lock:
                future = self.write_message(message, binary)

            # When closing, self.write_message() returns None, even though that is
            # undocumented output. Treat it as a WebSocketClosedError.
            if future is None:
                raise WebSocketClosedError

            yield future
That is, I un-indented after the line “future =”, so the lock is released once the call has been made, but before the yield call.
Immediately, the behaviour seems to be fixed, and it has now been running robustly for several minutes without freezing again. I cannot comment at this stage on whether there have been any side effects, or data corruption, but the upstream signals look fine so far.
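For contrast, this is how I read the unmodified method (a reconstruction from the change described above, not a verbatim copy of the shipped file), with the `yield` still under the lock:

    @coroutine
    def prewrite_message(self, message, binary):
        # Use a try block because the log decorator doesn't cooperate with @coroutine.
        try:
            with self._write_lock:
                future = self.write_message(message, binary)
                if future is None:
                    raise WebSocketClosedError
                # The coroutine suspends here while still holding _write_lock, so
                # any other writer scheduled in the meantime blocks on the lock.
                yield future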
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 61 (23 by maintainers)
Commits related to this issue
- Don't block Subscription.unregister() Fixes #425 — committed to EndPointCorp/rosbridge_suite by mvollrath 4 years ago
- Decouple incoming WS handling from server thread Fixes #425 — committed to EndPointCorp/rosbridge_suite by mvollrath 4 years ago
Releasing #502 in 0.11.8 for kinetic and melodic. Thanks to everybody who has helped track this down and test.
@benmitch @reinzor @nickvaras @Axel13fr @BrodieBirko @ifollowbenoit @basecoplan @J-Rojas
Let me know how it goes, of course; my application only uses a small set of rosbridge's features, so we depend on your reports.
I hope to solve most Tornado issues by removing Tornado from rosbridge.
Also if you’re not already, make sure all your topics have non-zero
queue_lengthor else they’re all coupled to therospy.Subscriberthread together.I tried to repro this with a couple of different mobile devices but couldn’t quite hit the magic. Then I tried sending SIGSTOP to a Firefox client and got the condition where other clients are hanging.
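For the `queue_length` advice above, a minimal sketch of a subscribe op with a non-zero `queue_length`, assuming the standard rosbridge v2 protocol fields; the URL and topic are placeholders, and the `websocket-client` package is used just for illustration:

    import json
    from websocket import create_connection  # pip package "websocket-client"

    # URL and topic are placeholders for your own setup.
    ws = create_connection("ws://robot.local:9090")
    ws.send(json.dumps({
        "op": "subscribe",
        "topic": "/example/status",  # hypothetical topic
        "queue_length": 1,           # non-zero: per-subscriber queue, decoupled from the rospy thread
        "throttle_rate": 0,          # ms between messages; 0 = no throttling
    }))
    print(ws.recv())  # first forwarded message, as a JSON string
    ws.close()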
The problem in rosbridge is that while a `QueueMessageHandler` thread is blocking in the websocket `OutgoingValve`, it also blocks adding new messages to the queue and pushing off the old ones, which then blocks the rospy topic thread. This means the queue decoupling I mentioned doesn't actually work.

The solution is to not hold the queue `Condition` while pushing data to the `OutgoingValve` (pattern sketched below). Please try this PR and see if it fixes things for you.

I'll be working on an automated stress test to reproduce these kinds of issues.
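A minimal sketch of that locking pattern, with hypothetical class names rather than the real `QueueMessageHandler`/`OutgoingValve`: pop the message while holding the `Condition`, release it, and only then do the potentially blocking push:

    import threading

    class QueueWorkerSketch(object):
        # Hypothetical names; this only illustrates the locking pattern, not the
        # real rosbridge classes.

        def __init__(self, push_downstream):
            self._cond = threading.Condition()
            self._queue = []
            self._push_downstream = push_downstream  # may block on backpressure

        def enqueue(self, msg):
            # Called from the rospy topic thread; must never block for long.
            with self._cond:
                self._queue.append(msg)
                self._cond.notify()

        def run_once(self):
            # Hold the Condition only long enough to pop one message...
            with self._cond:
                while not self._queue:
                    self._cond.wait()
                msg = self._queue.pop(0)
            # ...and release it before the potentially blocking downstream push,
            # so enqueue() is never stuck behind the websocket valve.
            self._push_downstream(msg)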
Please try #496 and see if it helps.
I think this explains what the problem is. To summarize, an incoming message is blocked by an outgoing message (or the other way around, depending on how you look at it). A hypothesized sequence of events:
`unregister()` is called on the `Subscription` from the IOLoop thread, which synchronously calls `finish()` on the `QueueMessageHandler`, which is blocked from writing the rest of its queue because the `QueueMessageHandler` thread is already waiting for the backpressure lock. Only the IOLoop thread can open the backpressure lock, but that thread is waiting for the `QueueMessageHandler` thread. This would happen in all backpressure-protected releases.
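A stripped-down sketch of that deadlock shape, using plain `threading` and hypothetical names instead of Tornado and the real rosbridge classes; the main thread stands in for the IOLoop, joins the worker, and is therefore never able to release the backpressure the worker is waiting on:

    import threading

    backpressure = threading.Semaphore(0)  # only the "IOLoop" thread ever releases this
    finished = threading.Event()

    def queue_message_handler():
        # Worker thread: blocked waiting for the IOLoop to relieve backpressure
        # before it can write the rest of its queue.
        backpressure.acquire()
        finished.set()

    worker = threading.Thread(target=queue_message_handler)
    worker.start()

    # The main thread plays the IOLoop: unregister() synchronously finish()es the
    # handler, i.e. it waits for the worker instead of relieving backpressure...
    worker.join(timeout=2)  # stands in for the blocking finish() call

    # ...so the release below is never reached while we are blocked above: deadlock.
    print("deadlocked" if not finished.is_set() else "ok")
    backpressure.release()  # let the demo exit cleanly
    worker.join()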
Possible specific mitigations:
- Where `Subscriber.unregister()` is calling `self.handler.finish()`, run it in a different `Thread` to decouple it from the server thread.
- Where `Subscriber.unregister()` is calling `self.handler.finish()`, add a flag to `MessageHandler.finish()` to indicate that it should not publish any more messages or block for any reason. In `QueueMessageHandler` this would clear the queue. This is the most appealing to me (sketched below), because otherwise you will keep receiving backlogged messages after sending an unsubscribe op.

A more general mitigation:

- Intercept the `self.protocol.incoming()` calls at the `rosbridge_server` level and run them in a separate `Thread`. It would eliminate this entire class of deadlock (incoming messages blocked by outgoing messages). Each server implementation would be responsible for doing this.

I think a specific mitigation is safest because this is an issue at the rosbridge_library level. The more general incoming/outgoing decoupling creates more problems and decisions and should be held back in case specific mitigations can't solve the problem.
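A rough sketch of the second (flag-based) mitigation, with a hypothetical class standing in for `QueueMessageHandler`; the added `block=False` path clears the backlog and stops publishing rather than draining the queue:

    import threading

    class QueueMessageHandlerSketch(object):
        # Hypothetical stand-in for rosbridge's QueueMessageHandler, showing only
        # the proposed finish(block=False) behaviour.

        def __init__(self):
            self._cond = threading.Condition()
            self._queue = []
            self._alive = True

        def push(self, msg):
            with self._cond:
                if self._alive:
                    self._queue.append(msg)
                    self._cond.notify()

        def finish(self, block=True):
            with self._cond:
                if not block:
                    # Unsubscribe path: drop the backlog and never wait on the
                    # OutgoingValve again, so the caller cannot deadlock on us.
                    del self._queue[:]
                self._alive = False
                self._cond.notify_all()
            # With block=True, a real implementation would instead wait here for
            # the worker thread to drain the remaining queue.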
The bunch of buffered messages is due to the following: when the rosbridge server doesn't kick the client out on a timeout, it assumes the client is still connected and continues to write to the TCP socket. The OS's TCP send buffer fills up, so when your front end / client comes back online, it receives the whole buffered content as fast as the connection allows.
But after that, I’ve seen the websocket hanging in some cases.
@mvollrath here are some more details of all the testing we did and their results. I haven’t started diving into the code yet.
@nickvaras the problem occurs when switching networks, which is how we reproduced it, but it is not limited to that.
We managed to reproduce the problem as follows:
Note that this problem happens with 0.11.3 and 0.11.4, so it seems independent of the websocket implementation. We have had other occurrences of this hanging but couldn't reproduce them.
Note as well that we ran the following stress test: opening about 10 front-end instances at the same time, so that the rosbridge server hit 100% CPU on our embedded PC while registering all topics. This didn't cause any issues. But when we did the same thing with just one client connecting as described above, it did. So I'm not at all sure that this is a load issue; it looks more like a problem with a TCP socket that is still considered alive…
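For reference, a rough sketch of the kind of multi-client stress test described above; the URL, topic, and client count are placeholders, and it uses the `websocket-client` package:

    import json
    import threading
    from websocket import create_connection  # pip package "websocket-client"

    URL = "ws://robot.local:9090"  # placeholder rosbridge address
    TOPIC = "/example/status"      # placeholder topic
    CLIENTS = 10

    def client(idx):
        ws = create_connection(URL)
        ws.send(json.dumps({"op": "subscribe", "topic": TOPIC, "queue_length": 1}))
        try:
            for _ in range(100):
                ws.recv()  # a permanent hang here is the symptom we are chasing
        finally:
            ws.close()

    threads = [threading.Thread(target=client, args=(i,)) for i in range(CLIENTS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("all clients finished")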