rosbridge_suite: rosbridge_server hanging randomly under load

Expected Behavior

rosbridge_server does not hang.

Actual Behavior

rosbridge_server hangs under modest load (several connections at fairly high rates, around 50 Hz, but with relatively little data throughput).

Steps to Reproduce the Problem

  1. Run rosbridge_server v 0.11.2 on an RPi 3B+
  2. Connect several clients to it from a web browser.
  3. Wait.

Specifications

  • ROS Version (echo $ROS_DISTRO): kinetic
  • OS Version (grep DISTRIB_CODENAME /etc/lsb-release): Raspbian GNU/Linux 9 (stretch)
  • Rosbridge Version (roscat rosbridge_server package.xml | grep '<version>'): 0.11.2
  • Tornado Version (python -c 'import tornado; print tornado.version'): 4.4.3

We are having trouble with rosbridge_server (0.11.2) freezing under a fair load (a number of connections over a somewhat flaky network, rather than high throughput). The freeze appears to be random and occurs after some period, typically 10-30 seconds.

I investigated, and it looks like it is a race condition on _write_lock, similar to the one in issue #403, but not solved by the change in #413 (which is present in 0.11.2).

After some testing, and reading the issue reports, I noticed that somebody had suggested that the behaviour of “yield” inside a “with lock” section wasn’t certain. So I tried:

def prewrite_message(self, message, binary):
    # Use a try block because the log decorator doesn't cooperate with @coroutine.
    try:
        with self._write_lock:
            future = self.write_message(message, binary)

        # When closing, self.write_message() returns None, even though that return
        # value is undocumented. Treat it as a WebSocketClosedError.
        if future is None:
            raise WebSocketClosedError

        yield future

That is, I un-indented the code after the "future =" line, so the lock is released once write_message() has been called, but before the yield.

With this change the behaviour seems to be fixed, and it has now been running robustly for several minutes without freezing again. I cannot comment at this stage on whether there are any side effects or data corruption, but the upstream signals look fine so far.
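
For context, here is a minimal standalone sketch of the concern (not rosbridge code; it assumes only Tornado 4.x and the standard library): a threading lock held across a yield in a coroutine stays held for as long as the coroutine is suspended, so any other thread wanting the same lock is blocked until the awaited future resolves.

import threading
import time

from tornado import gen
from tornado.ioloop import IOLoop

lock = threading.Lock()

@gen.coroutine
def holds_lock_across_yield():
    with lock:
        # The coroutine suspends here for a second, but the lock stays
        # held for the whole time it is suspended.
        yield gen.sleep(1.0)

def other_thread():
    time.sleep(0.1)  # let the coroutine grab the lock first
    start = time.time()
    with lock:       # blocked until the coroutine resumes and exits its "with"
        pass
    print("waited %.2f s for the lock" % (time.time() - start))

threading.Thread(target=other_thread).start()
IOLoop.current().run_sync(holds_lock_across_yield)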

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 61 (23 by maintainers)

Most upvoted comments

Releasing #502 in 0.11.8 for kinetic and melodic. Thanks to everybody who has helped track this down and test.

@benmitch @reinzor @nickvaras @Axel13fr @BrodieBirko @ifollowbenoit @basecoplan @J-Rojas

Let me know how it goes, of course; my application only uses a small set of rosbridge's features, so we depend on your reports.

I hope to solve most Tornado issues by removing Tornado from rosbridge.

Also, if you're not doing so already, make sure all your topics have a non-zero queue_length; otherwise they are all coupled together on the rospy.Subscriber thread.
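
For example, a rosbridge protocol subscribe op with a non-zero queue_length could look like the following sketch (it assumes the third-party websocket-client package; the address, topic name, and message type are placeholders):

import json
import websocket  # pip package "websocket-client"

ws = websocket.create_connection("ws://localhost:9090")  # default rosbridge port
ws.send(json.dumps({
    "op": "subscribe",
    "topic": "/chatter",        # placeholder topic
    "type": "std_msgs/String",  # placeholder message type
    "queue_length": 5,          # non-zero: this subscription gets its own bounded queue
    "throttle_rate": 0,         # milliseconds between messages; 0 = no throttling
}))
print(ws.recv())  # first published message, as JSON
ws.close()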

I tried to repro this with a couple of different mobile devices but couldn’t quite hit the magic. Then I tried sending SIGSTOP to a Firefox client and got the condition where other clients are hanging.

The problem in rosbridge is that while a QueueMessageHandler thread is blocked in the websocket OutgoingValve, it also blocks adding new messages to the queue and pushing out the old ones, which in turn blocks the rospy topic thread. This means the queue decoupling I mentioned doesn't actually work.

The solution is to not hold the queue Condition while pushing data to the OutgoingValve. Please try this PR and see if it fixes things for you.
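
A rough sketch of that approach (illustrative names only, not the actual PR): the consumer thread pops the next message while holding the Condition, but releases it before handing the message to the possibly blocking outgoing write, so producers are never stalled by a slow client.

import threading
from collections import deque

class QueueMessageHandlerSketch(object):
    def __init__(self, outgoing_write):
        self._outgoing_write = outgoing_write  # e.g. the websocket write callable
        self._cond = threading.Condition()
        self._queue = deque()
        self._alive = True

    def handle_message(self, msg):
        # Called from the rospy topic thread; never blocks on the client.
        with self._cond:
            self._queue.append(msg)
            self._cond.notify()

    def run(self):
        # Consumer thread: the only thread allowed to block on the client.
        while True:
            with self._cond:
                while self._alive and not self._queue:
                    self._cond.wait()
                if not self._alive:
                    return
                msg = self._queue.popleft()
            # The Condition is released here, so a blocked write stalls
            # only this thread, not the producers.
            self._outgoing_write(msg)

    def finish(self):
        with self._cond:
            self._alive = False
            self._cond.notify_all()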

I’ll be working on an automated stress test to reproduce these kinds of issues.

Please try #496 and see if it helps.

I think this explains what the problem is. To summarize, an incoming message is blocked by an outgoing message (or the other way around, depending on how you look at it). A hypothesized sequence of events:

  1. A client connects to the socket.
  2. Client subscribes to some topics. Data starts flowing to the client.
  3. The client can’t process messages fast enough and the outgoing backpressure lock is closed by the IOLoop thread.
  4. The client sends a message to explicitly unsubscribe from a topic to which it is the only subscriber. This immediately invokes unregister() on the Subscription from the IOLoop thread, which synchronously calls finish() on the QueueMessageHandler. finish() waits for the handler to write out the rest of its queue, but the QueueMessageHandler thread is already waiting for the backpressure lock. Only the IOLoop thread can open the backpressure lock, and that thread is now waiting for the QueueMessageHandler thread: deadlock.

This would happen in all backpressure-protected releases.
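
Reduced to its bare shape, the suspected deadlock looks like this (a schematic illustration only, not rosbridge code): one thread waits on a gate that only another thread can open, while that other thread is synchronously waiting for the first one to finish.

import threading

backpressure_open = threading.Event()  # stand-in for the OutgoingValve

def queue_message_handler():
    # Waiting for the "IOLoop" thread to re-open the backpressure valve.
    backpressure_open.wait()

worker = threading.Thread(target=queue_message_handler)
worker.start()

# The "IOLoop" thread handles the unsubscribe op and synchronously waits for
# the handler to finish, so it never gets around to opening the valve.
worker.join(timeout=2)
print("handler still blocked:", worker.is_alive())  # True: the classic mutual wait
backpressure_open.set()  # release the worker so this demo can exit cleanly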

Possible specific mitigations:

  • When Subscriber.unregister() is calling self.handler.finish(), run it in a different Thread to decouple it from the server thread.
  • When Subscriber.unregister() is calling self.handler.finish(), add a flag to MessageHandler.finish() indicating that it should not publish any more messages or block for any reason. In QueueMessageHandler this would clear the queue. This is the most appealing to me, because otherwise you will keep receiving backlogged messages after sending the unsubscribe op (a rough sketch follows this list).
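
A rough sketch of that flagged finish(), with purely illustrative names and a minimal queue/Condition layout:

import threading
from collections import deque

class FlaggedFinishSketch(object):
    def __init__(self):
        self._cond = threading.Condition()
        self._queue = deque()
        self._alive = True

    def finish(self, block=True):
        with self._cond:
            if not block:
                self._queue.clear()  # drop the backlog: publish nothing more
            self._alive = False
            self._cond.notify_all()  # wake the consumer so it exits instead of waiting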

A more general mitigation:

  • Queue all self.protocol.incoming() calls at the rosbridge_server level and run them in a separate Thread (see the sketch below). This would eliminate this entire class of deadlock (an incoming message blocked by outgoing messages). Each server implementation would be responsible for doing this.
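
A minimal sketch of that decoupling, with hypothetical names (the server's on_message handler would simply call dispatch() instead of running protocol.incoming() inline):

import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2 (kinetic)

class IncomingDispatcher(object):
    def __init__(self, protocol):
        self._protocol = protocol  # the rosbridge protocol instance
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True
        self._thread.start()

    def dispatch(self, message):
        # Called from the IOLoop/server thread; returns immediately, so
        # outgoing backpressure can never block incoming ops such as
        # "unsubscribe".
        self._queue.put(message)

    def _run(self):
        while True:
            message = self._queue.get()
            if message is None:  # shutdown sentinel
                return
            self._protocol.incoming(message)

    def close(self):
        self._queue.put(None)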

I think a specific mitigation is safest because this is an issue at the rosbridge_library level. The more general incoming/outgoing decoupling creates more problems and decisions and should be held back in case specific mitigations can’t solve the problem.

The burst of buffered messages is due to the following: when the rosbridge server doesn't kick the client out on a timeout, it assumes the client is still connected and keeps writing to the TCP socket. The OS's TCP socket buffer fills up, so when your front end / client comes back online, it receives the whole buffered content as fast as the connection allows.

But after that, I’ve seen the websocket hanging in some cases.

@mvollrath here are some more details of all the testing we did and their results. I haven’t started diving into the code yet.

@nickvaras the problem occurs when switching networks, which is how we reproduced it, but it is not limited to that.

We managed to reproduce the problem as follows:

  • setting a ping timeout and a ping interval so that non-responsive clients get kicked out by the server
  • using a web front end with roslibjs to connect to the websocket server (which subscribes to about 30 topics and 15 services)
  • after a successful connection from the front end, suddenly disconnecting it, for example by switching off Wi-Fi (after some time, due to the ping timeout, the server will disconnect the client); the front end won't detect that it is no longer connected, because of the long websocket timeouts
  • reconnecting Wi-Fi/the network, so the front end suddenly receives a socket "close"
  • when the front end reconnects a few seconds later, the server registers the topics as before and then HANGS

Note that this problem happens with 0.11.3 and 0.11.4, so it seems independent of the websocket implementation. We have had other occurrences of this hanging but couldn't reproduce them.

Note as well that we ran the following stress test: opening about 10 front end instances at the same time, so that the rosbridge server hit 100% CPU on our embedded PC while registering all the topics. This didn't cause any issues. But when we did the same thing with just one client connecting and disconnecting as described above, it did cause the hang. So I'm not at all sure this is a load issue; it looks more like a problem with a TCP socket that is still considered alive…