rclpy: `MultiThreadedExecutor:spin_until_future_complete` can block when the future is ready

Bug report

Required Info:

Operating System:
- Ubuntu 20.04
Installation type:
- From source
Version or commit hash:
- Foxy
DDS implementation:
- Fast-RTPS
Client library (if applicable):
- rclpy

Steps to reproduce the issue

Publish a message from one node, using a latching QoS profile (latching_qos, transient-local durability).
In another node, subscribe to the topic with a callback that sets the result of a future.
Add both nodes to a MultiThreadedExecutor.
Call spin_until_future_complete on that future.

import rclpy
from rclpy.executors import MultiThreadedExecutor, SingleThreadedExecutor
from rclpy.node import Node
from std_msgs.msg import String
from rclpy.qos import QoSProfile, QoSDurabilityPolicy
from rclpy.task import Future
import os

latching_qos = QoSProfile(depth=1,
    durability=QoSDurabilityPolicy.RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL)

def main():

    rclpy.init()
    
    # Set up publisher
    pubnode = Node('pubnode_' + str(os.getpid()))
    pub1 = pubnode.create_publisher(String, 'topic1', latching_qos)
    msg1 = String()
    msg1.data = "hello1"
    pubnode.get_logger().info("Publishing hello1")
    pub1.publish(msg1)

    # Set up listener
    future_msgs = Future()
    subnode = Node('subnode_' + str(os.getpid()))
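    # The list literal lets the lambda run two expressions in order:
    # log the reception, then complete the future with the message.
    subnode.create_subscription(String, 'topic1', lambda msg: ([
            subnode.get_logger().info("Received message on topic1"),
            future_msgs.set_result(msg)
    ]), latching_qos)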

    # Start nodes
    exe = MultiThreadedExecutor()
    exe.add_node(pubnode)
    exe.add_node(subnode)

    future_msgs.add_done_callback(lambda fut : print("Future is done"))
    exe.spin_until_future_complete(future_msgs)

if __name__ == '__main__':
    main()

Expected behavior

The subnode should receive the message, set the future as complete, and then the program should exit.

Actual behavior

The subnode receives the message and sets the future as complete, but exe.spin_until_future_complete(future_msgs) never returns.

Additional information

This only happens with the MultiThreadedExecutor. If I swap it out for the SingleThreadedExecutor, it works as expected.

If the pub node is running in a different process, it also works as expected.

I have also asked a question on ROS Answers here.

Here is a workspace that you can clone and run to immediately test this issue. Here is the code for the node defined in that workspace. It is very similar to the code posted in this bug report, but with a few argparse options.

About this issue

State: closed
Created 4 years ago
Comments: 17 (6 by maintainers)

Commits related to this issue

MultiThreadedExecutor:spin_until_future_complete can block when the future is ready (https://github.com/ros2/rclpy/issues/585). Signed-off-by: Tomoya Fujita <Tomoya.Fujita@sony.com>. Committed to fujitatomoya/ros2_test_prover by fujitatomoya 3 years ago.

Most upvoted comments

As @fujitatomoya explained here, there’s no deadlock. I’ve updated the PR title accordingly.

https://github.com/ros2/rclpy/pull/605 is a proposed fix. @craigh92 @fujitatomoya it would be great if you can confirm that the proposed fix solves the issue in the posted example, thanks!

ivanpauno on Jul 20, 2020

sure @fujitatomoya I will do so.

russkel on Aug 18, 2021

IMO, a custom executor should define spin_impl() rather than spin_once():

spin_once() can be a method of Executor that spins just once on a single thread.
spin() of Executor calls the custom executor's spin_impl(), whose behavior depends on the implementation.

I may be missing something; I'd like to hear from others.

fujitatomoya on Jul 1, 2020

This is NOT a deadlock; the main thread is just waiting in rcl_wait. The callback handed off via executor.submit(callback) does call set_result on the future, but before it does, the main thread has already called spin_once() again and is waiting in rcl_wait for the next ready event. With this sample program that wait never returns, because there is nothing left to do and no timeout is set.
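The ordering described above can be reproduced without ROS. The following is only a toy model built on the Python standard library, not rclpy code: a queue.Queue stands in for the wait set behind rcl_wait, a ThreadPoolExecutor stands in for the executor's worker threads, and the names spin_until_done and run_demo are made up for this illustration.

```python
import queue
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


def spin_until_done(ready, pool, future, wait_timeout=None):
    """Toy spin loop: hand ready callbacks to the pool until the future is done."""
    while not future.done():
        try:
            # Stand-in for rcl_wait: block until some callback is ready.
            callback = ready.get(timeout=wait_timeout)
        except queue.Empty:
            continue  # Timed out: loop around and re-check the future.
        pool.submit(callback)  # Runs on a worker thread, like executor.submit.


def run_demo(wait_timeout):
    """Return True if the spin loop returned on its own within half a second."""
    ready = queue.Queue()
    future = Future()

    def on_message():
        time.sleep(0.05)  # Finishes after the loop has re-entered the wait.
        future.set_result("hello1")

    ready.put(on_message)  # The single latched message.
    with ThreadPoolExecutor(max_workers=2) as pool:
        spinner = threading.Thread(
            target=spin_until_done,
            args=(ready, pool, future, wait_timeout),
            daemon=True,
        )
        spinner.start()
        spinner.join(timeout=0.5)
        returned = not spinner.is_alive()
        ready.put(lambda: None)  # Any new event wakes a loop that is still blocked.
        spinner.join(timeout=1.0)
    return returned
```

In this model, run_demo(None) leaves the loop blocked even though the future is already done, while run_demo(0.1) lets the loop wake up, re-check the future and return.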

I am not sure how to fix it. Any thoughts?

Does it need to call MultiThreadedExecutor::spin_once?
Could we provide and use spin_some instead?

Btw, setting a timeout for spin_until_future_complete avoids this problem.

fujitatomoya on Jun 25, 2020
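The timeout workaround mentioned in the comment above could look roughly like this. It is a minimal sketch, not part of rclpy: spin_until_done_polling and poll_sec are hypothetical names, and the only assumption about the executor is that it offers spin_until_future_complete(future, timeout_sec=...) as rclpy executors do.

```python
def spin_until_done_polling(executor, future, poll_sec=0.1):
    """Spin in short slices so a finished future is noticed without a new event.

    Hypothetical helper, not part of rclpy. `executor` is expected to offer
    spin_until_future_complete(future, timeout_sec=...).
    """
    while not future.done():
        # Each call returns after at most poll_sec, so the loop re-checks done().
        executor.spin_until_future_complete(future, timeout_sec=poll_sec)
    return future.result()
```

In the example from the report, this would replace the last line of main() with spin_until_done_polling(exe, future_msgs). The sketch does not check whether the context has been shut down, so a real program would probably want to add that condition to the loop.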