rclcpp: Executor quick destruction after spin hangs

Bug report

Required Info:

  • Operating System: Ubuntu 20.04
  • Installation type: Binaries
  • Version or commit hash: 5.0.0-1focal.20201007.211227
  • DDS implementation: Fast DDS
  • Client library (if applicable): rclcpp

Steps to reproduce issue

TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor->spin();
    });
  //  std::this_thread::sleep_for(50ms);
  executor->cancel();
}

Expected behavior

That the test ends.

Actual behavior

Very frequently, the test hangs on the future destruction, waiting for thread join(); Because the thread is stuck on executor->spin() (wait_for_work exactly)

Additional information

Uncommenting the sleep_for fixes the issue on my machine, so there seems to be a race condition. In our use case, we have test than end faster than that time and trigger this issue, we can add the sleep as a workaround but it’s ugly: https://github.com/ros-controls/ros2_control/pull/234/files/833b1661209cc12a9d6f9ef45c87f4072e641aa7#diff-28c9aa0318c48cffb66c294bbf17394b6f9295bb8b9fb83d2cac539699f7e354R130

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Comments: 18 (16 by maintainers)

Most upvoted comments

@ivanpauno is this issue not beginner friendly?

I’m not pretty sure what we want to do here, so resolving the issue involves first discussing what’s desired. After that is resolved, implementing it shouldn’t be a problem, but the good first issue label doesn’t feel accurate at the moment. If you have any opinion of how it should work feel free to chime in.

I will try to reread the comments again in a few days and post something.

I also feel the discussion somewhat overlaps with how rclcpp::shutdown() should work (https://github.com/ros2/rclcpp/issues/1139), though it’s not exactly the same issue.

The semantics of cancel is strange:

  • If spin is already running, it will return. But you can spin again afterwards, and those will continue blocking.
  • A cancel that happens while the executor is not spinning doesn’t have any effect (the first wait of the executor will be awaken immediately because a guard condition was triggered by cancel, but spin won’t return and will continue doing work).
  • rclpy Executor doesn’t have a cancel method, but rather a shutdown one. spin() calls after shutdown happened will not do work.

I’m not sure what is the intended semantics of cancel, but its current behavior doesn’t sound quite useful (it will always be racy).

Moreover, a cancel that happens while the executor is not spinning doesn’t have any effect.

Maybe instead of that, the first spin after a cancel should not do work. The other option is to make it behave like rclpy.Executor.shutdown.

@wjwwood any ideas?