rclcpp: Exception thrown while waiting for action result: RCLError "failed to add guard condition to wait set"

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04 LTS
  • Installation type:
    • Binaries
  • Version or commit hash:
    • Humble, from the 2023-01-27 sync snapshot
  • DDS implementation:
    • CycloneDDS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Calling rclcpp_action::Client::async_get_result() after rclcpp_action::Client::async_send_goal() sometimes throws rclcpp::exceptions::RCLError with a “failed to add guard condition to wait set: guard condition implementation is invalid” message.

While I don’t have a totally-atomic code sample that reproduces this issue independent of my project, the code in question is similar to this:

  const auto goal_handle_future = client_->async_send_goal(goal);
  if (goal_handle_future.wait_for(goal_response_timeout_) == std::future_status::timeout)
  {
    // handle failure to get response to goal request
    return false;
  }

  const auto goal_handle = goal_handle_future.get();
  if (!goal_handle)
  {
    // handle rejection of the goal request
    return false;
  }

  auto wrapped_result_future = client_->async_get_result(goal_handle);
  if (wrapped_result_future.wait_for(goal_result_timeout_) == std::future_status::timeout)
  {
    // handle timeout before action result is received
    return false;
  }

  // handle success
  return true;

Expected behavior

If the goal_handle future returned from async_send_goal() is set (indicating that the goal request was accepted), then I should always be able to use the goal_handle to wait for the action result.

Actual behavior

For some small proportion of action goals (on the order of 1 in 100), an exception is thrown:

    [test_node_main-1] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
    [test_node_main-1]   what():  failed to add guard condition to wait set: guard condition implementation is invalid, at ./src/rcl/guard_condition.c:172, at ./src/rcl/wait.c:460
    [test_node_main-1] Stack trace (most recent call last) in thread 18342:
    [test_node_main-1] #16   Object "/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2", at 0xffffffffffffffff, in 
    [test_node_main-1] #15   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe4282bb3, in __clone
    [test_node_main-1] #14   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe41f1b42, in 
    [test_node_main-1] #13   Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe44812b2, in 
    [test_node_main-1] #12   Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b418e1, in rclcpp::executors::MultiThreadedExecutor::run(unsigned long)
    [test_node_main-1] #11   Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b3aab2, in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    [test_node_main-1] #10   Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b37d94, in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    [test_node_main-1] #9    Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b4ac86, in 
    [test_node_main-1] #8    Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b385ab, in rclcpp::detail::add_guard_condition_to_rcl_wait_set(rcl_wait_set_s&, rclcpp::GuardCondition const&)
    [test_node_main-1] #7    Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b34838, in rclcpp::exceptions::throw_from_rcl_error(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rcutils_error_state_s const*, void (*)())
    [test_node_main-1] #6    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe445323d, in std::rethrow_exception(std::__exception_ptr::exception_ptr)
    [test_node_main-1] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe44532b6, in std::terminate()
    [test_node_main-1] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe445324b, in 
    [test_node_main-1] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe4447bbd, in 
    [test_node_main-1] #2    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe41857f2, in abort
    [test_node_main-1] #1    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe419f475, in raise
    [test_node_main-1] #0    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe41f3a7c, in pthread_kill
    [test_node_main-1] Aborted (Signal sent by tkill() 18291 0)

Additional information

Based on the stack trace, this looks like a problem at the intersection of the rclcpp action client and the MultiThreadedExecutor, which is a place where I’ve had issues roughly similar to this one in the past.

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 17 (4 by maintainers)

Most upvoted comments

In the new executor structures (https://github.com/ros2/rclcpp/pull/2143/files), I have worked around this by making the node return a rclcpp::GuardCondition::SharedPtr, which makes it more consistent with the CallbackGroup API.

These guard conditions are all added to a single waitable and held as weak pointers, which are locked right before adding to the rcl_wait_set so that we can ensure they are still valid: https://github.com/ros2/rclcpp/blob/mjcarroll/executor_structures/rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp#L46-L64

Hi,

If I can help, I was doing a tutorial and found this problem. The error is reproducible, which may help solve the problem by providing a testbed for this issue.

https://github.com/fmrico/vector_transmission

In different terminals

ros2 run rclcpp_components component_container
ros2 component load /ComponentManager vector_transmission vector_transmission::VectorProducer
ros2 component load /ComponentManager vector_transmission vector_transmission::VectorConsumer

The error always happens when unloading the Consumer:

ros2 component unload /ComponentManager 2

I hope this helps!!

I’ll try to create a reproducible example now that I have a clearer idea of what’s going on.

Yeah, that would be fantastic. It would be good to have a test that shows this problem, and thus shows that @mjcarroll 's fixes actually fix it (hopefully).

shared_ptr wasn’t used here. So this problem occurs.

I have an update in flight to the executors that should make the handling of the guard conditions more thread-safe. I would be interested if you can reproduce with this PR @schornakj https://github.com/ros2/rclcpp/pull/2142