rclcpp: ActionClient Crash: Taking data from action client but nothing is ready
Bug report
Required Info:
- Operating System:
- Ubuntu 22.04
- Installation type:
- binaries
- Version or commit hash:
- iron
- DDS implementation:
- Fast-RTPS
- Client library (if applicable):
- rclcpp
Steps to reproduce issue
Start action server
- server shall accept goal, but not terminate it
Start action client
- send goal
wait
Expected behavior
no crash
Actual behavior
after some time we get the exception
terminate called after throwing an instance of 'std::runtime_error' what(): Taking data from action client but nothing is ready
Additional information
Using a multi threaded spinner, but the default callback group, I guess this one is mutual exclusive
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 25 (14 by maintainers)
This problem is hard to reproduce, you need a certain workload and a actions running with a multithreaded spinner.
As for the explanation of what is going on, this is a long one…
First we need to look at the multithreaded executor: https://github.com/ros2/rclcpp/blob/e2965831d51e9be03470cb07f8721012afcade9b/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp#L82-L100
The important point here is that get_next_executable is called with a lock and execute_any_executable(any_exec) is called without the lock.
Now we need to look at the function https://github.com/ros2/rclcpp/blob/e2965831d51e9be03470cb07f8721012afcade9b/rclcpp_action/src/client.cpp#L266-L274 This function will be called by get_next_executable deep under the hood. The important point here is that the kind of next executable is stored in the global pimpl.
Next is the function https://github.com/ros2/rclcpp/blob/e2965831d51e9be03470cb07f8721012afcade9b/rclcpp_action/src/client.cpp#L622 This function is called by execute_any_executable. As pointed out above, this call is not protected any more by a lock, and any other thread may have been calling ClientBase::is_ready in-between.
As one can see, the execute function uses the values in the global pimpl object to determine the kind of action that needs processing. This it the point were the race occurs, as the global pimpl may have been altered by another thread, leading to the “Executing action client but nothing is ready” exception.
Note, I also saw races between is_ready and take_data. Which are a mystery to me, as this should be protected by the mutex in the spinner. The patch also addresses these.
@jmachowinski, I was thinking about clearing
waitable_triggered_handles_along with other handles inclear_handles()as it’s called by the executor in the beginning ofwait_for_work()routine. However I am not sure about the contract here as potentially the following is possible:collect_entities()adds a waitable in the list, it’s then moved towaitable_triggered_handles_but not getting executed, on the next iteration the same waitable is not added into the list as it’s not in a group anymore hence it’s not executed while it was marked ready for execution on the previous iteration. I am not that familiar with the internals so cannot say for sure if such scenario is even a valid one or not. Something tells it’s not but didn’t want to have more issues so went with a check before adding a waitable instead of clearing the list. If someone familiar with the matter can confirm that then clearing the triggered list might be a better approach.Thank you, @jmachowinski , for additional details. After getting the race condition caused by
execute()fixed I continued to experience “Taking data from action client…” runtime error and at this point it wouldn’t be caused by any race condition asis_ready()andtake_data()couldn’t be called by different threads of the multi threaded executor. So I kept poking around and found the actual cause for it -take_data()might get called multiple times for a single message in the queue.Quick overview of involved functions:
get_next_executable()first checks if there is a ready executable by callingget_next_ready_executable(), if none then it callswait_for_work()and runsget_next_ready_executable()again.get_next_ready_executable()gets a next waitable from the memory strategy and callstake_data()on it. The memory strategy pops next waitable from the triggered waitables list which has an execution slot in its callback group - in case of mutually exclusive group no other callbacks are being executed at the moment.wait_for_work()clears all handles in the memory strategy, collects all handles from registered nodes (this is when our action client is added), then updates the wait set and finally callsremove_null_handles()on the memory strategy. The latter checks what waitables are ready by callingis_ready()on each of them and moves the ready ones to the triggered waitables list.Now what sequence leads to the multiple calls of
take_data():get_next_ready_executable()returns nothing andwait_for_work()is called.is_ready()returned true and it was moved to the triggered waitable listget_next_ready_executable()doesn’t return that waitable as something else is running in its callback group which is set to mutually exclusive mode or some other service was returnedget_next_ready_executable()is called and it returns nothing - our waitable is still blocked by the callback group and it’s being kept in the triggered list.wait_for_work()is called again and finds that our waitable is still ready hence it adds it again to the triggered list. Now we have two records for the same waitable.get_next_ready_executable()finally returns the first waitable from the triggered waitable list which already gottake_data()invoked.take_data()will fail as it doesn’t know what message type to read.So the actual problem must be in AllocatorMemoryStrategy as it blindly adds ready waitables to the triggered list. https://github.com/ros2/rclcpp/blob/e2965831d51e9be03470cb07f8721012afcade9b/rclcpp/include/rclcpp/strategies/allocator_memory_strategy.hpp#L123-L127
I ended up with a simple change to add a waitable only if it’s not present in the triggered list and it fixed the problem for me. This is in addition to a change in
execute()method to not depend on class members for a message type.I also tried your fix and works as well as it makes
is_ready()to read data in place and not intake_data()anymore hence no duplicates in the triggered list possible.Not sure what else might be impacted by the flow described above besides action client base - don’t know what else uses Waitable interface in rclcpp.