rclcpp: Action client feedback callback does not reliably trigger

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04 LTS
  • Installation type:
    • Binaries
  • Version or commit hash:
    • Foxy, Rolling
  • DDS implementation:
    • CycloneDDS
  • Client library (if applicable):
    • rclcpp

Expected behavior

My packages’ integration tests should run with the same behavior both locally and in CI.

Actual behavior

When running in an environment with a comparatively small number of CPU cores (for example, 2 cores for Travis and GH Actions runners), my tests sometimes fail because action feedback that I require to pass the test was never received by my action client.

To test this, I’m using the fibonacci_action_server node from the action_tutorials_cpp package, which includes an action server of type action_tutorials_interfaces::action::Fibonacci.

My test node sends a goal to the Fibonacci action with order = 5. I expect the action server to publish 4 feedback messages. In the log output from my tests, I see that the server does publish 4 feedback messages, but the feedback callback registered with my action client is only called for the first 3 messages. Since my test needs to receive 4 feedback messages to succeed, it times out waiting for the 4th message and fails.

taskset -c 0-1 python3 -u /opt/ros/rolling/share/ament_cmake_test/cmake/run_test.py /tmp/asd \
  --package-name rclcpp_multicore_ci_test \
  --env TEST_LAUNCH_DIR=src/rclcpp_multicore_ci_test/test TEST_EXECUTABLE=build/rclcpp_multicore_ci_test/test_fixture_node \
  --command src/rclcpp_multicore_ci_test/test/test_nodes.py
[INFO] [fibonacci_action_server-1]: process started with pid [2053673]
[INFO] [test_node-2]: process started with pid [2053675]
[test_node-2] [==========] Running 1 test from 1 test suite.
[test_node-2] [----------] Global test environment set-up.
[test_node-2] [----------] 1 test from TaskSequenceInterface
[test_node-2] [ RUN      ] TaskSequenceInterface.setup_plan_exec_through_ros
[fibonacci_action_server-1] 1621445668.874821 [0] fibonacci_: using network interface wlp0s20f3 (udp/192.168.0.4) selected arbitrarily from: wlp0s20f3, docker0
[test_node-2] 1621445668.875303 [0] test_fixtu: using network interface wlp0s20f3 (udp/192.168.0.4) selected arbitrarily from: wlp0s20f3, docker0
[fibonacci_action_server-1] [INFO] [1621445668.891735102] [fibonacci_action_server]: Received goal request with order 5
[fibonacci_action_server-1] [INFO] [1621445668.892005547] [fibonacci_action_server]: Executing goal
[fibonacci_action_server-1] [INFO] [1621445668.892090501] [fibonacci_action_server]: Publish feedback
[fibonacci_action_server-1] [INFO] [1621445669.892521092] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621445669.892611049] [test_node]: Have 1 state feedback messages
[fibonacci_action_server-1] [INFO] [1621445670.892371325] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621445670.892701509] [test_node]: Have 2 state feedback messages
[fibonacci_action_server-1] [INFO] [1621445671.892378535] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621445671.892631492] [test_node]: Have 3 state feedback messages
[fibonacci_action_server-1] [INFO] [1621445672.892589940] [fibonacci_action_server]: Goal succeeded
[test_node-2] [ERROR] [1621445678.893078911] [test_node]: Timed out waiting for feedback vector to be populated
[test_node-2] terminate called without an active exception
[test_node-2] /home/jschornak/workspaces/test_ws/src/rclcpp_multicore_ci_test/src/test_fixture_node.cpp:99: Failure
[test_node-2] Value of: node->testActionFeebackCount()
[test_node-2]   Actual: false
[test_node-2] Expected: true
[ERROR] [test_node-2]: process has died [pid 2053675, exit code -6, cmd 'build/rclcpp_multicore_ci_test/test_fixture_node'].
[INFO] [fibonacci_action_server-1]: sending signal 'SIGINT' to process[fibonacci_action_server-1]
[fibonacci_action_server-1] [INFO] [1621445679.071037031] [rclcpp]: signal_handler(signal_value=2)
[INFO] [fibonacci_action_server-1]: process has finished cleanly [pid 2053673]

I think issues similar to this are causing problems in a variety of ROS2 packages, such as during this PR for MoveIt2.

Steps to reproduce issue

I made a minimal package to reproduce this issue: https://github.com/schornakj/rclcpp_multicore_ci_test/

This mirrors how similar integration tests run in my CI pipelines.

More detailed instructions are in the repo’s README.

Additional information

My dev workstation is a Lenovo Thinkpad P15 with an Intel i7-10750H CPU, which has 12 cores, and 16 GB RAM.

Source code for my test executables is below (and also in the repo I linked previously):

test_fixture_node.hpp

#pragma once

#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>
#include <action_tutorials_interfaces/action/fibonacci.hpp>

namespace rclcpp_multicore_ci_test
{
using action_tutorials_interfaces::action::Fibonacci;
using GoalHandleFibonacci = rclcpp_action::ClientGoalHandle<Fibonacci>;

class TestFixtureNode : public rclcpp::Node
{
public:
  TestFixtureNode();

  bool testActionFeebackCount();

private:
  rclcpp::CallbackGroup::SharedPtr cb_group_me_;
  rclcpp_action::Client<Fibonacci>::SharedPtr fib_client_;
};
}  // namespace rclcpp_multicore_ci_test

test_fixture_node.cpp

#include <rclcpp_thread_test/test_fixture_node.hpp>

#include <gtest/gtest.h>


static const std::size_t ORDER = 5;

namespace rclcpp_multicore_ci_test
{

TestFixtureNode::TestFixtureNode()
  : Node("test_node")
  , cb_group_me_(create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive))
  , fib_client_(rclcpp_action::create_client<Fibonacci>(this, "/fibonacci", cb_group_me_))
{
}

bool TestFixtureNode::testActionFeebackCount()
{ 
  if(!fib_client_->wait_for_action_server(std::chrono::seconds(3)))
  {
    RCLCPP_ERROR_STREAM(get_logger(), "Timed out waiting for action server to be available");
    return false;
  }

  auto options = rclcpp_action::Client<Fibonacci>::SendGoalOptions();
  options.goal_response_callback = [](std::shared_future<GoalHandleFibonacci::SharedPtr> future) -> void {
    (void)future;
  };

  std::mutex lk;
  std::vector<Fibonacci::Feedback> feedback;
  std::promise<void> promise;

  options.feedback_callback = [=, &feedback, &lk,
                               &promise](GoalHandleFibonacci::SharedPtr gh,
                                         std::shared_ptr<const Fibonacci::Feedback> fb) -> void {
    (void)gh;
    std::scoped_lock _(lk);
    feedback.push_back(*fb);
    RCLCPP_INFO_STREAM(get_logger(), "Have " << feedback.size() << " state feedback messages");
    if (feedback.size() == ORDER - 1)
    {
      promise.set_value();
    }
  };

  options.result_callback = [](const GoalHandleFibonacci::WrappedResult& result) -> void { (void)result; };

  auto goal = Fibonacci::Goal();
  goal.order = ORDER;

  auto fut = fib_client_->async_send_goal(goal, options);

  if (fut.wait_for(std::chrono::seconds(1)) == std::future_status::timeout)
  {
    RCLCPP_ERROR_STREAM(get_logger(), "Timed out waiting for goal response callback");
    return false;
  }

  auto gh = fut.get();
  auto res_fut = fib_client_->async_get_result(gh);
  if (res_fut.wait_for(std::chrono::seconds(ORDER + 1)) == std::future_status::timeout)
  {
    RCLCPP_ERROR_STREAM(get_logger(), "Timed out waiting for result callback");
    return false;
  }

  auto received_state_fut = promise.get_future();
  if (received_state_fut.wait_for(std::chrono::seconds(ORDER + 1)) == std::future_status::timeout)
  {
    RCLCPP_ERROR_STREAM(get_logger(), "Timed out waiting for feedback vector to be populated");
    return false;
  }

  if (feedback.size() != ORDER - 1)
  {
    RCLCPP_ERROR_STREAM(get_logger(), "Wrong number of feedback messages: expected " << ORDER - 1 <<  ", got " << feedback.size() << ".");
    return false;
  }

  return true;
}

}  // namespace rclcpp_multicore_ci_test

TEST(TaskSequenceInterface, setup_plan_exec_through_ros)
{
  auto node = std::make_shared<rclcpp_multicore_ci_test::TestFixtureNode>();

  rclcpp::executors::MultiThreadedExecutor exec;

  auto exec_thread = std::make_unique<std::thread>([&exec, &node]() {
    exec.add_node(node);
    exec.spin();
    exec.remove_node(node);
  });

  ASSERT_TRUE(node->testActionFeebackCount());

  exec.cancel();
  exec_thread->join();
}

int main(int argc, char* argv[])
{
  rclcpp::init(argc, argv);

  testing::InitGoogleTest(&argc, argv);

  rclcpp::NodeOptions options;

  bool all_successful = RUN_ALL_TESTS();

  rclcpp::shutdown();

  return all_successful;
}

test_nodes.py

#!/usr/bin/env python3

import os
import sys
from launch import LaunchService, LaunchDescription
from launch.actions import EmitEvent, ExecuteProcess, RegisterEventHandler
from launch.events import Shutdown
from launch.event_handlers import OnProcessExit
from launch_ros.actions import Node
from launch_testing.legacy import LaunchTestService


def main(argv=sys.argv[1:]):
    test_executable = os.getenv("TEST_EXECUTABLE")

    test_action = ExecuteProcess(
        cmd=[test_executable], name="test_node", output="screen"
    )

    fib_server_node = Node(
        package="action_tutorials_cpp",
        executable="fibonacci_action_server",
        name="fibonacci_action_server",
        output="screen",
    )

    ld = LaunchDescription([])
    ld.add_action(fib_server_node)

    lts = LaunchTestService()
    lts.add_test_action(ld, test_action)

    ls = LaunchService(argv=argv)
    ls.include_launch_description(ld)
    return lts.run(ls)


if __name__ == "__main__":
    sys.exit(main())

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 28 (14 by maintainers)

Most upvoted comments

I think the problem is that goal_handle hadn’t been inserted yet while the feedback message came.

Maybe the GoalHandle should be inserted when sending the goal request and not when receiving the goal response to avoid the race. I’m not completely sure if that’s an okay solution or not.

There are some examples here:

https://design.ros2.org/articles/actions.html#clientserver-interaction-examples

But all of them indicate feedback is sent after the result request is received. I wonder if the intention is that you should not send feedback for a goal until the result request is received. This would ensure that feedback is never sent before the client has received the “goal requested” response.

CC @jacobperron, he might have more insight.

I think the main question is should every feedback_message be received safely by the client? If not, perhaps this is not a real problem.😳

@fujitatomoya I need to set up a workspace to build the ROS2 main branches from source, but once I can do that I’ll test your branch.

If it really receives a feedback message before goal_response message, it’d still be a issue, right?

at least server sends goal_response before feedback, but i do not think it guarantees the same order on client side.