rclcpp: SharedFuture from async_send_request never becomes valid

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04.1 LTS (osrf/ros:humble-desktop docker image)
  • Installation type:
    • from source
  • Version or commit hash:
    • 4fa3489cfd075affb34812878b034ef8a462e379
  • DDS implementation:
    • both fastrtps and cyclonedds
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Build and run the unit test below

#include <gtest/gtest.h>
#include <rclcpp/rclcpp.hpp>

#include <std_srvs/srv/set_bool.hpp>

#include <string>

using namespace std::chrono;

class TestRegularService : public ::testing::Test
{
public:
    void SetUp() override
    {
        rclcpp::init(0, nullptr);

        m_node_executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();

        // Create a node
        auto opt = rclcpp::NodeOptions();
        auto node = rclcpp::Node::make_shared("server_node", opt);
        m_nodes.push_back(node);
        m_node_executor->add_node(node->get_node_base_interface());
        m_service = rclcpp::create_service<std_srvs::srv::SetBool>(
            node->get_node_base_interface(),
            node->get_node_services_interface(),
            "example_service",
            [node](const std::shared_ptr<std_srvs::srv::SetBool::Request> request, 
                const std::shared_ptr<std_srvs::srv::SetBool::Response> response) {
                static unsigned int serial_num = 1;
                (void)request;
                RCLCPP_INFO(node->get_logger(), "Received service client request... Sending response: %d", serial_num);
                response->success = true;
                response->message = std::to_string(serial_num++);
            },
            rclcpp::ServicesQoS().get_rmw_qos_profile(),
            nullptr);

        node = rclcpp::Node::make_shared("client_node", opt);
        m_nodes.push_back(node);
        m_node_executor->add_node(node);
        m_client = node->create_client<std_srvs::srv::SetBool>("example_service");

        

        m_node_thread = std::thread(std::bind([this]()
        {
            this->m_node_executor->spin();
        }));
    }

    ~TestRegularService()
    {
        if (rclcpp::ok())
        {
            rclcpp::shutdown();
        }
        m_node_thread.join();
    }

    std::vector<std::shared_ptr<rclcpp::Node>> m_nodes;
    std::shared_ptr<rclcpp::executors::SingleThreadedExecutor> m_node_executor;
    rclcpp::ServiceBase::SharedPtr m_service;
    rclcpp::Client<std_srvs::srv::SetBool>::SharedPtr m_client;
    std::thread m_node_thread;

};

TEST_F(TestRegularService, test_regular_service)
{
    auto request = std::make_shared<std_srvs::srv::SetBool::Request>();
    request->data = true;

    while (!m_client->wait_for_service(1s)) {
        if (!rclcpp::ok()) {
        RCLCPP_ERROR(m_nodes.front()->get_logger(), "Interrupted while waiting for the service. Exiting.");
        return;
        }
        RCLCPP_INFO(m_nodes.front()->get_logger(), "service not available, waiting again...");
    }
    uint counter = 1;
    do {
        RCLCPP_INFO(m_nodes.front()->get_logger(), "Sending request: %d" , counter);
        auto result = m_client->async_send_request(request);
        RCLCPP_INFO(m_nodes.front()->get_logger(), "Waiting for response: %d" , counter);
        auto answer = result.get();
        ASSERT_TRUE(answer->success);
        RCLCPP_INFO(m_nodes.front()->get_logger(), "Got response: %s", answer->message.c_str());
    }while(++counter<1000000);
}

int main(int argc, char **argv)
{
    ::testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}

Expected behavior

The unit test is expected to send an async request and a valid result will be retrieved with SharedFuture::get() later. The unit should repeat the process 1000000 times without issues.

Actual behavior

The unit test hangs randomly regardless of the underline DDS. Example snippet:

Start testing: Oct 31 20:35 UTC
----------------------------------------------------------
71/105 Testing: test-regular-ros2-service
71/105 Test: test-regular-ros2-service
Command: "/usr/bin/python3.10" "-u" "/ros2_humble/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/ros2_humble/build/rclcpp/test_results/rclcpp/test-regular-ros2-service.gtest.xml" "--package-name" "rclcpp" "--output-file" "/ros2_humble/build/rclcpp/ament_cmake_gtest/test-regular-ros2-service.txt" "--command" "/ros2_humble/build/rclcpp/test/rclcpp/test-regular-ros2-service" "--gtest_output=xml:/ros2_humble/build/rclcpp/test_results/rclcpp/test-regular-ros2-service.gtest.xml"
Directory: /ros2_humble/build/rclcpp/test/rclcpp
"test-regular-ros2-service" start time: Oct 31 20:35 UTC
Output:
----------------------------------------------------------
-- run_test.py: invoking following command in '/ros2_humble/build/rclcpp/test/rclcpp':
 - /ros2_humble/build/rclcpp/test/rclcpp/test-regular-ros2-service --gtest_output=xml:/ros2_humble/build/rclcpp/test_results/rclcpp/test-regular-ros2-service.gtest.xml
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestRegularService
[ RUN      ] TestRegularService.test_regular_service
[INFO] [1667248559.329850983] [server_node]: Sending request: 1
[INFO] [1667248559.329986353] [server_node]: Waiting for response: 1
[INFO] [1667248559.330659463] [server_node]: Received service client request... Sending response: 1
[INFO] [1667248559.330995192] [server_node]: Got response: 1
[INFO] [1667248559.331019237] [server_node]: Sending request: 2
[INFO] [1667248559.331041372] [server_node]: Waiting for response: 2
[INFO] [1667248559.331233045] [server_node]: Received service client request... Sending response: 2
[INFO] [1667248559.331454604] [server_node]: Got response: 2
[INFO] [1667248559.331478869] [server_node]: Sending request: 3
[INFO] [1667248559.331500609] [server_node]: Waiting for response: 3
[INFO] [1667248559.331623666] [server_node]: Received service client request... Sending response: 3
[INFO] [1667248559.331833534] [server_node]: Got response: 3
[INFO] [1667248559.331858362] [server_node]: Sending request: 4
[INFO] [1667248559.331879729] [server_node]: Waiting for response: 4
[INFO] [1667248559.331984129] [server_node]: Received service client request... Sending response: 4
[INFO] [1667248559.332344850] [server_node]: Got response: 4
.......
.......
.......
[INFO] [1667248559.887715670] [server_node]: Sending request: 2704
[INFO] [1667248559.887733614] [server_node]: Waiting for response: 2704
[INFO] [1667248559.887767825] [server_node]: Received service client request... Sending response: 2704
[INFO] [1667248559.887899850] [server_node]: Got response: 2704
[INFO] [1667248559.887913152] [server_node]: Sending request: 2705
[INFO] [1667248559.887931926] [server_node]: Waiting for response: 2705
[INFO] [1667248559.887967492] [server_node]: Received service client request... Sending response: 2705
[INFO] [1667248559.888097595] [server_node]: Got response: 2705
[INFO] [1667248559.888119401] [server_node]: Sending request: 2706
[INFO] [1667248559.888141074] [server_node]: Waiting for response: 2706
<end of output>
Test time =  60.06 sec
----------------------------------------------------------
Test Failed.
"test-regular-ros2-service" end time: Oct 31 20:36 UTC
"test-regular-ros2-service" time elapsed: 00:01:00
----------------------------------------------------------

End testing: Oct 31 20:36 UTC

gtest =  60.06 sec*proc

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 19 (10 by maintainers)

Most upvoted comments

@iuhilnehc-ynos Thanks for your tip, I have created PR https://github.com/ros2/rclcpp/pull/2044 to fix this issue.

The case makes sense to me.

The reason why future.get() is blocked is that promise does not set_value in handle_response.

If the request from client to service and the response from service to client is quick enough that done before putting the sequence_number into the pending_requests_, calling get_and_erase_pending_request in the handle_response can’t get that there was the sequence_number requested before.

I think that https://github.com/ros2/rclcpp/blob/7c6785176a878cf16eb98995b732f405b9e776ed/rclcpp/include/rclcpp/client.hpp#L800 should put before calling rcl_send_request.

Thanks @iuhilnehc-ynos, I tested the proposed fix and the example now passes reliably.

@jefferyyjhsu, thanks for the confirmation to let me know there is still an issue for rmw_fastrtps. I did the test yesterday, and I can reproduce the issue using rmw_fastrtps in whatever the humble or rolling (I didn’t use docker to test).

From the https://github.com/ros2/rclcpp/issues/2039#issuecomment-1309400835, I think that the issue for rmw_cyclonedds_cpp is fixed by https://github.com/ros2/rclcpp/pull/2044. (of course, it’s a rclcpp’s bug)

I am afraid that another bug exists in the Fast-DDS, but it needs to confirm by eProsima. I’d like to share my understanding below,

https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/WaitSetImpl.cpp#L159 doesn’t hold WaitSetImpl::mutex_, but called in https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/ConditionNotifier.cpp#L58.

https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/WaitSetImpl.cpp#L131 might lose a condition notification when condition_variable is waiting for the fill_active_conditions.

A case such as https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/WaitSetImpl.cpp#L123 with ret_val(false), and then there is a notification triggered at https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/StatusConditionImpl.cpp#L90.

A possible solution,

diff --git a/src/cpp/fastdds/core/condition/WaitSetImpl.cpp b/src/cpp/fastdds/core/condition/WaitSetImpl.cpp
index 73390e35d..ecd74b735 100644
--- a/src/cpp/fastdds/core/condition/WaitSetImpl.cpp
+++ b/src/cpp/fastdds/core/condition/WaitSetImpl.cpp
@@ -68,7 +68,7 @@ ReturnCode_t WaitSetImpl::attach_condition(
             // Should wake_up when adding a new triggered condition
             if (is_waiting_ && condition.get_trigger_value())
             {
-                wake_up();
+                cond_.notify_one();
             }
         }
     }
@@ -156,6 +156,7 @@ ReturnCode_t WaitSetImpl::get_conditions(
 
 void WaitSetImpl::wake_up()
 {
+    std::lock_guard<std::mutex> guard(mutex_);
     cond_.notify_one();
 }

could you help double-check if the issue still happened after applying the above patch for Fast-DDS?

@MiguelCompany @fujitatomoya @llapx Do you have any suggestions?

@iuhilnehc-ynos I think you may be right. The fix seems an easy one, but adding a regression test might be difficult. I’ll try to think of something …

@fujitatomoya

can you make sure https://github.com/ros2/rclcpp/issues/2039#issuecomment-1303160571 is the case for this issue

Confirmed, and it works well (the test case passed, no hang happend).

with using gdb or additional logging message?

Not yet, I will have a try.