rmw_cyclonedds: slow publishing and performance for custom messages with large arrays

Bug report

Required Info:

Operating System:
- Ubuntu 20.04
Installation type:
- ROS2 galactic standard ubuntu package installed via apt
Version or commit hash: This is what the apt package info says: ros-galactic-cyclonedds/focal,now 0.8.0-5focal.20210608.002038 amd64 [installed,automatic]
DDS implementation:
- cyclonedds
Client library (if applicable):
- rclcpp
Hardware: AMD Ryzen 7 4800H with 64GB of memory

Steps to reproduce issue

I have made a very small repo with the below demo code and instructions how to run: https://github.com/berndpfrommer/ros2_issues Here is the source code for the publisher:

#include <unistd.h>

#include <rclcpp/rclcpp.hpp>
#include <ros2_issues/msg/test_array_complex.hpp>
#include <thread>

template <class MsgType>
struct TestPublisher : public rclcpp::Node
{
  explicit TestPublisher(const rclcpp::NodeOptions & options)
  : Node("test_publisher", options)
  {
    pub_ = create_publisher<MsgType>(
      "~/array", declare_parameter<int>("q_size", 1000));
    thread_ = std::thread([this]() {
      rclcpp::Rate rate(declare_parameter<int>("rate", 1000));
      const int numElements = declare_parameter<int>("num_elements", 100);
      rclcpp::Time t_start = now();
      size_t msg_cnt(0);
      const rclcpp::Duration logInterval = rclcpp::Duration::from_seconds(1.0);
      while (rclcpp::ok()) {
        MsgType msg;
        msg.header.stamp = now();
        msg.elements.resize(numElements);
        pub_->publish(msg);
        rate.sleep();
        msg_cnt++;
        rclcpp::Time t = now();
        const rclcpp::Duration dt = t - t_start;
        if (dt > logInterval) {
          RCLCPP_INFO(get_logger(), "pub rate: %8.2f", msg_cnt / dt.seconds());
          t_start = t;
          msg_cnt = 0;
        }
      }
    });
  }
  // -- variables                                                                                                                                                                                                  
  typename rclcpp::Publisher<MsgType>::SharedPtr pub_;
  std::thread thread_;
};

int main(int argc, char * argv[])
{
  rclcpp::init(argc, argv);
  auto node =
    std::make_shared<TestPublisher<ros2_issues::msg::TestArrayComplex>>(
      rclcpp::NodeOptions());
  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}

It uses the following custom message for TestArrayComplex:

std_msgs/Header header
# test array of elements
TestElement[] elements

and the TestElement of the array is defined as:

uint16 x
uint16 y
builtin_interfaces/Time ts
bool polarity

Expected behavior

Under ROS1 I can publish 1000 msgs/sec with 100,000 elements per message and receive at a rate of 1000Hz with rostopic hz

Actual behavior

Under ROS2 (galactic), already the publishing fails to keep up at a message size of 5,000 elements. Running the publisher with

RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run ros2_issues publisher_node --ros-args -p num_elements:=5000 -p rate:=1000

produces this output:

634242922.738549 [0] publisher_: using network interface wlo1 (udp/192.168.1.234) selected arbitrarily from: wlo1, virbr0, virbr1, docker0
[INFO] [1634242923.747148904] [test_publisher]: pub rate:   369.34
[INFO] [1634242924.749220049] [test_publisher]: pub rate:   373.22
...

So not even the publishing is full speed, without any subscriber to the topic. I see the publisher running at 100 %CPU, so something is really heavy weight about publishing. Worse, running rostopic hz shows a rate of about 30 msg/s.

RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 topic hz -w 100 /test_publisher/array
1634243031.116776 [0]       ros2: using network interface wlo1 (udp/192.168.1.234) selected arbitrarily from: wlo1, virbr0, virbr1, docker0
average rate: 31.275
        min: 0.030s max: 0.041s std dev: 0.00314s window: 33
average rate: 30.502
        min: 0.030s max: 0.044s std dev: 0.00351s window: 63

This is what I get from rostopic bw. The size of the message (about 80kb) agrees with what I computed by hand:

5.20 MB/s from 100 messages
        Message size mean: 0.08 MB min: 0.08 MB max: 0.08 MB

I tried sudo` sysctl -w net.core.rmem_max=8388608 net.core.rmem_default=8388608 and also was able to restrict the interface to loopback (lo) but no improvement.

FastRTPS is a bit better, at least here I can send messages with up to 50,000 elements before it falls off at 110,000 messages:

RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 run ros2_issues publisher_node_complex --ros-args -p num_elements:=110000 -p rate:=1000                                       
[INFO] [1634243372.579736663] [test_publisher]: pub rate:   404.43
[INFO] [1634243373.581322637] [test_publisher]: pub rate:   391.37
[INFO] [1634243374.583210605] [test_publisher]: pub rate:   414.22

But if I send messages of size 5,000, rostopic hz also shows about 30Hz, similar to rmw_cyclonedds_cpp.

Additional information

This is a show stopper for porting e.g. a driver for an event based camera from ROS1 to ROS2, see here: https://github.com/berndpfrommer/metavision_ros_driver.

The hardware is a 8-core AMD Ryzen laptop, less than 1 year old, so definitely not a slow machine, and this is all running on-machine, no network traffic.

To run the above code it is fastest to clone the very small repo linked above.

About this issue

Original URL
State: closed
Created 3 years ago
Reactions: 1
Comments: 21 (4 by maintainers)

Most upvoted comments

I had a similar issue here and found a solution for my environment.

humble, navigation2, cyclone dds
a large-size static map (3000 x 2000), global costmap always publishes the full map (1Hz)
rviz2 and ros2 bag record subscribe to the global costmap, velodyne_points and so on
velodyne_packets is 10Hz, but velodyne_points is around 6Hz and delayed

I checked some documents/comments about tuning.

mentions only rmem
- https://docs.ros.org/en/rolling/How-To-Guides/DDS-tuning.html#cyclone-dds-tuning
- https://github.com/ros2/rmw_cyclonedds#performance-recommendations
mentions both rmem and wmem
- https://discourse.ros.org/t/latency-and-throughput-in-ros2/4367/17

Only rmem settings did not work, but both rmem and wmem work like a charm! In my case -16MB did not fix the issue, but 64MB works well.

sudo sysctl -w net.core.rmem_max=67108864 net.core.rmem_default=67108864
sudo sysctl -w net.core.wmem_max=67108864 net.core.wmem_default=67108864

daisukes on May 16, 2023

sudo sysctl -w net.core.rmem_max=67108864 net.core.rmem_default=67108864 sudo sysctl -w net.core.wmem_max=67108864 net.core.wmem_default=67108864

This is a working solution. I tested on Ubuntu 22.04 with humble and cyclone dds.

jimmyshe on Sep 22, 2023

My issues so far have been entirely in C++.

From what I can tell, basically the issue boils down to the way serialization in (at least) rmw_cyclonedds being a extremely naïve implementation. For every single non-POD type, you incur a hashmap lookup and multiple function calls of indirection.

My problem here is that doing high performance message serialization if you know the messages at compile time is basically a solved problem. Things like google protobuf came out in 2001. Did no one look at how it (or any competing message serialization libraries) worked?

How this should work is that the codegen step should generate a single function (or possibly several functions) that handles (de)serializaing a complete message of <type> from the wire format to a C++/Python data structure. The entire dynamic lookup parsing disaster in the middleware is a mistake, and it also has the side-effect of preventing things like the compiler inlining functions.

ROS2 doesn’t support defining new messages at runtime, so there should be no dynamic runtime message parsing at all.

fake-name on Jun 25, 2022