rmw_cyclonedds: slow publishing and performance for custom messages with large arrays
Bug report
Required Info:
- Operating System:
- Ubuntu 20.04
- Installation type:
- ROS2 galactic standard ubuntu package installed via apt
- Version or commit hash:
This is what the apt package info says:
ros-galactic-cyclonedds/focal,now 0.8.0-5focal.20210608.002038 amd64 [installed,automatic] - DDS implementation:
- cyclonedds
- Client library (if applicable):
- rclcpp
- Hardware: AMD Ryzen 7 4800H with 64GB of memory
Steps to reproduce issue
I have made a very small repo with the below demo code and instructions how to run: https://github.com/berndpfrommer/ros2_issues Here is the source code for the publisher:
#include <unistd.h>
#include <rclcpp/rclcpp.hpp>
#include <ros2_issues/msg/test_array_complex.hpp>
#include <thread>
template <class MsgType>
struct TestPublisher : public rclcpp::Node
{
explicit TestPublisher(const rclcpp::NodeOptions & options)
: Node("test_publisher", options)
{
pub_ = create_publisher<MsgType>(
"~/array", declare_parameter<int>("q_size", 1000));
thread_ = std::thread([this]() {
rclcpp::Rate rate(declare_parameter<int>("rate", 1000));
const int numElements = declare_parameter<int>("num_elements", 100);
rclcpp::Time t_start = now();
size_t msg_cnt(0);
const rclcpp::Duration logInterval = rclcpp::Duration::from_seconds(1.0);
while (rclcpp::ok()) {
MsgType msg;
msg.header.stamp = now();
msg.elements.resize(numElements);
pub_->publish(msg);
rate.sleep();
msg_cnt++;
rclcpp::Time t = now();
const rclcpp::Duration dt = t - t_start;
if (dt > logInterval) {
RCLCPP_INFO(get_logger(), "pub rate: %8.2f", msg_cnt / dt.seconds());
t_start = t;
msg_cnt = 0;
}
}
});
}
// -- variables
typename rclcpp::Publisher<MsgType>::SharedPtr pub_;
std::thread thread_;
};
int main(int argc, char * argv[])
{
rclcpp::init(argc, argv);
auto node =
std::make_shared<TestPublisher<ros2_issues::msg::TestArrayComplex>>(
rclcpp::NodeOptions());
rclcpp::spin(node);
rclcpp::shutdown();
return 0;
}
It uses the following custom message for TestArrayComplex:
std_msgs/Header header
# test array of elements
TestElement[] elements
and the TestElement of the array is defined as:
uint16 x
uint16 y
builtin_interfaces/Time ts
bool polarity
Expected behavior
Under ROS1 I can publish 1000 msgs/sec with 100,000 elements per message and receive at a rate of 1000Hz with rostopic hz
Actual behavior
Under ROS2 (galactic), already the publishing fails to keep up at a message size of 5,000 elements. Running the publisher with
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run ros2_issues publisher_node --ros-args -p num_elements:=5000 -p rate:=1000
produces this output:
634242922.738549 [0] publisher_: using network interface wlo1 (udp/192.168.1.234) selected arbitrarily from: wlo1, virbr0, virbr1, docker0
[INFO] [1634242923.747148904] [test_publisher]: pub rate: 369.34
[INFO] [1634242924.749220049] [test_publisher]: pub rate: 373.22
...
So not even the publishing is full speed, without any subscriber to the topic. I see the publisher running at 100 %CPU, so something is really heavy weight about publishing.
Worse, running rostopic hz shows a rate of about 30 msg/s.
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 topic hz -w 100 /test_publisher/array
1634243031.116776 [0] ros2: using network interface wlo1 (udp/192.168.1.234) selected arbitrarily from: wlo1, virbr0, virbr1, docker0
average rate: 31.275
min: 0.030s max: 0.041s std dev: 0.00314s window: 33
average rate: 30.502
min: 0.030s max: 0.044s std dev: 0.00351s window: 63
This is what I get from rostopic bw. The size of the message (about 80kb) agrees with what I computed by hand:
5.20 MB/s from 100 messages
Message size mean: 0.08 MB min: 0.08 MB max: 0.08 MB
I tried sudo` sysctl -w net.core.rmem_max=8388608 net.core.rmem_default=8388608 and also was able to restrict the interface to loopback (lo) but no improvement.
FastRTPS is a bit better, at least here I can send messages with up to 50,000 elements before it falls off at 110,000 messages:
RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 run ros2_issues publisher_node_complex --ros-args -p num_elements:=110000 -p rate:=1000
[INFO] [1634243372.579736663] [test_publisher]: pub rate: 404.43
[INFO] [1634243373.581322637] [test_publisher]: pub rate: 391.37
[INFO] [1634243374.583210605] [test_publisher]: pub rate: 414.22
But if I send messages of size 5,000, rostopic hz also shows about 30Hz, similar to rmw_cyclonedds_cpp.
Additional information
This is a show stopper for porting e.g. a driver for an event based camera from ROS1 to ROS2, see here: https://github.com/berndpfrommer/metavision_ros_driver.
The hardware is a 8-core AMD Ryzen laptop, less than 1 year old, so definitely not a slow machine, and this is all running on-machine, no network traffic.
To run the above code it is fastest to clone the very small repo linked above.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 21 (4 by maintainers)
I had a similar issue here and found a solution for my environment.
rviz2andros2 bag recordsubscribe to the global costmap, velodyne_points and so onI checked some documents/comments about tuning.
Only rmem settings did not work, but both rmem and wmem work like a charm! In my case -16MB did not fix the issue, but 64MB works well.
This is a working solution. I tested on Ubuntu 22.04 with humble and cyclone dds.
My issues so far have been entirely in C++.
From what I can tell, basically the issue boils down to the way serialization in (at least)
rmw_cycloneddsbeing a extremely naïve implementation. For every single non-POD type, you incur a hashmap lookup and multiple function calls of indirection.My problem here is that doing high performance message serialization if you know the messages at compile time is basically a solved problem. Things like google protobuf came out in 2001. Did no one look at how it (or any competing message serialization libraries) worked?
How this should work is that the codegen step should generate a single function (or possibly several functions) that handles (de)serializaing a complete message of
<type>from the wire format to a C++/Python data structure. The entire dynamic lookup parsing disaster in the middleware is a mistake, and it also has the side-effect of preventing things like the compiler inlining functions.ROS2 doesn’t support defining new messages at runtime, so there should be no dynamic runtime message parsing at all.