rclcpp: High CPU usage when using DDS intra-process communication
CPU profiling of a ROS2 application shows that using DDS intra-process communication takes almost twice the CPU time of the rclcpp IPC (handled by the intra-process manager).
This happens because the executor makes a copy of the message when a subscription is ready to take it: https://github.com/ros2/rclcpp/blob/7d8b2690ff188672bfe3703d8b3c074b87303931/rclcpp/src/rclcpp/executor.cpp#L637
The copy is made regardless of the use of IPC by the DDS middleware, where the message might be passed from the publisher to the subscriber without actually copying it.
The following plot shows the total CPU time spent to publish a 100Kb message and receive it on the subscription.
- Tested with CycloneDDS and FastDDS, both with rclcpp IPC ON and OFF, using the StaticSingleThreadedExecutor on a single-core platform (RPi 1) at 500MHz:
Publisher → Msg: 100Kb ← Subscriber
The graph below shows a FlameGraph (CPU profiling) of a similar system, using the DDS intra-process path (rclcpp IPC disabled):
Publisher → Msg: 8mb at 10Hz ← Subscriber
We can see memmove/memset operations called twice: once by the publisher when creating the message, and once by the executor when taking the subscription message in execute_subscription(), where a new message is created.
We’re planning to fix this, maybe by reusing some of the logic used to take loaned messages. But first we’d like to know: is this a known issue, and is there already a plan for a fix, or a proposal on a possible fix implementation?
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 3
- Comments: 27 (7 by maintainers)
Benchmark results looking good with LoanedMessage!

@mauropasse The DataWriter is a wrapper of both an RTPSWriter and a WriterHistory. It is able to perform intraprocess deliveries, but that doesn’t mean it will avoid the serialization (i.e. copy). On its travel from the user sample on the publishing side to the user sample on the subscribing side, the following copies may be performed:
- a copy (serialization) on the DataWriter::write call (this is avoided when the input data comes from a loan)
- a copy when DataReader::take is called. When loans are used and the type is plain, a pointer to the serialized payload (which is binary compatible with the user type) is returned.

I hope this long explanation is clear enough.
Yes, if it’s possible we should try to reuse the logic for LoanedMessages. The rclcpp code shouldn’t really care about whether the message comes from the same process or from a different process via a shared-memory layer, as long as the underlying DDS supports optimized intra-process communication.
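For reference, rclcpp already exposes loan-based publishing, and a loan-aware take could follow the same shape. Below is a minimal publisher-side sketch; it assumes a ROS 2 installation and a middleware with loan support, and uses std_msgs::msg::Float64 purely for illustration (it is not the message type from the benchmarks above).

```cpp
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/float64.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("loan_demo");
  auto pub = node->create_publisher<std_msgs::msg::Float64>("data", 10);

  // Borrow a message from the middleware instead of constructing one:
  // with loan support, the payload lives in middleware-owned memory and
  // can reach an intra-process subscriber without an extra copy.
  auto loaned = pub->borrow_loaned_message();
  loaned.get().data = 42.0;

  // Publishing a LoanedMessage hands ownership back to the middleware.
  pub->publish(std::move(loaned));

  rclcpp::shutdown();
  return 0;
}
```

If the middleware cannot loan, `borrow_loaned_message()` falls back to an ordinary allocation, so the calling code stays the same either way.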
The problem is that this line is zero-initializing the message, right? https://github.com/ros2/rclcpp/blob/7d8b2690ff188672bfe3703d8b3c074b87303931/rclcpp/src/rclcpp/executor.cpp#L637
(make sure you use permalinks, otherwise the link will be pointing to another part of the code when the file is updated)
The difference between zero and default initialization in C++ is extremely subtle, but it’s also possible to achieve the second one.
E.g. we’re using
https://github.com/ros2/rclcpp/blob/1037822a63330495dcf0de5e8f20544375a5f116/rclcpp/include/rclcpp/message_memory_strategy.hpp#L87
which is equivalent to zero-initialization (it uses ::new (pv) T(v) according to cppreference). When using allocators it’s a bit harder to get default initialization (C++20 introduced allocate_shared_for_overwrite 😕), but a custom allocator whose construct() default-initializes should do it. (don’t expect that to work as-is, I haven’t actually tried it 😃)
Another thing to check is the default constructors of the message we’re generating. Those might be forcing zero-initialization, but AFAIR that was not the case.
@mauropasse
thanks 👍 but I am not sure how we can tell the latency difference from this graph, since it only shows the ratio of CPU consumption…
- rclcpp intra-process: rclcpp::executors::StaticSingleThreadedExecutor::spin is 47.01%, which includes the timer-based publish and the subscription calling the user callback.
- Fast-DDS intra-process: eprosima::fastrtps::rtps::AsyncWriterThread::run is 18.0% (as you mentioned), and rclcpp::executors::StaticSingleThreadedExecutor::spin is 28.02%, for a total of 46%.

I think the time-consuming ratio is almost the same? Just checking if I am missing something.
@fujitatomoya those results are with Fast-DDS, using the instructions provided by @MiguelCompany.
The bounded-size data type restriction also applies to Cyclone, and it additionally needs a few more changes before it will be willing to do this in the absence of shared memory (that’s a bookkeeping detail).
Frankly, I’m shocked that there are 2 memsets in the profile and that together they consume 50% of the time. Cyclone at least has a good argument for calling memmove when not doing loaned messages, but the time in memset seems really wasteful, as it is basically spent inside the allocation of a new message that then (presumably) will be filled with data. I guess that’s standard C++ behaviour, but you might want to look into that.