rclcpp: High CPU usage when using DDS intra-process communication

CPU profiling of a ROS2 application shows that using DDS intra-process communication consumes almost twice the CPU time of the rclcpp IPC (handled by the intra-process manager).

This happens because the executor makes a copy of the message when a subscription is ready to take it: https://github.com/ros2/rclcpp/blob/7d8b2690ff188672bfe3703d8b3c074b87303931/rclcpp/src/rclcpp/executor.cpp#L637

The copy is made regardless of whether the DDS middleware uses its own intra-process mechanism, in which case the message might be passed from publisher to subscriber without actually being copied.
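
For context, the take path in the executor follows roughly this pattern (a paraphrase of Executor::execute_subscription() at the linked commit, not a verbatim copy):

  // Paraphrased sketch: the freshly created message is value-initialized
  // (hence the memset) before the middleware copies the received sample into it.
  rclcpp::MessageInfo message_info;
  std::shared_ptr<void> message = subscription->create_message();  // allocates + zero-initializes
  if (subscription->take_type_erased(message.get(), message_info)) {  // copies the DDS sample in
    subscription->handle_message(message, message_info);  // dispatches the user callback
  }
  subscription->return_message(message);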

The following plot shows the total CPU time spent publishing a 100 KB message and receiving it on the subscription.

  • Tested with CycloneDDS and FastDDS, both with rclcpp IPC ON and OFF, using the StaticSingleThreadedExecutor on a single-core platform (RPi 1) at 500 MHz:

[Figure: Publisher → Msg (100 KB) → Subscriber, CPU time plot]

The graph below shows a FlameGraph (CPU profiling) of a similar system, using the DDS intra-process path (rclcpp IPC disabled):

[Figure: Publisher → Msg (8 MB at 10 Hz) → Subscriber, flame graph]

We can see memmove/memset operations called twice: once by the publisher when creating the message, and once by the executor when taking the subscription message in execute_subscription(), where a new message is created.

We’re planning to fix this, possibly by reusing some of the logic used to take loaned messages. But first we’d like to know: is this a known issue? Is there already a plan for a fix, or a proposal for a possible implementation?

Most upvoted comments

Benchmark results looking good with LoanedMessage!

Test: Pub/Sub system passing a 100 KB message at 40 Hz
  1. Default use case, IPC ON (rclcpp's IPC, so no copies). User allocates the Msg.
  2. Default use case, IPC OFF (the executor makes a copy of the message). User allocates the Msg.
  3. Loaned message, IPC ON: intra_process_publish(LoanedMessage); (see the API sketch below the figure)
  4. Loaned message, IPC OFF: inter_process_publish(LoanedMessage);

[Figure: CPU time comparison of the four cases above]
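
For reference, the loaned-message cases map to the public rclcpp API roughly as follows. This is a minimal sketch, not the benchmark code: the node name, topic, and message type (std_msgs::msg::Float64, a plain type eligible for loans) are placeholders, and rclcpp::init() is assumed to have been called.

  #include <rclcpp/rclcpp.hpp>
  #include <std_msgs/msg/float64.hpp>  // placeholder: a plain, loan-eligible type

  // Publishing via a loaned message: the middleware owns the buffer, so for
  // plain types publish() hands the sample over without an extra copy.
  auto node = std::make_shared<rclcpp::Node>("loan_demo");
  auto pub = node->create_publisher<std_msgs::msg::Float64>("chatter", 10);

  auto loaned = pub->borrow_loaned_message();  // buffer loaned by the middleware
  loaned.get().data = 3.14;                    // fill the sample in place
  pub->publish(std::move(loaned));             // ownership returns to the middleware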

@mauropasse The DataWriter is a wrapper of both an RTPSWriter and a WriterHistory. It is able to perform intraprocess deliveries, but that doesn’t mean it will avoid the serialization (i.e. copy).

On its way from the user sample on the publishing side to the user sample on the subscribing side, the following copies may be performed:

  • Serialization from the user sample into the writer history during the DataWriter::write call (this is avoided when the input data comes from a loan)
  • From the writer’s history to the reader’s history
    • On intraprocess this is automatically avoided.
    • For readers on the same host and fully bounded data types, no copy is made, as the writer’s history is on a shared-memory mmap’d file.
    • For other cases, the transport is used, which requires:
      • Encapsulation (1 copy)
      • Data transfer (1 copy with UDP, no copies with SHM)
      • Decapsulation (1 copy)
  • Deserialization from the reader’s history to the user sample. This depends on how DataReader::take is called: when loans are used and the type is plain, a pointer to the serialized payload (which is binary compatible with the user type) is returned (see the sketch below).
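
For illustration, that loan-based read path looks roughly like this with the Fast DDS DataReader API (MyPlainType and process() are placeholders; error handling omitted):

  #include <fastdds/dds/core/LoanableSequence.hpp>
  #include <fastdds/dds/subscriber/DataReader.hpp>

  // Zero-copy take: for plain types the loaned samples point directly at the
  // serialized payload in the reader's history, so nothing is deserialized.
  FASTDDS_SEQUENCE(DataSeq, MyPlainType);
  DataSeq samples;
  eprosima::fastdds::dds::SampleInfoSeq infos;
  if (reader->take(samples, infos) == ReturnCode_t::RETCODE_OK) {
    for (eprosima::fastdds::dds::LoanableCollection::size_type i = 0; i < infos.length(); ++i) {
      if (infos[i].valid_data) {
        process(samples[i]);  // read the sample in place
      }
    }
    reader->return_loan(samples, infos);  // hand the loan back to the history
  }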

I hope this long explanation is clear enough.

Yes, if it’s possible we should try to reuse the logic for LoanedMessages.

The rclcpp code shouldn’t really care about whether the message comes from the same process or from a different process via a shared memory layer, as long as the underlying DDS supports optimized intra-process communication.

We can see memmove/memset operations called twice: once by the publisher when creating the message, and once by the executor when taking the subscription message in execute_subscription(), where a new message is created.

The problem is that this line is zero-initializing the message, right? https://github.com/ros2/rclcpp/blob/7d8b2690ff188672bfe3703d8b3c074b87303931/rclcpp/src/rclcpp/executor.cpp#L637

(make sure you use permalinks; otherwise the link will be pointing to another part of the code when the file is updated)

The difference between zero and default initialization in C++ is extremely subtle, but it’s also possible to achieve the second one.

e.g.

auto data = new T;   // default-initialized: no memset for trivial types, contents start indeterminate
auto data = new T(); // this one is zero-initialized :)

we’re using

https://github.com/ros2/rclcpp/blob/1037822a63330495dcf0de5e8f20544375a5f116/rclcpp/include/rclcpp/message_memory_strategy.hpp#L87

which is equivalent to zero-initialization (it uses ::new (pv) T(v) according to cppreference). When using allocators it’s a bit harder to get default initialization (C++20 introduced allocate_shared_for_overwrite 😕 ), but something like

  // allocate raw storage for one MessageT (note: allocate takes an element
  // count, not a byte count)
  MessageT * message_ptr = std::allocator_traits<MessageAlloc>::allocate(*message_allocator_.get(), 1);
  try {
    // placement new, using default initialization (no zeroing of trivial members)
    message_ptr = new (message_ptr) MessageT;
    return std::shared_ptr<MessageT>(
      message_ptr,
      [allocator = message_allocator_](MessageT * p) {
        // destroy the message, then hand the storage back to the allocator
        p->~MessageT();
        std::allocator_traits<MessageAlloc>::deallocate(*allocator.get(), p, 1);
      },
      *message_allocator_.get());
  } catch (...) {
    // construction failed: release the raw storage and rethrow
    std::allocator_traits<MessageAlloc>::deallocate(*message_allocator_.get(), message_ptr, 1);
    throw;
  }

should do it. (don’t expect that snippet to work, I haven’t actually tried it 😃 )

Another thing to check is the default constructors of the message we’re generating. Those might be forcing zero-initialization, but AFAIR that was not the case.
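
For illustration (hypothetical structs, not actual rosidl output): a defaulted constructor keeps the type trivially constructible, so new T leaves members uninitialized, while a member initializer forces the zeroing on every construction regardless of how the object is created.

  struct MsgA {                   // trivially constructible:
    uint8_t data[100000];         //   new MsgA  -> contents indeterminate, no memset
  };
  struct MsgB {                   // member initializer:
    uint8_t data[100000] = {};    //   even new MsgB zero-fills the array
  };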

@mauropasse

thanks 👍 but I am not sure how we can tell the latency difference from this graph, since it only shows the ratio of CPU consumption…

  • rclcpp intra-process

    rclcpp::executors::StaticSingleThreadedExecutor::spin is 47.01%, which includes the timer-driven publish and the subscription's call into the user callback.

  • Fast-DDS intra-process

    eprosima::fastrtps::rtps::AsyncWriterThread::run is 18.0% (as you mentioned), and rclcpp::executors::StaticSingleThreadedExecutor::spin is 28.02%; in total, that is 46%.

I think the CPU time ratio is almost the same? Just checking in case I am missing something.

@fujitatomoya those results are with Fast-DDS, using the instructions provided by @MiguelCompany.

The bounded-size data type restriction also exists in Cyclone, which additionally needs a few more changes before it will be willing to do this in the absence of shared memory (that’s a bookkeeping detail).

Frankly, I’m shocked that there are two memsets in the profile and that together they consume 50% of the time. Cyclone at least has a good argument for calling memmove when not using loaned messages, but the time in memset seems really wasteful, as it is basically spent inside the allocation of a new message that (presumably) will then be filled with data. I guess that’s standard C++ behaviour, but you might want to look into that.
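
Following up on the allocate_shared_for_overwrite remark above: with C++20 the default-initialized allocation can be written directly. A minimal sketch, assuming a trivially-constructible placeholder type MessageT:

  #include <memory>

  // *_for_overwrite performs default initialization, so for trivial message
  // types no memset runs before the buffer is overwritten with real data.
  auto zeroed = std::make_shared<MessageT>();                // value-initialized: memset
  auto raw    = std::make_shared_for_overwrite<MessageT>();  // default-initialized: no memset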