ros2cli: Nodes missing from `ros2 node list` after relaunch

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • Foxy binaries
  • Version or commit hash:
    • ros-foxy-navigation2 0.4.5-1focal.20201210.084248
  • DDS implementation:
    • Fast-RTPS (default)
  • Client library (if applicable):
    • n/a

Steps to reproduce issue

Step 1

From the workspace root, launch (e.g.) a TurtleBot3 simulation:

export TURTLEBOT3_MODEL=burger
export GAZEBO_MODEL_PATH=$GAZEBO_MODEL_PATH:$(pwd)/src/turtlebot3/turtlebot3_simulations/turtlebot3_gazebo/models
ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py

Then, in a second terminal, launch the navigation:

export TURTLEBOT3_MODEL=burger
ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true

Print the node list:

ros2 node list

Close (ctrl-c) the navigation and the simulation.

Step 2

From the same respective terminals, relaunch the simulation:

ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py

and the navigation:

ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true

Print the node list again (2nd time):

ros2 node list

Close (ctrl-c) the navigation and the simulation. Stop the ros2 daemon:

ros2 daemon stop

Step 3

From the same respective terminals, relaunch the simulation:

ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py

and the navigation:

ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true

Print the node list again (3rd time):

ros2 node list

Expected behavior

The node list should be the same all three times (up to some hash in the /transform_listener_impl_... nodes).

Actual behavior

The second time, the following nodes are missing (the remainder is practically the same):

/controller_server
/controller_server_rclcpp_node
/global_costmap/global_costmap
/global_costmap/global_costmap_rclcpp_node
/global_costmap_client
/local_costmap/local_costmap
/local_costmap/local_costmap_rclcpp_node
/local_costmap_client
/planner_server
/planner_server_rclcpp_node

The third time, after stopping the daemon, it works as expected again.

Note that everything else works fine; in the navigation use case above, the nodes are fully functional.

Additional information

This issue was raised here: ros-planning/navigation2#2145.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 36 (11 by maintainers)

Most upvoted comments

I’m seeing something similar with gazebo + ros2_control as well.

The interesting thing is that if I run `ros2 node list`, I get 0 nodes.

If I run `ros2 node list --no-daemon`, I get the full list of nodes.

Restarting the daemon with `ros2 daemon stop; ros2 daemon start` also shows all nodes.
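
For reference, a minimal diagnostic sequence (a sketch, assuming a sourced ROS 2 environment) that separates a stale daemon cache from an actual discovery problem:

# Query through the daemon's cached graph (the default path).
ros2 node list

# Bypass the daemon and run discovery directly in the CLI process.
ros2 node list --no-daemon

# If the two disagree, the daemon's cache is stale; restart it and re-check.
ros2 daemon stop
ros2 daemon start
ros2 node list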

I think this is expected behavior for the ros2 daemon; it is well described in “what is ros2 daemon”.

I’m seeing this bug on a project with five nodes, FastRTPS, native Ubuntu install.

I’m using ros2 launch files; everything comes up nicely the first couple of times, but eventually `ros2 node list` stops seeing all of the nodes (which are definitely running). At the same time, `ros2 param` stops being able to interact with the hidden nodes, and `ros2 topic list` stops showing all of the topics.

rqt is a bit weird; there were a few times when it seemed to find a different collection of topics and nodes than the CLI tools did.

`ros2 daemon stop; ros2 daemon start` has saved my day.

I hope you can reproduce this issue on your machines; otherwise nobody can confirm a workaround patch even if I have one 😄.

  1. I can’t reproduce this issue with rmw_cyclonedds_cpp.

  2. For rmw_fastrtps_cpp: Ctrl+C on `ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False` can’t make all processes exit normally, so the shared-memory files used by Fast DDS are not cleaned up successfully. I don’t know whether that is the root cause of the ros2 daemon no longer updating node_listener -> rmw_dds_common::GraphCache::update_participant_entities.

  3. Some information about the ros2 daemon:

  • top info of ros2 daemon
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3648025 chenlh    20   0  667912  79412  47136 R  99.7   0.2   4:02.62 python3       # almost 100% CPU usage
3648022 chenlh    20   0  667912  79412  47136 S   0.3   0.2   0:03.56 python3
3647989 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.40 python3
3648019 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648020 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648021 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.01 python3
3648023 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.08 python3
3648024 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648026 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648027 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.05 python3
3648028 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648029 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.02 python3
  • thread info of ros2 daemon

To find out which thread is consuming the CPU: LWP 3648025 is thread Id 8:

(gdb) info thread
  Id   Target Id                                     Frame 
* 1    Thread 0x7faf51f801c0 (LWP 3647989) "python3" 0x00007faf52099d7f in __GI___poll (fds=0x7faf513bbae0, nfds=1, timeout=7200000)
    at ../sysdeps/unix/sysv/linux/poll.c:29
  2    Thread 0x7faf4c282640 (LWP 3648019) "python3" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, 
    expected=0, futex_word=0x7faf50ceb000 <(anonymous namespace)::g_signal_handler_sem>) at ./nptl/futex-internal.c:57
  3    Thread 0x7faf4ba81640 (LWP 3648020) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7faf4ba80de0, op=137, 
    expected=0, futex_word=0x55e32f872ae0) at ./nptl/futex-internal.c:57
  4    Thread 0x7faf4b280640 (LWP 3648021) "python3" __futex_abstimed_wait_common64 (private=290346745, cancel=true, abstime=0x7faf4b27fc10, op=137, 
    expected=0, futex_word=0x55e32feb7760) at ./nptl/futex-internal.c:57
  5    Thread 0x7faf4a9f8640 (LWP 3648022) "python3" __futex_abstimed_wait_common64 (private=1326168272, cancel=true, abstime=0x7faf4a9f7c10, op=137, 
    expected=0, futex_word=0x55e32ff19bcc) at ./nptl/futex-internal.c:57
  6    Thread 0x7faf4a1f7640 (LWP 3648023) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=17, buf=0x55e32ff1c570, len=65500, flags=0, addr=..., 
    addrlen=0x7faf4a1f6a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
  7    Thread 0x7faf499f6640 (LWP 3648024) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=18, buf=0x55e32ff2cd90, len=65500, flags=0, addr=..., 
    addrlen=0x7faf499f5a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
  8    Thread 0x7faf491e8640 (LWP 3648025) "python3" 0x00007faf500de664 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
  9    Thread 0x7faf489e7640 (LWP 3648026) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=20, buf=0x55e32ff40070, len=65500, flags=0, addr=..., 
    addrlen=0x7faf489e6a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
  10   Thread 0x7faf481d9640 (LWP 3648027) "python3" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7faf481d8940, 
    op=265, expected=0, futex_word=0x7faf470c9110) at ./nptl/futex-internal.c:57
  11   Thread 0x7faf478f8640 (LWP 3648028) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x55e32ff54a28) at ./nptl/futex-internal.c:57
  12   Thread 0x7faf46d57640 (LWP 3648029) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7faf30000c04) at ./nptl/futex-internal.c:57

The backtrace for thread Id 8:

(gdb) thread 8
[Switching to thread 8 (Thread 0x7faf491e8640 (LWP 3648025))]
#0  0x00007faf500df636 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) bt
#0  0x00007faf500df636 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1  0x00007faf4f6b4163 in eprosima::fastdds::rtps::SharedMemManager::find_segment (this=0x55e32fd29aa0, id=...)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemManager.hpp:1282
#2  0x00007faf4f6b22f1 in eprosima::fastdds::rtps::SharedMemManager::Listener::pop (this=0x55e32ff2ccf0)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemManager.hpp:711
#3  0x00007faf4f6b58fb in eprosima::fastdds::rtps::SharedMemChannelResource::Receive (this=0x55e32fe3b100, remote_locator=...)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp:182
#4  0x00007faf4f6b556e in eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation (this=0x55e32fe3b100, input_locator=...)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp:133
#5  0x00007faf4f6d0579 in std::__invoke_impl<void, void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> (
    __f=@0x55e32ff3fa78: (void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastdds::rtps::SharedMemChannelResource * const, eprosima::fastrtps::rtps::Locator_t)) 0x7faf4f6b54e4 <eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t)>, __t=@0x55e32ff3fa70: 0x55e32fe3b100) at /usr/include/c++/11/bits/invoke.h:74
#6  0x00007faf4f6d00e2 in std::__invoke<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> (
    __fn=@0x55e32ff3fa78: (void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastdds::rtps::SharedMemChannelResource * const, eprosima::fastrtps::rtps::Locator_t)) 0x7faf4f6b54e4 <eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t)>) at /usr/include/c++/11/bits/invoke.h:96
#7  0x00007faf4f6cfeb3 in std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> >::_M_invoke<0ul, 1ul, 2ul> (this=0x55e32ff3fa58)
    at /usr/include/c++/11/bits/std_thread.h:253
#8  0x00007faf4f6cf952 in std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> >::operator() (this=0x55e32ff3fa58)
    at /usr/include/c++/11/bits/std_thread.h:260
#9  0x00007faf4f6cf218 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> > >::_M_run (this=0x55e32ff3fa50)
    at /usr/include/c++/11/bits/std_thread.h:211
#10 0x00007faf501c42b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007faf52015b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007faf520a7a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

https://github.com/eProsima/Fast-DDS/blob/7e12e8fe2cebf27c621263fa544f94b099504808/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp#L128-L136

    void perform_listen_operation(
            Locator input_locator)
    {
        Locator remote_locator;

        while (alive())
        {
            // Blocking receive.
            std::shared_ptr<SharedMemManager::Buffer> message;

            if (!(message = Receive(remote_locator)))
            // Receive is expected to block while there is no data, but here it
            // keeps returning nullptr immediately, so the loop retries forever.
            {
                continue;
            }

Receive fails to pop the message because find_segment throws an exception internally; the loop then busy-spins calling Receive again and again, which matches the ~100% CPU usage of that thread.

I don’t know whether it’s a bug or not, because I couldn't reproduce this issue the first time after clearing the related shm files /dev/shm/*fastrtps*.
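
For anyone who wants to inspect this state themselves, a sketch (assuming the default Fast DDS shared-memory transport on Linux; the `fastdds` CLI ships with Fast DDS):

# List the Fast DDS shared-memory segments and port files left behind.
ls -l /dev/shm/*fastrtps*

# Remove zombie segments with the Fast DDS CLI.
fastdds shm clean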

This issue is not easy to reproduce.

But it must still be there, because I can reproduce it a few times on rolling (the steps are similar to https://github.com/ros2/ros2cli/issues/582#issue-784108824). After stopping the ros2 daemon, as in step 2 of https://github.com/ros2/ros2cli/issues/582#issue-784108824, we immediately get the correct node list again.

1. `ros2 daemon stop` (stop the ros2 daemon if it was running before)
2. `ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False`
3. `ros2 node list | wc -l` (a count of 31 means the list is currently correct)
4. Ctrl+C to stop step 2, then relaunch it and re-check step 3 (a scripted version of this loop is sketched below)
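
A rough script for the relaunch loop (a sketch only; assumes nav2_bringup is installed and sourced, uses headless:=True to skip the Gazebo GUI, and takes 31 as the expected node count):

#!/bin/bash
expected=31
for i in $(seq 1 10); do
    ros2 launch nav2_bringup tb3_simulation_launch.py headless:=True &
    launch_pid=$!
    sleep 30                                  # let the stack come up
    count=$(ros2 node list | wc -l)
    echo "iteration $i: $count nodes (expected $expected)"
    kill -INT "$launch_pid"                   # emulate Ctrl+C
    wait "$launch_pid"
done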

Notice that the navigation demo runs well even if the ros2 node list is incorrect.


@iuhilnehc-ynos

Can you evaluate the 2 PRs introduced in https://github.com/ros2/rmw_fastrtps/issues/699#issuecomment-1653795722 with the reproducible procedure in this issue?

I tried other methods such as stopping and restarting the daemon, and that seemed to work, but I felt apprehensive about that workaround as I don’t fully understand the consequences.

The downside could be discovery time for other nodes running on that host system. The daemon caches and advertises the ROS 2 network graph, so while it is running, other ROS 2 processes on the same host can query it for connectivity information instead of waiting for full discovery.
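
As a rough illustration of that trade-off (timings are machine- and graph-dependent):

# With the daemon running: answered from the cached graph, near-instant.
time ros2 node list

# Without the daemon: the CLI has to join discovery itself and wait.
time ros2 node list --no-daemon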

What does --spin-time do?

We can use this option to wait for the ROS 2 network graph to update until the specified timeout expires. Note that this option only takes effect when the daemon is not running or the --no-daemon option is specified.
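
So the flag only helps when the CLI does its own discovery, e.g.:

# Spin the CLI's own hidden node for 5 seconds of discovery before printing.
ros2 node list --no-daemon --spin-time 5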

I have tested it on ros:rolling (Docker), building turtlebot3 and navigation2 from source (ros:rolling does not provide the nav2 packages); after testing many times, it works well.

@iuhilnehc-ynos great news! thanks for checking.

What I found worked was adding the --spin-time parameter to the call: `ros2 node list --spin-time 5`. That always seemed to populate the node list correctly. I hope this helps others.

I’m currently having this problem as well, but --spin-time does not work for me. The only workaround that works is the --no-daemon option. Other commands such as `ros2 param list` also do not work. I’m running only a single node on Humble, Ubuntu 22.04 (LTS).

Restarting the daemon also does not seem to solve the problem.

No idea if it helps, but here is the output of ros2 doctor --report while my node is running:

/opt/ros/humble/lib/python3.10/site-packages/ros2doctor/api/__init__.py: 154: UserWarning: Fail to call PackageReport class functions.

   NETWORK CONFIGURATION
inet         : 127.0.0.1
inet4        : ['127.0.0.1']
inet6        : ['::1']
netmask      : 255.0.0.0
device       : lo
flags        : 73<RUNNING,UP,LOOPBACK>
mtu          : 65536
inet         : 192.168.220.61
inet4        : ['192.168.220.61']
ether        : 3c:a9:f4:17:ec:08
inet6        : ['fe80::e214:a874:3128:3e04%wlo1']
netmask      : 255.255.0.0
device       : wlo1
flags        : 4163<BROADCAST,UP,MULTICAST,RUNNING>
mtu          : 1500
broadcast    : 192.168.255.255
ether        : 2c:59:e5:03:b0:46
device       : enp0s25
flags        : 4099<BROADCAST,UP,MULTICAST>
mtu          : 1500

   PLATFORM INFORMATION
system           : Linux
platform info    : Linux-5.19.0-35-generic-x86_64-with-glibc2.35
release          : 5.19.0-35-generic
processor        : x86_64

   QOS COMPATIBILITY LIST
topic [type]            : /parameter_events [rcl_interfaces/msg/ParameterEvent]
publisher node          : _ros2cli_daemon_42_3d320951c78f477dbb7ee7a28c576fda
subscriber node         : _NODE_NAME_UNKNOWN_
compatibility status    : OK
topic [type]            : /parameter_events [rcl_interfaces/msg/ParameterEvent]
publisher node          : _NODE_NAME_UNKNOWN_
subscriber node         : _NODE_NAME_UNKNOWN_
compatibility status    : OK

   RMW MIDDLEWARE
middleware name    : rmw_fastrtps_cpp

   ROS 2 INFORMATION
distribution name      : humble
distribution type      : ros2
distribution status    : active
release platforms      : {'debian': ['bullseye'], 'rhel': ['8'], 'ubuntu': ['jammy']}

   TOPIC LIST
topic               : none
publisher count     : 0
subscriber count    : 0

Again, not sure if it helps, but when I installed ROS 2, I added the following lines to ~/.bashrc:

# ROS 2 configs
source /opt/ros/humble/setup.bash
export ROS_DOMAIN_ID=42
export ROS_LOCALHOST_ONLY=1
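
One thing worth double-checking with a setup like this (my assumption, not something confirmed in this thread): the daemon is tied to the domain configuration, so the terminal running the CLI must see the same environment as the node:

# Confirm the CLI terminal uses the same domain settings as the node...
echo "$ROS_DOMAIN_ID $ROS_LOCALHOST_ONLY"

# ...and check whether a daemon is running for this configuration.
ros2 daemon status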

Can you point out which nodes or processes cannot exit normally? Is it a receive exception or a crash?

Pressing Ctrl+C on `ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False` behaves differently each time, but most errors come from rviz2 and component_container_isolated, which might be killed by ros2 launch.

Is it always the same node that cannot be listed, or a random one?

It shows a random node list. When the issue happens, the list is almost the same as the one from the previous run of tb3_simulation_launch.py, except that node names containing new IDs are refreshed, such as the launch node /launch_ros_{a_new_pid}.

  • If we add a `fastdds shm clean` step to this procedure, does the problem stop happening?

No; I tried using `fastdds shm clean`, but it is not enough, because the ros2 daemon’s own node keeps shared-memory files open for data communication. I must stop the ros2 daemon as well.
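
Putting that together, the ordering that matters here (a sketch; stop the daemon first so it releases its own segments):

ros2 daemon stop     # releases the daemon's shared-memory segments
fastdds shm clean    # now the zombie segments can actually be removed
ros2 daemon start
ros2 node list       # should be populated correctly again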

BTW: I don’t think this issue is difficult to reproduce. Don’t be gentle with tb3_simulation_launch.py: press Ctrl+C at any point to stop it and rerun it immediately. I have confirmed this issue on both Humble and Rolling.

I have not noticed this bug in Galactic, but I encountered it immediately when I moved to Humble. I have seen https://github.com/ZhenshengLee/ros2_jetson/issues/10 on Galactic, though.

I’m not sure why the RMW could cause this problem. Would changing the RMW solve this issue?

The discovery protocol is implemented in the RMW implementation, so changing the RMW would solve the problem.
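
For example, switching the RMW at run time (a sketch; assumes rmw_cyclonedds_cpp is installed, and every node plus the daemon must use the same RMW to see each other):

export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ros2 daemon stop     # restart the daemon so it also picks up the new RMW
ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true
ros2 node list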

All of ros2cli depends on rclpy; might using rclcpp instead be a workaround to bypass this issue?

No, I do not think so. As noted in the previous comment, discovery depends on the underlying RMW implementation.

Has this issue been resolved in a later ROS 2 release, like Galactic or Humble?

I cannot reproduce this issue in my local environment with the rolling branch.

Exactly, I’ve seen both issues.

problem-1: the cache (daemon) retains nodes that were killed long ago.
problem-2: the cache (daemon) does not add new nodes.

I’m trying to find reproducible examples; currently I can make it happen 100% of the time, but only on a complex setup involving ros2_control with 2 controllers plus launching and stopping navigation2.

There may also be underlying RMW issues causing problem-2, since I’ve seen rviz2 fail to list the topics from the newly spawned nodes, and even though I haven’t looked in depth, I believe rviz2 has no relation to ros2cli.

Ah, I see. You are saying:

But this cache is getting outdated and only restarting the daemon fixes it.

problem-1: stale entries remain in the cache and never get cleaned up?

But I am starting new nodes and they do not show up on any commands that use the daemon, even after waiting several minutes.

problem-2: cache does not get updated?

Am I understanding this correctly?