rmw_cyclonedds: system adjustment can break timer callback periodicity

Bug report

related to https://github.com/ros2/ros2/issues/1080

Steps to reproduce issue

  • Terminal - 1
$ ros2 run demo_nodes_cpp talker
[INFO] [1611041505.335502285] [talker]: Publishing: 'Hello World: 1'
[INFO] [1611041506.335387664] [talker]: Publishing: 'Hello World: 2'
[INFO] [1611041507.335367773] [talker]: Publishing: 'Hello World: 3'
...
  • Terminal - 2
$ sudo timedatectl set-ntp no
$ sudo date -s @`date +%s | xargs -I@ echo @-5 | bc`

Expected behavior

timer callback is called in 1Hz with system time adjustment.

Actual behavior

once system time jumps back in 5 seconds, timer will be in sleep in 5 more seconds.

Additional information

futex system call uses FUTEX_CLOCK_REALTIME as following.

16:36:58.220792 futex(0x556d8f3ee868, FUTEX_WAKE_PRIVATE, 1) = 0
16:36:58.220924 ioctl(2, TCGETS, {B38400 opost isig icanon echo ...}) = 0
16:36:58.220983 write(2, "\33[0m[INFO] [1611041818.220912410"..., 79) = 79
16:36:58.221198 futex(0x556d8f3ee8b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1611041819, tv_nsec=220456764}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
16:36:59.220629 futex(0x556d8f3ee868, FUTEX_WAKE_PRIVATE, 1) = 0
16:36:59.220875 ioctl(2, TCGETS, {B38400 opost isig icanon echo ...}) = 0
16:36:59.220936 write(2, "\33[0m[INFO] [1611041819.220860661"..., 79) = 79
16:36:59.221197 futex(0x556d8f3ee8b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1611041820, tv_nsec=220465220}, FUTEX_BITSET_MATCH_ANY

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

@samialperen Of course there is a quick way to deal with it! Unless I’m terribly mistaken, all you need to do is switch over everything that currently uses the wall clock to a monotonic clock, so basically:

It does use select and I don’t know what it does in this respect, but in Cyclone the time timeout is ∞, so that doesn’t matter 🙂

You could also use CLOCK_MONOTONIC throughout, the difference is in whether the clock keeps ticking while the machine is suspended, that’s probably not an issue for you. CLOCK_BOOTTIME seems like a better choice to me, but YMMV.

In short, with only minimal changes there is no source of wall clock time in the code base anymore and everything that sleeps does so based on a clock that is not affected by adjusting the system time. There are downsides, and to properly eliminate those and have the truly perfect behaviour requires more thought. A (probably non-exhaustive) list:

  • There are some things for which you really want the wall clock time, e.g., for the start time stamp, in practice in most cases for all the time stamps in the trace. I don’t think this would be an issue for you.
  • There are some features in DDS that are specified in terms of the source timestamp (like lifespan — see the DDS spec 1.5 section 2.2.3.16 if you’re curious) and that really breaks very badly if clocks are not reasonably well aligned. Again, I don’t think this will be a problem for you (IINM ROS doesn’t use any features that have this property).
  • There is a real problem with the timestamps in notifications from DDS about lease expiration of remote entities: normal creation/destruction gets timestamps on the source machine, but lease expiration happens on another machine with a different clock. With reasonably well-synchronised clocks, this can be papered over, but it is problematic. Picking the least-bad solution is not trivial. The ROS graph handling will be fine, it doesn’t look at the timestamp, so also a non-issue for you.
  • There is the DDS “source timestamp” in application data. Calling dds_write (or dispose, or …) automatically puts the current time as returned by dds_time into the source timestamp field, and changing that clock would be a big change in externally visible behaviour. For application data, there is the alternative function dds_write_ts (and friends) that allow the application to specify the source timestamp, so with this quick hack you would (probably) want to change all calls involving the writing of data. It is ROS, so that’s easy.

I believe only the last one is likely to be an issue for you, but perhaps even that one isn’t (it is easy: if you don’t use the source timestamp, or only for computing differences between timestamps from a single source, or you’re on a single machine and don’t care for a relation to the wall clock, you can ignore it). If you do care:

  • Firstly, just replace all two calls to dds_write with dds_write_ts and put whatever value you want in the timestamp parameter.
  • Secondly, for writing serialised data (not quite a DDS API) and loaned samples (full native support for this is finally near in Cyclone’s C API) it relies on some unstable interface in Cyclone: dds_writecdr. This one is misnamed and has a misfeature. Backwards compatibility makes me shy away from fixing it. You probably should replace it with the (very) slightly less problematic dds_forwardcdr. Basically, a call to dds_writecdr(wr, serdata) has to change into:
      serdata->statusinfo = 0;
      serdata->timestamp.v = ...;
      dds_forwardcdr(wr, serdata);
    
    Where the ... is whatever you want to put into the timestamp. It’s ugly, but it’ll work.
  • Finally, there are some calls to dds_time guarded by the REPORT_BLOCKED_REQUESTS and REPORT_LATE_MESSAGES macros. These are normally not turned off, so you can leave them or change them in the obvious manner.

So I think you’ll be all set with a few minutes’ worth of hacking — probably less than the time I spent writing this 😀

And if there’s any trouble or you have further questions, please just ask.

By the way, as I have mentioned before, Cyclone ought to be fixed. The standard for a PR for that is obviously higher, and ideally would also consider macOS and Windows, but I don’t think it will actually involve many changes, rather just careful bookkeeping of which clock gets used for what. (And that probably starts with changing all internal timestamps from dds_time_t to ddsrt_wctime_t because then the compiler will do the heavy lifting of finding inconsistencies.)

Thanks @fujitatomoya, but no need to worry. I hadn’t started yet and besides it is something that should be done in Cyclone anyway. The bulk of Cyclone’s code base is already capable of dealing with time jumps, it is just a few places that need some work.

but as far as i see, cyclonedds uses native handler such as pthread_xxx, so i think this can be addressed 🤞

Absolutely, my only problem is that if I switch dds_waitset_wait to use a monotonic clock I need to also update dds_waitset_wait_until (an API change) because it uses the same underlying implementation, or have to have two underlying implementations of waitset_wait, or unify them in some way … and the same goes for my condition variables.

Fortunately all internal time stamps in Cyclone are already tagged (via the type system) as wall clock, monotonic-uptime or monotonic-time-since-boot. So it is a matter of bookkeeping and minimising code duplication. There is no fundamental problem, it just takes some work.

Just to be sure: the issue you’re running into is that implementation of the timeout in rmw_wait (which is simply passed on to Cyclone’s dds_waitset_wait) is using the wall clock rather than a monotonic clock. Correct?

I don’t have a problem changing the behaviour of dds_waitset_wait provided the change applies to all timeouts. In particular that’ll probably mean an alternative version of wait_until will be needed and that has some impact as there are really 3 clocks in play (wall clock; monotonic “uptime” clock and monotonic clock that continues counting while the machine is suspended). So I need a bit of time.