rmw_cyclonedds: system adjustment can break timer callback periodicity
Bug report
related to https://github.com/ros2/ros2/issues/1080
- Operating System:
- Ubuntu 20.04
- Installation type:
- source
- Version or commit hash:
- DDS implementation:
- rmw_cyclonedds
- Client library (if applicable):
- rclcpp
Steps to reproduce issue
- Terminal - 1
$ ros2 run demo_nodes_cpp talker
[INFO] [1611041505.335502285] [talker]: Publishing: 'Hello World: 1'
[INFO] [1611041506.335387664] [talker]: Publishing: 'Hello World: 2'
[INFO] [1611041507.335367773] [talker]: Publishing: 'Hello World: 3'
...
- Terminal - 2
$ sudo timedatectl set-ntp no
$ sudo date -s @`date +%s | xargs -I@ echo @-5 | bc`
Expected behavior
timer callback is called in 1Hz with system time adjustment.
Actual behavior
once system time jumps back in 5 seconds, timer will be in sleep in 5 more seconds.
Additional information
futex system call uses FUTEX_CLOCK_REALTIME as following.
16:36:58.220792 futex(0x556d8f3ee868, FUTEX_WAKE_PRIVATE, 1) = 0
16:36:58.220924 ioctl(2, TCGETS, {B38400 opost isig icanon echo ...}) = 0
16:36:58.220983 write(2, "\33[0m[INFO] [1611041818.220912410"..., 79) = 79
16:36:58.221198 futex(0x556d8f3ee8b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1611041819, tv_nsec=220456764}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
16:36:59.220629 futex(0x556d8f3ee868, FUTEX_WAKE_PRIVATE, 1) = 0
16:36:59.220875 ioctl(2, TCGETS, {B38400 opost isig icanon echo ...}) = 0
16:36:59.220936 write(2, "\33[0m[INFO] [1611041819.220860661"..., 79) = 79
16:36:59.221197 futex(0x556d8f3ee8b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1611041820, tv_nsec=220465220}, FUTEX_BITSET_MATCH_ANY
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 16 (8 by maintainers)
@samialperen Of course there is a quick way to deal with it! Unless I’m terribly mistaken, all you need to do is switch over everything that currently uses the wall clock to a monotonic clock, so basically:
src/ddsrt/src/time/posix.cchangedds_time,dds_time_wallclockanddds_time_elapsedto useCLOCK_BOOTTIME(that was always the intent fordds_time_elapsedanyway)src/ddsrt/src/sync/posix.cchangedds_cond_initto:nanosleep, I thinkdds_sleepforwill be ok, but you could change https://github.com/eclipse-cyclonedds/cyclonedds/blob/9e119b958a00757a147db36ccb8c42c1fea62457/src/ddsrt/src/time.c#L33 to useclock_nanosleep;It does use
selectand I don’t know what it does in this respect, but in Cyclone the time timeout is ∞, so that doesn’t matter 🙂You could also use
CLOCK_MONOTONICthroughout, the difference is in whether the clock keeps ticking while the machine is suspended, that’s probably not an issue for you.CLOCK_BOOTTIMEseems like a better choice to me, but YMMV.In short, with only minimal changes there is no source of wall clock time in the code base anymore and everything that sleeps does so based on a clock that is not affected by adjusting the system time. There are downsides, and to properly eliminate those and have the truly perfect behaviour requires more thought. A (probably non-exhaustive) list:
lifespan— see the DDS spec 1.5 section 2.2.3.16 if you’re curious) and that really breaks very badly if clocks are not reasonably well aligned. Again, I don’t think this will be a problem for you (IINM ROS doesn’t use any features that have this property).dds_write(ordispose, or …) automatically puts the current time as returned bydds_timeinto the source timestamp field, and changing that clock would be a big change in externally visible behaviour. For application data, there is the alternative functiondds_write_ts(and friends) that allow the application to specify the source timestamp, so with this quick hack you would (probably) want to change all calls involving the writing of data. It is ROS, so that’s easy.I believe only the last one is likely to be an issue for you, but perhaps even that one isn’t (it is easy: if you don’t use the source timestamp, or only for computing differences between timestamps from a single source, or you’re on a single machine and don’t care for a relation to the wall clock, you can ignore it). If you do care:
dds_writewithdds_write_tsand put whatever value you want in the timestamp parameter.dds_writecdr. This one is misnamed and has a misfeature. Backwards compatibility makes me shy away from fixing it. You probably should replace it with the (very) slightly less problematicdds_forwardcdr. Basically, a call todds_writecdr(wr, serdata)has to change into: Where the...is whatever you want to put into the timestamp. It’s ugly, but it’ll work.dds_timeguarded by theREPORT_BLOCKED_REQUESTSandREPORT_LATE_MESSAGESmacros. These are normally not turned off, so you can leave them or change them in the obvious manner.So I think you’ll be all set with a few minutes’ worth of hacking — probably less than the time I spent writing this 😀
And if there’s any trouble or you have further questions, please just ask.
By the way, as I have mentioned before, Cyclone ought to be fixed. The standard for a PR for that is obviously higher, and ideally would also consider macOS and Windows, but I don’t think it will actually involve many changes, rather just careful bookkeeping of which clock gets used for what. (And that probably starts with changing all internal timestamps from
dds_time_ttoddsrt_wctime_tbecause then the compiler will do the heavy lifting of finding inconsistencies.)Thanks @fujitatomoya, but no need to worry. I hadn’t started yet and besides it is something that should be done in Cyclone anyway. The bulk of Cyclone’s code base is already capable of dealing with time jumps, it is just a few places that need some work.
Absolutely, my only problem is that if I switch
dds_waitset_waitto use a monotonic clock I need to also updatedds_waitset_wait_until(an API change) because it uses the same underlying implementation, or have to have two underlying implementations ofwaitset_wait, or unify them in some way … and the same goes for my condition variables.Fortunately all internal time stamps in Cyclone are already tagged (via the type system) as wall clock, monotonic-uptime or monotonic-time-since-boot. So it is a matter of bookkeeping and minimising code duplication. There is no fundamental problem, it just takes some work.
Just to be sure: the issue you’re running into is that implementation of the timeout in
rmw_wait(which is simply passed on to Cyclone’sdds_waitset_wait) is using the wall clock rather than a monotonic clock. Correct?I don’t have a problem changing the behaviour of
dds_waitset_waitprovided the change applies to all timeouts. In particular that’ll probably mean an alternative version ofwait_untilwill be needed and that has some impact as there are really 3 clocks in play (wall clock; monotonic “uptime” clock and monotonic clock that continues counting while the machine is suspended). So I need a bit of time.