opentelemetry-js: Sleeping computer causes span timings to be incorrect in browser

While debugging an issue for a user here https://gitter.im/open-telemetry/opentelemetry-node?at=5e6a4b54d17593652b7c8154 it was found that while a computer is slept or hibernated, the performance.now() monotonic clock may be paused. This causes the assumption that performance.timeOrigin + performance.now() ~= Date.now() to be incorrect by some arbitrary amount of time which may be hours or days.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 43 (40 by maintainers)

Commits related to this issue

Most upvoted comments

Hey all, wanted to chime in since I’ve spent several hours highly confused as to why I couldn’t see query my spans and it sounds like this might be why.

It looks to me like performance.timeOrigin is scoped per page load in Chrome, but per browser load in Firefox. That is, opening a new tab or refreshing the page gets me a new value for performance.timeOrigin in Chrome, but not in Firefox. (I’m on Firefox 84 on Mac.)

Until a few minutes ago, I hadn’t restarted Firefox for a couple weeks, but had put my computer to sleep every night. As a result, all my OpenTelemtry traces had timestamps from Dec 11, and I was pulling out my hair trying to understand why I couldn’t find my spans in Lightstep, Honeycomb, or Elastic. Restarting Firefox magically solved my problem.

Am I correct that this bug is probably to blame? And does that mean this bug is minor for Chrome but kinda huge for Firefox? (I.e., any Firefox user on my site will be invisible in my monitoring tool of choice unless they’ve recently restarted their browser.)

This issue of someone sleeping the computer, or leaving the tab open to go get coffee, etc. is sort of a basic limitation of measurement in the browser world. If a user starts an operation, shuts their computer, and then comes back 8 hours later and it finishes - then how long did the operation take?

Here are a few ideas for how we could deal with this:

  • Provide some documentation that this suspected to be a known issue and we suggest that people ignore large outlier spans from the client in their debugging (this is the easiest approach)
  • Attempt to do some detection in the web library itself of when the user closes the tab or moves to a different tab (see https://stackoverflow.com/questions/3888902/detect-browser-or-tab-closing), and do some handling of that, e.g. add a special tag to all spans that the browser tab was switched and possibly even end and send what spans we currently have (not sure on that).
  • Record both the high-resolution timestamp of spans (using performance.now) to make sure we aren’t losing accuracy, but also record the Date.now times, and then we can do some correction computations before spans are written and if the times don’t cohere, we can make a best-effort fix, and possibly add some special tag to the span to indicate that it had this monotonic clock issue.

I think it would be worth us digging more into the browser specs, docs, etc. to better understand this Chrome on Mac behavior and what was intended by the authors for performance.now. Maybe it is a Chrome bug.

When I run new Date(performance.timeOrigin+performance.now()).getTime() - new Date().getTime() in my browser, I seem to get 0.

I think the wall clock time could get out of sync with the high resolution timer for any number of reasons, so many that I think it can be taken for a fact that it will drift if the application is running for any longer amount of time. Examples:

Only affecting wall clock time but not high resolution timer:

  • NTP sync or other time correction
  • Leap seconds (which do affect UTC, contrary to daylight saving time changes)

Affecting high resolution timer:

  • Any kind of suspension:
    • “I closed my notebook lid” / hibernate / suspend / sleep – this will mostly affect the browser and dev machines
    • VM being migrated to another physical host and probably other VM shenanigans you may face when running on any PaaS/FaaS product where you-knows-what may happen to the underlying VM.
    • CPU clock changes are known to cause inaccuracies in some configurations
  • Some hardware / firmware bugs & corner cases (at least historically)

It seems like performance.now is uv_hrtime: https://github.com/nodejs/node/blob/5fad0b93667ffc6e4def52996b9529ac99b26319/src/node_perf_common.h#L19

You found https://github.com/nodejs/node/issues/17893 already, but even if the offset is calculated correctly at startup, there is just no way to solve this issue with a constant calculated-only-once offset. There is https://github.com/libuv/libuv/issues/1674 for the underlying uv_hrtime API, which has been closed as stale.

uv_hrtime is implemented with QueryPerformanceCounter on Windows. https://github.com/libuv/libuv/blob/f250c6c73ee45aa93ec44133c9e0c635780ea741/src/win/util.c#L490-L514, which is definitely liable to the aforementioned time drift.

The Linux (actually Linux-specific, other Unix have different implementations) is more complicated: The generic Unix entrypoint has only one line https://github.com/libuv/libuv/blob/0b1c752b5c40a85d5c749cd30ee6811997a8f71e/src/unix/core.c#L110-L112 calling the specific part with {{UV_CLOCK_PRECISE}} https://github.com/libuv/libuv/blob/c40f8cb9f8ddf69d116952f8924a11ec0623b445/src/unix/linux-core.c#L121-L155. This ends up calling clock_gettime(CLOCK_MONOTONIC), which is a libc/POSIX API, described for Linux e.g. here https://man7.org/linux/man-pages/man3/clock_gettime.3.html:

This clock does not count time that the system is suspended.

On Linux there would be CLOCK_BOOTTIME which would (supposedly) solve this problem but I believe it is not accessible through Node. Also, there is no equivalent for Windows (unless you want something a bit less precise; though maybe CLOCK_BOOTTIME is also less precise than CLOCK_MONOTONIC).

You could look at how other SDKs solve this problem. Python has a different API https://docs.python.org/3/library/time.html#time.time_ns, which ends up in this implementation https://github.com/python/cpython/blob/0ff626f210c69643d0d5afad1e6ec6511272b3ce/Python/pytime.c#L847-L955, but that’s probably not really applicable to Node.js as (an is a “a bit less precise” at least on Windows).

I think you could implement rather one to one what Java does though: There, each local root span takes a timestamp with both the most precise available wall clock time and the high resolution timer when it starts to calculate the offset. From then on, only the HR timer is used, i.e. relative offsets of local child span start times and all end times are precise. Of course, if a suspension / leap seconds / … happens during the trace, you still have the time drift, but that is much less likely and the next trace will be correct again. This is implemented with the aptly named https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/AnchoredClock.java and this logic at span creation: https://github.com/open-telemetry/opentelemetry-java/blob/16be81aed803e15694de29c9cea25f7bcf4d77c1/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/SdkSpan.java#L151-L182

Am I correct that this bug is probably to blame?

Most likely

I doubt this is enough to actually address all possible permutations of clockskew though 😕

Unfortunately this is also likely correct

I tried to tackle this a while ago in #1019. It is unfortunately a tricky problem. I do think it should be solved though.

When I run new Date(performance.timeOrigin+performance.now()).getTime() - new Date().getTime() in my browser, I seem to get 0.

just close laptop for 30 seconds, open and run it, seems like a valid bug