go: runtime/race: inconsistent snapshot of vector clocks during finalizer sync leads to false positive

Over at CockroachDB, we’ve been debugging a reported data race on and off for the past month: https://github.com/cockroachdb/cockroach/issues/46793. There’s a lot of context in that thread and it might be helpful to skim it, if only to see all the possibilities we ruled out, but I’ll try to keep this issue as self-contained as possible.

The interactions are deep in the guts of the database and I’ve never been able to pull out a smaller reproduction. However, this example demonstrates the actors at play: https://play.golang.org/p/FViQigC1Q-N. We have one goroutine pulling pointers to messages from a channel, reading the value of those messages, and putting them in a sync.Pool. A collection of other goroutines are pulling messages out of the sync.Pool, reading/writing to them, and sending them on the channel. It’s all innocent enough. It’s also pretty clear with everything else stripped away that we should just be storing values in the channel and avoiding the sync.Pool entirely, but that’s a topic for another time.
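
For reference, here is roughly the shape of that example (a trimmed-down sketch with made-up names, not the exact playground program):

package main

import "sync"

type message struct {
    data [64]byte
}

var pool = sync.Pool{New: func() interface{} { return new(message) }}

// sink pulls message pointers off the channel, reads them, and recycles
// them into the pool.
func sink(ch <-chan *message) {
    for m := range ch {
        _ = m.data[0]
        pool.Put(m)
    }
}

// source pulls messages out of the pool, reads/writes them, and sends them
// back to the sink over the channel.
func source(ch chan<- *message) {
    for {
        m := pool.Get().(*message)
        m.data[0]++
        ch <- m
    }
}

func main() {
    ch := make(chan *message, 128)
    go sink(ch)
    for i := 0; i < 16; i++ {
        go source(ch)
    }
    select {} // run under -race and heavy stress until killed
}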

What we see is that occasionally under heavy stress, two of the source goroutines will report a race with one another on a message’s memory address. At first, we thought this meant that we were double-freeing these messages by placing them into the pool twice somehow. No amount of reference counting or object tracing found this to be the case.

At this point, we started to look into the sync.Pool implementation itself. Most of the details are in https://github.com/cockroachdb/cockroach/issues/46793 so I won’t say too much here, other than that we could still reproduce the race after disabling everything in the sync.Pool except the per-P private storage. So even without any cross-kernel-thread object sharing, we still saw a reported race.

We then saw an even more confusing race report. If we split the source goroutines into a group that used the pool and a group that allocated each message directly, we would still see reports of races between the two classes of sources. In other words, we’d see a race between the first write to an object after it was allocated and the first write to it after it had been pulled from the pool. Given the structure of the program, these operations should have the fattest happens-before edge imaginable between them, so something was clearly going wrong.

We then turned our eyes to the race detector and ThreadSanitizer itself, as this was starting to look like a loss of causality between goroutines due to some failure to propagate vector clocks during either the channel send between the source and the sink or during the sync.Pool insertion on the sink. We never saw any reported races due to the sink goroutine’s read of the message, so the latter possibility was the best guess. After instrumenting all SyncClock interactions and filtering down to just those on the sync.Pool (with len(poolRaceHash) dropped down to 1) we started to see a clearer picture emerge.
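
For readers who haven’t looked at the pool’s race instrumentation: sync.Pool tells the race detector about the happens-before edge by performing a ReleaseMerge in Put and a matching Acquire in Get on one of a small set of dummy addresses chosen by hashing the pointer (the poolRaceHash array mentioned above). Roughly paraphrased from sync/pool.go (abbreviated, and the details may differ across Go versions):

// In (*Pool).Put, before the object is stored:
if race.Enabled {
    race.ReleaseMerge(poolRaceAddr(x))
    race.Disable()
}

// In (*Pool).Get, after an object is retrieved:
if race.Enabled {
    race.Enable()
    if x != nil {
        race.Acquire(poolRaceAddr(x))
    }
}

Dropping len(poolRaceHash) down to 1 funnels every Put/Get pair through a single SyncClock, which is what made the logs below tractable.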

During the sink goroutine’s insertion into the sync.Pool, its corresponding ReleaseMerge tsan operation was not fully propagating its vector clock. Specifically, it could fail to propagate clock state for other goroutines if it hit this branch, which was introduced as part of this optimization. The optimization tracks the latest time that a thread/goroutine performs an acquire operation (i.e. merges other vector clocks with its own) and avoids an expensive full update of a SyncClock during a release operation (i.e. merging its clock into the clock associated with a synchronization element) if that SyncClock already had a reading above the thread’s last_acquire_ timestamp. If I understand correctly, the correctness argument for this optimization goes something like: if the SyncClock has a higher reading for a given thread than the last time that thread synced with any other thread, then that thread must have nothing new to tell the SyncClock about anyone other than itself, due to how causality flows through vector clocks. So if we hit that branch, we don’t expect the releasing thread to have any clock values above those in the SyncClock (other than its own). And yet, I was able to trigger the following assertion, which violates the fundamental invariant that this optimization is based on:

@@ -248,6 +248,29 @@ void ThreadClock::release(ClockCache *c, SyncClock *dst) {
   // since the last release on dst. If so, we need to update
   // only dst->elem(tid_).
   if (dst->elem(tid_).epoch > last_acquire_) {
+    uptr i = 0;
+    for (ClockElem &ce : *dst) {
+      // Ignore the current thread.
+      if (i == tid_) {
+        i++;
+        continue;
+      }
+      // Ignore dirty state. This might not be necessary, but can't hurt.
+      bool dirty = false;
+      for (unsigned j = 0; j < kDirtyTids; j++) {
+      if (i == dst->dirty_[j].tid) {
+          dirty = true;
+          break;
+        }
+      }
+      if (!dirty) {
+        // The current thread should never have more advanced knowledge about
+        // other threads, or we shouldn't be in this branch.
+        CHECK_LE(clk_[i], ce.epoch);
+      }
+      i++;
+    }
     UpdateCurrentThread(c, dst);
     if (dst->release_store_tid_ != tid_ ||
         dst->release_store_reused_ != reused_)

Furthermore, I was able to see that the clock element that triggered the assertion was from the goroutine that performed the initial mutation on the memory location that was later flagged in the race. In other words, the clock element that was in violation was the very one we needed to merge into the SyncClock to properly propagate causality and establish the happens-before link showing that this is not a data race.

From here, we began looking into how this invariant could be violated. I noticed in the tsan logs that shortly before the violation, I usually saw a call to AcquireGlobal. This function is meant to synchronize the caller thread with all other threads by establishing a happens-before relation from all threads to the current thread. It is called by the Go finalizer goroutine right before executing finalizers. AcquireGlobal is somewhat strange in that it iterates over every other thread, grabs its current clock reading, and puts that into the caller’s clock. Critically, it does not do so during a stop-the-world pause or with any synchronization with the other threads/goroutines. This means that as it iterates over other goroutines, it may see an inconsistent snapshot of their state.

This is exactly what I saw in the logs. I often saw histories like:

#6: AcquireGlobal: thread=1, epoch=1000
#6: AcquireGlobal: thread=2, epoch=1000
...
#378: Acquire, last_acquired_=146416
#6: AcquireGlobal: thread=427, epoch=1000
#839: Acquire, last_acquired_=375081
#6: AcquireGlobal: thread=428, epoch=1000
#6: AcquireGlobal: thread=429, epoch=1000
...
#6: AcquireGlobal: thread=900, epoch=1000

And later on I would see the assertion in ThreadClock::release fire with:

#839: assertion failed for thread=378; 149294 > 148778

Notice that the finalizer goroutine acquired a value from goroutine 378 before both goroutine 378 and goroutine 839 performed their acquisitions, but didn’t acquire a value from goroutine 839 until after those two acquisitions. This allows the final state of the finalizer goroutine’s clock to violate the invariant we previously relied upon in ThreadClock::release. If the finalizer goroutine’s vector clock ever made it into the sync.Pool's SyncClock, goroutine 839’s later release into that SyncClock could take the optimized path and skip the full clock merge, even though goroutine 839 had a larger value for 378’s clock element.
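
To make the failure mode concrete, here is a toy model of the moving pieces, written as ordinary Go, with goroutine numbers borrowed from the log excerpt above (6, 378, 839) and made-up epoch values. It is emphatically not the real tsan algorithm, only its shape: vector clocks, the release fast path keyed off last_acquire_, and an AcquireGlobal that samples each thread’s epoch without synchronization:

package main

import "fmt"

// threadClock is a toy stand-in for tsan's ThreadClock.
type threadClock struct {
    id          int
    clk         map[int]uint64 // vector clock: thread id -> epoch
    lastAcquire uint64         // own epoch at the time of the last acquire
}

// syncClock is a toy stand-in for tsan's SyncClock.
type syncClock struct {
    clk map[int]uint64
}

func newThread(id int) *threadClock {
    return &threadClock{id: id, clk: map[int]uint64{}}
}

func (t *threadClock) tick() { t.clk[t.id]++ }

// acquire merges a SyncClock into the thread's clock (channel recv, pool Get).
func (t *threadClock) acquire(s *syncClock) {
    for id, e := range s.clk {
        if e > t.clk[id] {
            t.clk[id] = e
        }
    }
    t.lastAcquire = t.clk[t.id]
}

// release merges the thread's clock into a SyncClock (channel send, pool Put),
// including the fast path from ThreadClock::release: if the SyncClock has
// already seen this thread past its last acquire, only our own element is
// updated and the full merge is skipped.
func (t *threadClock) release(s *syncClock) {
    t.tick()
    if s.clk[t.id] > t.lastAcquire {
        s.clk[t.id] = t.clk[t.id] // fast path: skip merging the other elements
        return
    }
    for id, e := range t.clk {
        if e > s.clk[id] {
            s.clk[id] = e
        }
    }
}

func main() {
    fin := newThread(6)    // finalizer goroutine
    src1 := newThread(378) // writes the message first
    src2 := newThread(839) // writes it next, after pulling it from the pool

    chanClock := &syncClock{clk: map[int]uint64{}} // the channel's sync state
    poolClock := &syncClock{clk: map[int]uint64{}} // the pool's SyncClock

    // Both sources do some earlier work.
    src1.tick()
    src1.tick()
    src2.tick()
    src2.tick()

    // AcquireGlobal samples src1's epoch early (before the write below).
    fin.clk[src1.id] = src1.clk[src1.id] // 2

    // src1 writes the message and sends it; src2 receives it and thereby
    // learns about src1's write.
    src1.tick() // the write, at epoch 3
    src1.release(chanClock)
    src2.acquire(chanClock)
    src2.tick() // src2's own write to the message

    // AcquireGlobal samples src2's epoch late (after src2's acquire).
    fin.clk[src2.id] = src2.clk[src2.id] // 3 > src2.lastAcquire (2)

    // The finalizer's clock makes it into the pool's SyncClock...
    fin.release(poolClock)

    // ...so when src2 puts the message back, the fast path fires and src2's
    // fresher knowledge of src1 never reaches the SyncClock.
    src2.release(poolClock)

    fmt.Printf("src2 knows src1 is at epoch %d, but the pool's SyncClock says %d\n",
        src2.clk[src1.id], poolClock.clk[src1.id])
}

Running this prints that goroutine 839 knows goroutine 378 is at a later epoch than the pool’s SyncClock records. A later Get that acquires from that SyncClock therefore never learns about 378’s write, and the detector reports 378’s and the getter’s writes as a race.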

So I think we are seeing a false positive from the race detector, caused by a combination of the inconsistent clock state observed by finalizers during their call to AcquireGlobal and the optimization in ThreadClock::release. I confirmed that I cannot hit the race if I disable that optimization.

As an added vote of confidence for this being the problem or being in the near vicinity of the problem, the race became significantly more prevalent in the CockroachDB codebase at about the same time that we turned on https://github.com/cockroachdb/pebble by default as our replacement for RocksDB (in C++). A notable aspect of Pebble, which only became clear after all of this debugging, is that it uses finalizers in a few places to manage C memory that is referenced by pooled objects. It seems very possible that the additional uses of finalizers made this bug easier to hit.
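
For context, the pattern in question looks roughly like this (a simplified, hypothetical sketch; Pebble’s actual types and cgo plumbing differ):

package main

/*
#include <stdlib.h>
*/
import "C"

import (
    "runtime"
    "sync"
    "unsafe"
)

// cBuffer wraps memory allocated outside the Go heap.
type cBuffer struct {
    ptr unsafe.Pointer
    n   int
}

var bufPool = sync.Pool{
    New: func() interface{} {
        b := &cBuffer{ptr: C.malloc(4096), n: 4096}
        // If the wrapper is eventually garbage collected (for example after
        // the pool drops it), the finalizer frees the C memory. Finalizers
        // run on the runtime's finalizer goroutine, which is the goroutine
        // that calls AcquireGlobal under the race detector.
        runtime.SetFinalizer(b, func(b *cBuffer) { C.free(b.ptr) })
        return b
    },
}

func main() {
    b := bufPool.Get().(*cBuffer)
    _ = b.n
    bufPool.Put(b)
    runtime.GC()
}

The more such finalizers run, the more often the finalizer goroutine calls AcquireGlobal, which presumably widened the window for the inconsistent snapshot described above.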

Unfortunately, I don’t have great suggestions for how to fix this. The cleanest solution is probably to make AcquireGlobal properly synchronize with all other threads’/goroutines’ sync operations to ensure that it always observes a consistent global clock state. There will likely be some kind of performance impact from such a change, though, as we’d then need to acquire a global read-lock on every other sync operation.
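
In terms of the toy model above, that suggestion amounts to something like the following (again just a sketch of the idea, reusing the hypothetical threadClock/syncClock types from earlier with "sync" added to that file’s imports, not a proposal for the actual tsan code):

// clockMu protects all vector-clock updates. Ordinary sync operations
// (acquire/release) take the read side, so they can still run in parallel
// with each other; AcquireGlobal takes the write side so that it observes
// every thread's clock at a single consistent point in time.
var clockMu sync.RWMutex

func (t *threadClock) acquireLocked(s *syncClock) {
    clockMu.RLock()
    defer clockMu.RUnlock()
    t.acquire(s)
}

func (t *threadClock) releaseLocked(s *syncClock) {
    clockMu.RLock()
    defer clockMu.RUnlock()
    t.release(s)
}

func acquireGlobal(t *threadClock, all []*threadClock) {
    clockMu.Lock()
    defer clockMu.Unlock()
    for _, u := range all {
        if u.id != t.id && u.clk[u.id] > t.clk[u.id] {
            t.clk[u.id] = u.clk[u.id]
        }
    }
}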

cc. @dvyukov since, in all likelihood, you’re going to be the one taking a look at this.

Most upvoted comments

I’m going to wait 2 weeks to make sure there’s nothing else coming for tsan. Then I’ll rebuild the .syso files.

Proposed fix: https://reviews.llvm.org/D80474

@nvanbenschoten do you mind reviewing it? I think you are the best qualified to meaningfully review the fix logic.