caffeine: Unintuitive behavior (bug?) with `put()`

Upon migrating from Guava's cache to Caffeine in Bazel, we are seeing some flakiness in tests that heavily exercise weak/soft-valued caches. The behavior of put() for an already-present key in Caffeine appears to diverge from Guava's cache (and ConcurrentHashMap): a concurrent thread calling getIfPresent can observe null, as opposed to either the old value or the new value.

Note that this strange behavior occurs even if the put call does not change the existing value. I believe it’s the result of the existing value reference being cleared non-atomically before the new reference is set.
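The suspected interleaving can be re-enacted deterministically. The following is a minimal sketch using a made-up simplified node (`RaceWindowDemo`, `valueRef`), not Caffeine's actual internals, assuming put() clears the old reference before installing the new one:

```java
import java.util.concurrent.atomic.AtomicReference;

public class RaceWindowDemo {
    // Stands in for the entry's value reference inside a cache node.
    static final AtomicReference<String> valueRef = new AtomicReference<>("val");

    // Interleave the steps explicitly: the reader runs between the
    // writer's "clear" and "set" halves of put().
    static String readerDuringPut() {
        valueRef.set(null);               // writer: step 1 of put — clear the existing reference
        String observed = valueRef.get(); // reader: getIfPresent lands in the window and sees null
        valueRef.set("val");              // writer: step 2 of put — install the (identical) value
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("reader observed: " + readerDuringPut());
        System.out.println("after put: " + valueRef.get());
    }
}
```

Even though the value never logically changes, the reader in the middle of the window observes null.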

The following reproduces the issue within a few iterations:

Cache<String, String> cache = Caffeine.newBuilder().weakValues().build();
String key = "key";
String val = "val";
cache.put(key, val);

AtomicReference<Exception> e = new AtomicReference<>();
List<Thread> threads = new ArrayList<>();
for (int i = 0; i < 10; i++) {
  int name = i;
  Thread t =
          new Thread(
              () -> {
                for (int j = 0; j < 1000; j++) {
                  if (Math.random() < .5) {
                    cache.put(key, val);
                  } else if (cache.getIfPresent(key) == null) {
                    e.compareAndSet(
                        null,
                        new IllegalStateException(
                            "Thread " + name + " observed null on iteration " + j));
                    break;
                  }
                }
              });
  threads.add(t);
  t.start();
}

for (Thread t : threads) {
  t.join();
}

if (e.get() != null) {
  throw e.get();
}

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 19 (19 by maintainers)

Most upvoted comments

I think that the issue as originally described is not actually affecting our tests; I just discovered it by accident. Instead, my understanding is that we’re encountering a race between the cleanup of an evicted value and a call to put that updates that value.

  1. During cleanup, we check that the node’s value reference is still the one that was enqueued.
  2. A concurrent call to put updates the node’s value reference to a live value.
  3. The call to evictEntry then obtains a lock on the node, but does not re-check whether the node was updated with a new value. Thus we proceed to clean up a live entry.
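The three steps above can be re-enacted deterministically. This is a sketch with hypothetical names (`EvictionRaceSketch`, `nodeValue`, `simulate`), not Caffeine's actual evictEntry, assuming the cleanup path checks the reference before locking the node:

```java
import java.util.concurrent.atomic.AtomicReference;

public class EvictionRaceSketch {
    // Stands in for the node's value reference; null models a reference
    // already cleared by the garbage collector and enqueued for cleanup.
    static final AtomicReference<String> nodeValue = new AtomicReference<>(null);
    static final Object node = new Object(); // stand-in for the node's lock

    static boolean simulate() {
        // 1. Cleanup observes the enqueued, collected reference (value == null).
        boolean sawCollected = (nodeValue.get() == null);
        // 2. A concurrent put() updates the node with a live value.
        nodeValue.set("val");
        // 3. evictEntry now locks the node. Without re-checking it would
        //    evict a live entry; re-checking under the lock detects the
        //    resurrection so the eviction can be aborted.
        synchronized (node) {
            return sawCollected && nodeValue.get() != null; // resurrect
        }
    }

    public static void main(String[] args) {
        System.out.println("resurrect = " + simulate());
    }
}
```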

With that understanding, I modified evictEntry with the following check and the issue disappeared, even without any of the other changes we discussed:

synchronized (n) {
  value[0] = n.getValue();
  // Eviction was triggered by reference collection, but the value is now
  // live: a concurrent put() resurrected the entry, so abort the eviction.
  if (cause == RemovalCause.COLLECTED && value[0] != null) {
    resurrect[0] = true;
    return n;
  }
}

Does this explanation and solution make sense to you?

I still think we should fix the originally reported issue as well.