caffeine: Unintuitive behavior (bug?) with `put()`

Upon migrating from Guava's cache to Caffeine in Bazel, we are seeing some flakiness in tests that heavily exercise weak/soft-valued caches. The behavior of put() for an already-present key in Caffeine appears to diverge from Guava's cache (and ConcurrentHashMap): a concurrent thread calling getIfPresent can observe null, as opposed to either the old value or the new value.

Note that this strange behavior occurs even if the put call does not change the existing value. I believe it’s the result of the existing value reference being cleared non-atomically before the new reference is set.
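The suspected interleaving can be re-enacted deterministically. The following is a minimal sketch using a made-up simplified node (`RaceWindowDemo`, `valueRef`), not Caffeine's actual internals, assuming put() clears the old reference before installing the new one:

```java
import java.util.concurrent.atomic.AtomicReference;

public class RaceWindowDemo {
    // Stands in for the entry's value reference inside a cache node.
    static final AtomicReference<String> valueRef = new AtomicReference<>("val");

    // Interleave the steps explicitly: the reader runs between the
    // writer's "clear" and "set" halves of put().
    static String readerDuringPut() {
        valueRef.set(null);               // writer: step 1 of put — clear the existing reference
        String observed = valueRef.get(); // reader: getIfPresent lands in the window and sees null
        valueRef.set("val");              // writer: step 2 of put — install the (identical) value
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("reader observed: " + readerDuringPut());
        System.out.println("after put: " + valueRef.get());
    }
}
```

Even though the value never logically changes, the reader in the middle of the window observes null.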

The following reproduces the issue within a few iterations:

Cache<String, String> cache = Caffeine.newBuilder().weakValues().build();
String key = "key";
String val = "val";
cache.put(key, val);

AtomicReference<Exception> e = new AtomicReference<>();
List<Thread> threads = new ArrayList<>();
for (int i = 0; i < 10; i++) {
  int name = i;
  Thread t =
          new Thread(
              () -> {
                for (int j = 0; j < 1000; j++) {
                  if (Math.random() < .5) {
                    cache.put(key, val);
                  } else if (cache.getIfPresent(key) == null) {
                    e.compareAndSet(
                        null,
                        new IllegalStateException(
                            "Thread " + name + " observed null on iteration " + j));
                    break;
                  }
                }
              });
  threads.add(t);
  t.start();
}

for (Thread t : threads) {
  t.join();
}

if (e.get() != null) {
  throw e.get();
}

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 19 (19 by maintainers)

Most upvoted comments

I think that the issue as originally described is not actually affecting our tests; I just discovered it by accident. Instead, my understanding is that we’re encountering a race between the cleanup of an evicted value and a call to put that updates that value.

  1. During cleanup, we check that the node’s value reference is still the one that was enqueued.
  2. A concurrent call to put updates the node’s value reference to a live value.
  3. The call to evictEntry then obtains a lock on the node, but does not re-check whether the node was updated with a new value. Thus we proceed to clean up a live entry.
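The three steps above can be re-enacted deterministically. This is a sketch with hypothetical names (`EvictionRaceSketch`, `nodeValue`, `simulate`), not Caffeine's actual evictEntry, assuming the cleanup path checks the reference before locking the node:

```java
import java.util.concurrent.atomic.AtomicReference;

public class EvictionRaceSketch {
    // Stands in for the node's value reference; null models a reference
    // already cleared by the garbage collector and enqueued for cleanup.
    static final AtomicReference<String> nodeValue = new AtomicReference<>(null);
    static final Object node = new Object(); // stand-in for the node's lock

    static boolean simulate() {
        // 1. Cleanup observes the enqueued, collected reference (value == null).
        boolean sawCollected = (nodeValue.get() == null);
        // 2. A concurrent put() updates the node with a live value.
        nodeValue.set("val");
        // 3. evictEntry now locks the node. Without re-checking it would
        //    evict a live entry; re-checking under the lock detects the
        //    resurrection so the eviction can be aborted.
        synchronized (node) {
            return sawCollected && nodeValue.get() != null; // resurrect
        }
    }

    public static void main(String[] args) {
        System.out.println("resurrect = " + simulate());
    }
}
```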

With that understanding, I modified evictEntry with the following check and the issue disappeared, even without any of the other changes we discussed:

synchronized (n) {
  value[0] = n.getValue();
  // Eviction was triggered by reference collection, but the value is now
  // live: a concurrent put() resurrected the entry, so abort the eviction.
  if (cause == RemovalCause.COLLECTED && value[0] != null) {
    resurrect[0] = true;
    return n;
  }
}

Does this explanation and solution make sense to you?

I still think we should fix the originally reported issue as well.