caffeine: Very high false positive rate observed for BloomFilter implementation.

I observed a very high false positive rate with the current implementation of BloomFilter, at times as high as 100%. My test code and results are given below. I also found out what can be changed to fix it, although not completely sure why my fix worked. I thought maybe bitmask method’s output has some bias, but upon further inspection it seems alright. I hope I am using the APIs correctly and don’t have some other bug in my test code. Would be great if someone can review and let me know. tks! P.S.: I understand that BloomFilter code might be internal to caffeine, but just want to highlight my observation.

Line: https://github.com/ben-manes/caffeine/blob/master/simulator/src/main/java/com/github/benmanes/caffeine/cache/simulator/admission/bloom/BloomFilter.java#L166

Current code:

static long bitmask(int hash) {
    return 1L << ((hash >>> 8) & INDEX_MASK);
  }

Number of Insertions	False positives(% )	True positives
1024	27 (2.636719%)	1024
4096	640 (15.625000%)	4096
16384	15213 (92.852783%)	16384
65536	65536 (100.000000%)	65536
262144	262144 (100.000000%)	262144
1048576	1048576 (100.000000%)	1048576

New implementation:

static long bitmask(int hash) {
    return 1L << ((hash >>> 24) & INDEX_MASK);
  }

Number of Insertions	False positives(%)	True positives
1024	15 (1.464844%)	1024
4096	96 (2.343750%)	4096
16384	391 (2.386475%)	16384
65536	1598 (2.438354%)	65536
262144	6326 (2.413177%)	262144
1048576	25600 (2.441406%)	1048576

Test method:

    public void bloomFilterTest() {
        System.out.println("Number of Insertions\tFalse positives(%)\tTrue positives");
        for (int capacity = 2 << 10; capacity < 2 << 22; capacity = capacity << 2) {
            long[] input = new Random().longs(capacity).distinct().toArray();
            BloomFilter bf = new BloomFilter(input.length / 2, new Random().nextInt());
            int truePositives = 0;
            int falsePositives = 0;
            int i = 0;
            // Add only first half of input array to bloom filter
            for (; i < (input.length / 2); i++) {
                bf.put(input[i]);
            }
            // First half should be part of the bloom filter
            for (int k = 0; k < i; k++) {
                truePositives += bf.mightContain(input[k]) ? 1 : 0;
            }
            // Second half shouldn't be part of the bloom filter
            for (; i < input.length; i++) {
                falsePositives += bf.mightContain(input[i]) ? 1 : 0;
            }
            System.out.format("%d\t\t%d(%f%%)\t\t%d\n",
                input.length / 2, falsePositives,
                ((float) falsePositives / (input.length / 2)) * 100, truePositives);
        }
    }

About this issue

Original URL
State: closed
Created 8 years ago
Comments: 57 (33 by maintainers)

Commits related to this issue

High false positive rate in simulator's BloomFilter (fixes #85) See the issue for the discussion. Clearly I goofed in my thought process and should have taken the highest most bits. This had no effec... — committed to ben-manes/caffeine by ben-manes 8 years ago
High false positive rate in simulator's BloomFilter (fixes #85) See the issue for the discussion. Clearly I goofed in my thought process and should have taken the highest most bits. This had no effec... — committed to ben-manes/caffeine by ben-manes 8 years ago

Most upvoted comments

All questions are wonderful. I’m going to edit your message to number the sub-items.

ben-manes on Jan 16, 2017