I observed a very high false positive rate with the current implementation of BloomFilter, at times as high as 100%. My test code and results are given below. I also found out what can be changed to fix it, although not completely sure why my fix worked. I thought maybe bitmask method’s output has some bias, but upon further inspection it seems alright. I hope I am using the APIs correctly and don’t have some other bug in my test code. Would be great if someone can review and let me know. tks!
P.S.: I understand that BloomFilter code might be internal to caffeine, but just want to highlight my observation.
Line: https://github.com/ben-manes/caffeine/blob/master/simulator/src/main/java/com/github/benmanes/caffeine/cache/simulator/admission/bloom/BloomFilter.java#L166
Current code:
static long bitmask(int hash) {
return 1L << ((hash >>> 8) & INDEX_MASK);
}
Number of Insertions |
False positives(% ) |
True positives |
1024 |
27 (2.636719%) |
1024 |
4096 |
640 (15.625000%) |
4096 |
16384 |
15213 (92.852783%) |
16384 |
65536 |
65536 (100.000000%) |
65536 |
262144 |
262144 (100.000000%) |
262144 |
1048576 |
1048576 (100.000000%) |
1048576 |
New implementation:
static long bitmask(int hash) {
return 1L << ((hash >>> 24) & INDEX_MASK);
}
Number of Insertions |
False positives(%) |
True positives |
1024 |
15 (1.464844%) |
1024 |
4096 |
96 (2.343750%) |
4096 |
16384 |
391 (2.386475%) |
16384 |
65536 |
1598 (2.438354%) |
65536 |
262144 |
6326 (2.413177%) |
262144 |
1048576 |
25600 (2.441406%) |
1048576 |
Test method:
public void bloomFilterTest() {
System.out.println("Number of Insertions\tFalse positives(%)\tTrue positives");
for (int capacity = 2 << 10; capacity < 2 << 22; capacity = capacity << 2) {
long[] input = new Random().longs(capacity).distinct().toArray();
BloomFilter bf = new BloomFilter(input.length / 2, new Random().nextInt());
int truePositives = 0;
int falsePositives = 0;
int i = 0;
// Add only first half of input array to bloom filter
for (; i < (input.length / 2); i++) {
bf.put(input[i]);
}
// First half should be part of the bloom filter
for (int k = 0; k < i; k++) {
truePositives += bf.mightContain(input[k]) ? 1 : 0;
}
// Second half shouldn't be part of the bloom filter
for (; i < input.length; i++) {
falsePositives += bf.mightContain(input[i]) ? 1 : 0;
}
System.out.format("%d\t\t%d(%f%%)\t\t%d\n",
input.length / 2, falsePositives,
((float) falsePositives / (input.length / 2)) * 100, truePositives);
}
}
All questions are wonderful. I’m going to edit your message to number the sub-items.