Different approaches are studied in “Fast Bloom Filters and Their Generalization” by Y. Qiao et al. (the one you’ve described here is referred to as “bloom-1” in the paper).
FWIW, I’ve implemented bloom-1 as well, here:
https://hackage.haskell.org/package/unagi-bloomfilter
Here is what I think is the full reference:
Qiao, Yan, Tao Li, and Shigang Chen. “Fast Bloom filters and their generalization.” IEEE Transactions on Parallel and Distributed Systems 25.1 (2013): 93-103.
I very much appreciate the references, both to the paper and to the software. Please note that, as was clear in the post, I did not claim that this was original. If you’d like to elaborate further on your experience, please do so.
To keep your false positive rate at 1%, aren’t you limited in the number of members of the population you can map to each 64-bit word? As in, one shouldn’t map more than 100 members into each 64-bit word? Something like that?
It would be nice to clarify the scalability per word here.
The size of the backing array must grow. In my case, I am allocating 12 bits per entry, so if you have 100 entries, you need 1200 bits or about 19 64-bit words. The more distinct entries you have, the more memory you need to allocate.
It works this way with conventional Bloom filters: you need an array that is large enough to accommodate your input size.
So the scalability is simple: it is linear, at 12 bits per entry. You give me the number of entries, and I give you the number of words you need; the total is roughly 12 * (number of entries) bits, rounded up to whole 64-bit words. It is no different from Bloom filters.
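As a minimal sketch of that arithmetic (the 12-bits-per-entry budget is the figure quoted above; the helper name is mine):

```java
// Hypothetical sizing helper: with a fixed budget of bits per entry, the
// number of 64-bit words grows linearly with the number of entries.
public final class BloomSizing {
    static long wordsNeeded(long entries, int bitsPerEntry) {
        long bits = entries * bitsPerEntry; // e.g. 100 * 12 = 1200 bits
        return (bits + 63) / 64;            // round up to whole 64-bit words
    }

    public static void main(String[] args) {
        System.out.println(wordsNeeded(100, 12)); // prints 19
    }
}
```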
My intuition would be that such small partitions would lead to unacceptable variance. Do you have any analytical results on variance or tail bounds for the FPP?
There’s an estimate here: https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf
Basically they model it as lots of little Bloom filters and factor in the variance in population between the buckets.
Thanks Kendall.
Thanks for the reference. Indeed, section 3 of this paper anticipates all the ideas in this blog post, I think (in particular, 3.1, which describes picking a random k-subset from the B bits of a block, using a lookup table with a single hash function as input). But they don’t seem to have numbers for block sizes other than 512 (a 64-byte cache line) in the paper. It should be straightforward to compute bounds on the FPP for B=64 from equations (3) and (4) in the paper, though.
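For what it’s worth, here is a rough numerical sketch of the usual blocked-filter analysis (a Poisson-weighted average of per-block false positive rates), which should approximate what those equations give; the parameter names and the probe count are my own choices, not the paper’s:

```java
// Sketch: false-positive probability of a blocked Bloom filter with B-bit
// blocks, k bits per key, and a given number of bits of space per key.
// Each block's load is approximately Poisson with mean B / bitsPerKey; the
// overall FPP is the Poisson-weighted average of the classic per-block FPP.
public final class BlockedBloomFpp {
    // Classic Bloom FPP for a filter of B bits holding i keys with k probes.
    static double innerFpp(int B, int i, int k) {
        return Math.pow(1.0 - Math.pow(1.0 - 1.0 / B, (double) k * i), k);
    }

    static double blockedFpp(int B, double bitsPerKey, int k) {
        double lambda = B / bitsPerKey;          // expected keys per block
        double fpp = 0.0;
        double p = Math.exp(-lambda);            // Poisson pmf at i = 0
        int limit = 10 * (int) Math.ceil(lambda) + 50;
        for (int i = 0; i < limit; i++) {
            fpp += p * innerFpp(B, i, k);
            p *= lambda / (i + 1);               // advance Poisson pmf to i + 1
        }
        return fpp;
    }

    public static void main(String[] args) {
        // e.g. 64-bit blocks, 12 bits per key, 5 probes per key
        System.out.println(blockedFpp(64, 12.0, 5));
    }
}
```

The result illustrates the FPP penalty of single-word blocks relative to a classic filter at the same bits-per-key budget.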
One thought I had here was to not word-align them and use overlapping buckets, either at byte or even bit granularity. Loads and cache lines would be unaligned of course, but still few in number.
It would create more buckets but possibly even out the variations between them.
It is worth investigating. My concern is that it makes the implementation more difficult; e.g., it might be ugly in Java. So you’d want substantial benefits.
Pretty much the original Bloom filter strategy, no? (i.e. overlapping the bitmaps for all k hash functions). What’s now called the “partitioned” approach (distinct from the “blocked” approach because it separates the bitmaps for each hash function) is more obvious than Bloom’s approach, I think, but performs a bit worse in theory and a bit better in practice (if you believe “A Case for Partitioned Bloom Filters”).
It’s a restriction on the distance between hashes for the same key — it’s a little more work to analyze (basic Bloom filters are easier due to the independence of each bit position).
I tried this with split block Bloom filters and byte-level granularity with blocks of size 256 and lanes of size 32 (that is, setting 1 bit in each 32-bit lane inside an 8-lane block). The difference in false positive probability was negligible – less than switching to 512-bit blocks and 64-bit lanes.
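For reference, a minimal sketch of the split-block layout being described (256-bit blocks made of eight 32-bit lanes, one bit set per lane); the way lane bits are carved out of the 64-bit hash here is purely illustrative, not the commenter’s code or any particular library’s scheme:

```java
// Illustrative split-block filter: each 256-bit block is eight 32-bit lanes,
// and we set/check exactly one bit in each lane. Lane bit indices are carved
// out of the hash 5 bits at a time (8 * 5 = 40 bits); higher bits pick the block.
public final class SplitBlock {
    final int[] blocks;   // laid out as 32-bit lanes, 8 lanes per block
    final int numBlocks;

    SplitBlock(int numBlocks) {
        this.numBlocks = numBlocks;
        this.blocks = new int[numBlocks * 8];
    }

    void add(long hash) {
        int base = Math.floorMod((int) (hash >>> 40), numBlocks) * 8;
        for (int lane = 0; lane < 8; lane++) {
            int bit = (int) (hash >>> (5 * lane)) & 31;
            blocks[base + lane] |= 1 << bit;
        }
    }

    boolean mightContain(long hash) {
        int base = Math.floorMod((int) (hash >>> 40), numBlocks) * 8;
        for (int lane = 0; lane < 8; lane++) {
            int bit = (int) (hash >>> (5 * lane)) & 31;
            if ((blocks[base + lane] & (1 << bit)) == 0) return false;
        }
        return true;
    }
}
```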
Interesting — thanks for doing that. I suspect it’s equivalent to the original blocked approach.
Rather than word-aligned buckets, wouldn’t it be a good idea to try cacheline-aligned buckets? Because the buckets are cacheline-aligned, all of the bit tests would hit the same cache line. And, as long as the bit tests have no dependency on each other, the processor should still be able to execute them in parallel, hiding the latency of loading multiple words from the cache.
The benefit being that, as each bucket would be 8 or 16 times the size of a word, the false positive rate would be lower for a given Bloom filter size.
I later checked https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf and it’s precisely what I was referring to.
Yes, a cache-line approach works and can be implemented with SIMD instructions. There are such implementations in the GitHub organization I link to.
Note that it is not a good model to assume that processing a whole cache line is the same as processing a single word. It is more complicated.
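To illustrate the point about independent loads within one cache line, here is a hedged scalar sketch (membership test only; insertion mirrors it). The class name, the probe count, and the way successive probes are derived from the hash are assumptions made for the example, not anyone’s actual implementation:

```java
// Sketch of a cache-line-blocked check: a block is 8 consecutive 64-bit words
// (64 bytes). Each probe picks one word and one bit within it, so all loads
// land in the same cache line and have no data dependency on each other,
// letting the CPU overlap them.
public final class CacheLineBlockedBloom {
    final long[] words;      // ideally allocated 64-byte aligned
    final int numBlocks;     // words.length / 8
    static final int K = 8;  // probes per key (illustrative choice)

    CacheLineBlockedBloom(int numBlocks) {
        this.numBlocks = numBlocks;
        this.words = new long[numBlocks * 8];
    }

    boolean mightContain(long h1, long h2) {
        int base = Math.floorMod((int) h1, numBlocks) * 8; // h1 selects the block
        long h = h2;                                       // h2 drives the probes
        for (int i = 0; i < K; i++) {
            int word = (int) (h & 7);          // which of the 8 words in the block
            int bit = (int) ((h >>> 3) & 63);  // which bit in that word
            if ((words[base + word] & (1L << bit)) == 0) return false;
            h = h * 0x9E3779B97F4A7C15L + 1;   // derive the next probe (illustrative)
        }
        return true;
    }
}
```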
I’d definitely recommend them over the classic Bloom filter design for most applications where performance matters.
We had a good experience using blocked bloom filters (256-bit blocks) for query processing in Apache Impala – any speedup in bloom filter query time was a big win. They also got ported to Apache Kudu – there’s a battle-tested implementation with AVX2 support in that codebase.
The newer bloom filter support in Apache Parquet also uses a similar design – https://github.com/apache/parquet-format/blob/master/BloomFilter.md – which is a big win given the amount of data stored in that format (although most of it is stored without bloom filter indices at this point).
That last link is very interesting: it’s actually a hybrid of the “blocking” and “partitioning” approaches (blocks are 32 bytes, partitions are 32 bits). It would be interesting to see analytical or simulation results exploring the parameter space of this hybrid approach.
Hi Tobin! I did some benchmarking, including false positive probability, here: https://arxiv.org/abs/2101.01719
https://github.com/kunzjacq/pybloom is a Python project that can compute the false positive probability of a blocked Bloom filter structure as described in the paper cited above (https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf). It can also optimize a design under constraints on the false positive probability or on the storage used.
(Full disclosure: I am the author of said project.)
FYI the Java blocked Bloom filter I have written uses a very similar method: two words, and in each word two bits (that might be the same): https://github.com/FastFilter/fastfilter_java/blob/master/fastfilter/src/main/java/org/fastfilter/bloom/BlockedBloom.java#L60
(I’m not claiming this is new, but I didn’t copy it.)
Hm there are some copy & paste problems in the code I posted above… Probably the editor saw some XML tags.
I fixed your comment.
I think there is some paper out there about just storing a hash signature in a hash table, no full key and no value.
If the hash signature was seen before, of course you get a positive. It is also possible to get a false positive (a collision).
Maybe it needs too many bits.
Anyway, an insert-only Robin Hood hash table sounds ideal for that, and easy to implement.
Delete operations for a Robin Hood hash table require maintaining a histogram of displacements, and the unwind operation is a little tricky.
Yes, exactly. Cuckoo filters store just a signature, as do quotient filters. The former use cuckoo hashing, while the latter use Robin Hood linear probing.
I think quotient filters use the Cleary algorithm (bidirectional linear probing combined with Knuth’s “quotienting” trick), but they’re still an “ordered hash table” like Robin Hood.
Yeah there are many fingerprint table schemes around, which are theoretically superior to Bloom filters (by a factor of 1/ln 2). However, you can also store just hash codes without losing any information, if you’re hashing integers! Just apply a permutation (aka block cipher) instead of a hash function. Instead of block ciphers, I prefer to use a high-quality mixing function like the Murmur3 finalizer. You can see this approach here: https://github.com/senderista/hashtable-benchmarks.
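To make the permutation idea concrete, here is the MurmurHash3 64-bit finalizer mentioned above; every step is invertible, so it is a bijection on 64-bit values and storing the mixed value loses no information about an integer key (the class name is just for the example):

```java
// The MurmurHash3 64-bit finalizer (fmix64). Xor-shifts and multiplications
// by odd constants are both invertible modulo 2^64, so this is a permutation
// of the 64-bit integers, not a lossy hash.
public final class Mix {
    static long fmix64(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }
}
```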
Thanks for sharing this insightful article.