
jberryman says:

Different approaches are studied in “Fast Bloom Filters and Their Generalization” by Y. Qiao et al. (the one you’ve described here is referred to as “Bloom-1” in the paper).

FWIW, I’ve implemented Bloom-1 as well, here:

https://hackage.haskell.org/package/unagi-bloomfilter

Daniel Lemire says:

Here is what I think is the full reference:

Qiao, Yan, Tao Li, and Shigang Chen. “Fast Bloom filters and their generalization.” IEEE Transactions on Parallel and Distributed Systems 25.1 (2013): 93-103.

Daniel Lemire says:

I appreciate very much the references, both to the paper and to the software. Please note that, as was clear in the post, I did not claim that this was original. If you’d like to elaborate further on your experience, please do so.

Wolfgang Richter says:

To keep your false positive rate at 1%, aren’t you limited in the number of members of the population you can map to each 64-bit word? As in, one shouldn’t map more than 100 members into each 64-bit word? Something like that?

It would be nice to clarify the scalability per word here.

Daniel Lemire says:

The size of the backing array must grow. In my case, I am allocating 12 bits per entry, so if you have 100 entries, you need 1200 bits, or about 19 64-bit words. The more distinct entries you have, the more memory you need to allocate.

It works this way with conventional Bloom filters. You need an array that is large enough if you are going to accommodate your input size.

So the scalability is simple. It is linear: 12 bits per entry. You give me the number of entries, I give you the number of words you need. It is rather easy: something like 12 × the number of entries, in bits. It is no different from Bloom filters.
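The sizing arithmetic above can be sketched in a few lines of Python; the 12 bits per entry is the figure from the comment, and the only detail added is rounding up to a whole number of 64-bit words:

```python
def words_needed(num_entries: int, bits_per_entry: int = 12) -> int:
    """64-bit words needed for a filter budgeted at bits_per_entry per entry."""
    total_bits = num_entries * bits_per_entry
    return (total_bits + 63) // 64  # round up to a whole number of words
```

For 100 entries this gives 1200 bits, i.e. 19 words, matching the figures above.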

Tobin Baker says:

My intuition would be that such small partitions would lead to unacceptable variance. Do you have any analytical results on variance or tail bounds for the FPP?

KWillets says:

There’s an estimate here: https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf

Basically, they model it as lots of little Bloom filters and factor in the variance in populations between the buckets.

Daniel Lemire says:

Thanks, Kendall.

Tobin Baker says:

Thanks for the reference. Indeed, section 3 of this paper anticipates all the ideas in this blog post, I think (in particular, section 3.1, which describes picking a random k-subset from the B bits of a block, using a lookup table with a single hash function as input). But they don’t seem to have numbers for block sizes other than 512 (a 64-byte cache line) in the paper. It should be straightforward to compute bounds on FPP for B = 64 from equations (3) and (4) in the paper, though.
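The computation alluded to here can be sketched as follows: the FPP of a blocked filter with B-bit blocks, k probes per key, and c bits of space per key is the average, over the Poisson-distributed number of keys landing in one block, of the FPP of a standard B-bit Bloom filter holding that many keys. This is a hedged reading of the paper's approach; the function name and the truncation at 200 Poisson terms are choices made here, not the paper's:

```python
import math

def blocked_fpp(B: int, k: int, c: float, terms: int = 200) -> float:
    """Approximate false positive probability of a blocked Bloom filter:
    B bits per block, k bits set per key, c bits of space per key overall.
    Averages the FPP of a B-bit Bloom filter holding i keys over the
    Poisson-distributed block load i with mean B / c."""
    lam = B / c             # expected number of keys hashed to one block
    p = math.exp(-lam)      # Poisson P(load = 0)
    total = 0.0
    for i in range(terms):
        # FPP of a standard B-bit Bloom filter holding i keys, k probes each
        inner = (1.0 - (1.0 - 1.0 / B) ** (i * k)) ** k
        total += p * inner
        p *= lam / (i + 1)  # advance to the next Poisson term
    return total
```

For a fixed budget c, smaller blocks give a larger result, which quantifies the variance penalty asked about above.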

KWillets says:

One thought I had here was to not word-align them and to use overlapping buckets, either at byte or even bit granularity. Loads and cache lines would be unaligned, of course, but still few in number.

It would create more buckets but possibly even out the variations between them.

Daniel Lemire says:

It is worth investigating. My concern is that it makes implementation more difficult. E.g., it might be ugly in Java. So you’d want important benefits.

Tobin Baker says:

Pretty much the original Bloom filter strategy, no? (I.e., overlapping the bitmaps for all k hash functions.) What’s now called the “partitioned” approach (distinct from the “blocked” approach because it separates the bitmaps for each hash function) is more obvious than Bloom’s approach, I think, but performs a bit worse in theory and a bit better in practice (if you believe “A Case for Partitioned Bloom Filters”).

KWillets says:

It’s a restriction on the distance between hashes for the same key — it’s a little more work to analyze (basic Bloom filters are easier due to the independence of each bit position).

Jim Apple says:

I tried this with split-block Bloom filters and byte-level granularity, with blocks of size 256 and lanes of size 32 (that is, setting 1 bit in each 32-bit lane inside an 8-lane block). The difference in false positive probability was negligible – less than switching to 512-bit blocks and 64-bit lanes.

KWillets says:

Interesting — thanks for doing that. I suspect it’s equivalent to the original blocked approach.

Carlo Alberto Ferraris says:

More than word-aligned buckets, wouldn’t it be a good idea to try cacheline-aligned buckets? Because the buckets are cacheline-aligned, all of the bit tests would hit the same cacheline. And, as long as the bit tests have no dependencies on each other, the processor should still be able to execute them in parallel, hiding the latency of loading multiple words from the cache.

The benefit being that, as each bucket would be 8× or 16× the size of a word, the false positive ratio would be lower for a given Bloom filter size.

Carlo Alberto Ferraris says:

I later checked https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf and it’s precisely what I was referring to.

Daniel Lemire says:

Yes, a cache-line approach works and can be implemented with SIMD instructions. In the GitHub organization I link to, there are such implementations.

Note that it is not a good model to assume that processing a whole cache line is the same as processing a single word. It is more complicated.

Tim Armstrong says:

I’d definitely recommend this over the classic Bloom filter design for most applications where performance matters.

We had a good experience using blocked Bloom filters (256-bit blocks) for query processing in Apache Impala – any speedup in Bloom filter query time was a big win. They also got ported to Apache Kudu – there’s a battle-tested implementation with AVX2 support in that codebase.

The newer Bloom filter support in Apache Parquet also uses a similar design – https://github.com/apache/parquet-format/blob/master/BloomFilter.md – which is a big win given the amount of data stored in that format (albeit most of it without Bloom filter indices at this point).

Tobin Baker says:

That last link is very interesting: it’s actually a hybrid of the “blocking” and “partitioning” approaches (blocks are 32 bytes, partitions are 32 bits). It would be interesting to see analytical or simulation results exploring the parameter space of this hybrid approach.

Jim Apple says:

Hi Tobin! I did some benchmarking, including false positive probability, here: https://arxiv.org/abs/2101.01719

KJ says:

https://github.com/kunzjacq/pybloom is a Python project that can compute the false positive probability of a blocked Bloom filter structure as described in the paper cited above (https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf). It can also optimize a design based on constraints on the false positive probability or on the storage used.

(Full disclosure: I am the author of said project.)

Thomas Müller Graf says:

FYI, the Java blocked Bloom filter I have written uses a very similar method: two words, and in each word two bits (which might be the same): https://github.com/FastFilter/fastfilter_java/blob/master/fastfilter/src/main/java/org/fastfilter/bloom/BlockedBloom.java#L60

(I’m not claiming this is new, but I didn’t copy it.)
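The scheme described above (two words per key, two bits per word, all derived from a single hash) can be sketched in Python. This is not the linked Java code; the word-selection and bit-selection details below are illustrative assumptions, not the actual implementation:

```python
class TwoWordBloom:
    """Sketch of a blocked Bloom filter that probes two 64-bit words per key
    and sets two bits in each word (possibly the same bit twice), with all
    positions carved out of a single 64-bit hash value."""

    def __init__(self, num_words: int):
        self.words = [0] * num_words

    def _probes(self, h: int):
        i = h % len(self.words)                         # first word
        j = (i + 1 + ((h >> 8) & 7)) % len(self.words)  # second word, nearby
        # two bits per word, taken from different slices of the hash
        bits1 = (1 << ((h >> 16) & 63)) | (1 << ((h >> 22) & 63))
        bits2 = (1 << ((h >> 28) & 63)) | (1 << ((h >> 34) & 63))
        return i, j, bits1, bits2

    def add(self, h: int) -> None:
        i, j, bits1, bits2 = self._probes(h)
        self.words[i] |= bits1
        self.words[j] |= bits2

    def contains(self, h: int) -> bool:
        i, j, bits1, bits2 = self._probes(h)
        return (self.words[i] & bits1) == bits1 and (self.words[j] & bits2) == bits2
```

By construction there are no false negatives: every bit set by add() is checked by contains().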

Thomas Müller Graf says:

Hm, there are some copy-and-paste problems in the code I posted above… probably the editor saw some XML tags.

Daniel Lemire says:

I fixed your comment.

Sean O'Connor says:

I think there is some paper out there about just storing a hash signature in a hash table, with no full key and no value.

If the hash signature was seen before, of course you get a positive. It is also possible to get a false positive (a collision).

Maybe it needs too many bits.

Anyway, an insert-only Robin Hood hash table sounds ideal for that, and is easy to implement.

Delete operations for the Robin Hood hash table require maintaining a histogram of displacements, and the unwind operation is a little tricky.

Jim Apple says:

Yes, exactly. Cuckoo filters store just a signature, as do quotient filters. The former use cuckoo hashing, while the latter use Robin Hood linear probing.

Tobin Baker says:

I think quotient filters use the Cleary algorithm (bidirectional linear probing combined with Knuth’s “quotienting” trick), but they’re still an “ordered hash table” like Robin Hood.

Tobin Baker says:

Yeah, there are many fingerprint table schemes around, which are theoretically superior to Bloom filters (by a factor of 1/ln 2). However, you can also store just hash codes without losing any information, if you’re hashing integers! Just apply a permutation (a.k.a. a block cipher) instead of a hash function. Instead of block ciphers, I prefer to use a high-quality mixing function like the Murmur3 finalizer. You can see this approach here: https://github.com/senderista/hashtable-benchmarks.
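The permutation trick mentioned here can be sketched directly: the Murmur3 64-bit finalizer is a bijection on 64-bit integers, because an xorshift by 32 or more bits is its own inverse on a 64-bit word and an odd multiplier is invertible mod 2^64. So a table can store mix(key) alone and still recover the key. The constants below are Murmur3's published finalizer constants; the inverse is computed with Python's built-in modular inverse rather than hard-coded:

```python
MASK = (1 << 64) - 1
C1 = 0xFF51AFD7ED558CCD  # Murmur3 fmix64 constants
C2 = 0xC4CEB9FE1A85EC53

def mix(h: int) -> int:
    """Murmur3 64-bit finalizer: a bijection on 64-bit integers."""
    h ^= h >> 33
    h = (h * C1) & MASK
    h ^= h >> 33
    h = (h * C2) & MASK
    h ^= h >> 33
    return h

def unmix(h: int) -> int:
    """Inverse of mix(): xorshift-by-33 is self-inverse on 64-bit words,
    and odd multipliers are invertible mod 2**64."""
    inv1 = pow(C1, -1, 1 << 64)
    inv2 = pow(C2, -1, 1 << 64)
    h ^= h >> 33
    h = (h * inv2) & MASK
    h ^= h >> 33
    h = (h * inv1) & MASK
    h ^= h >> 33
    return h
```

Since unmix(mix(x)) == x for every 64-bit x, storing only the mixed value loses no information about an integer key.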

Lily says:

Thanks for sharing this insightful article.