Word-aligned Bloom filters – Daniel Lemire's blog

jberryman says:

October 3, 2021 at 6:06 pm

Different approaches are studied in
“Fast Bloom Filters and Their Generalization” by Y Qiao, et al. (the one you e described here is referred to as “bloom-1” in the paper)

Fwiw I’ve implemented bloom-1 as well here
https://hackage.haskell.org/package/unagi-bloomfilter

Reply

Daniel Lemire says:

October 3, 2021 at 6:18 pm

Here is what I think is the full reference:

Qiao, Yan, Tao Li, and Shigang Chen. “Fast Bloom filters and their generalization.” IEEE Transactions on Parallel and Distributed Systems 25.1 (2013): 93-103.

Reply
1. Daniel Lemire says:
  
  October 3, 2021 at 6:21 pm
  
  I appreciate very much the references, both to the paper and to the software. Please note that, as was clear in the post, I did not claim that this was original. If you’d like to elaborate further on your experience, please do so.
  
  Reply

Wolfgang Richter says:

October 3, 2021 at 9:14 pm

To keep your false positive rate at 1%, aren’t you limited in the number of members of the population you can map to each 64-bit word? As in, one shouldn’t map more than 100 members into each 64-bit word? Something like that?

Would be nice to clarify the scalability per word here.

Reply

Daniel Lemire says:

October 3, 2021 at 9:35 pm

The size of the backing array must grow. In my case, I am allocating 12 bits per entry, so if you have 100 entries, you need 1200 bits or about 19 64-bit words. The more distinct entries you have, the more memory you need to allocate.

It works this way with conventional Bloom filters. You need an array that is large enough if you are going to accommodate your input size.

So the scalability is simple. It is linear. 12 bits per entry. You give me the number of entries, I give you the number of words you need. It is rather easy, something like 12 * number bits. It is no different from Bloom filters.

Reply

Tobin Baker says:

October 3, 2021 at 10:28 pm

My intuition would be that such small partitions would lead to unacceptable variance. Do you have any analytical results on variance or tail bounds for the FPP?

Reply

KWillets says:

October 3, 2021 at 10:34 pm

There’s an estimate here: https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf

Basically they take the lots of little Bloom filters and factor in the variance in populations between the buckets.

Reply
1. Daniel Lemire says:
  
  October 3, 2021 at 10:43 pm
  
  Thanks Kendall.
  
  Reply
2. Tobin Baker says:
  
  October 3, 2021 at 10:52 pm
  
  Thanks for the reference. Indeed, section 3 of this paper anticipates all the ideas in this blog post, I think (in particular, 3.1, which describes picking a random k-subset from the B bits of a block, using a lookup table with a single hash function as input). But they don’t seem to have numbers for block sizes other than 512 (64-byte cache line) in the paper. It should be straightforward to compute bounds on FPP for B=64 from equations 3) and 4) in the paper, though.
  
  Reply

KWillets says:

October 3, 2021 at 10:45 pm

One thought I had here was to not word-align them and use overlapping buckets, either at byte or even bit granularity. Loads and cache lines would be unaligned of course, but still few in number.

It would create more buckets but possibly even out the variations between them.

Reply

Daniel Lemire says:

October 3, 2021 at 10:47 pm

It is worth investigating. My concern is that it makes implementation more difficult. E.g., it might be ugly in Java. So you’d want important benefits.

Reply
Tobin Baker says:

October 3, 2021 at 10:57 pm

Pretty much the original Bloom filter strategy, no? (i.e. overlapping the bitmaps for all k hash functions). What’s now called the “partitioned” approach (distinct from the “blocked” approach because it separates the bitmaps for each hash function) is more obvious than Bloom’s approach, I think, but performs a bit worse in theory and a bit better in practice (if you believe “A Case for Partitioned Bloom Filters”).

Reply
1. KWillets says:
  
  October 3, 2021 at 11:34 pm
  
  It’s a restriction on the distance between hashes for the same key — it’s a little more work to analyze (basic Bloom filters are easier due to the independence of each bit position).
  
  Reply
Jim Apple says:

October 9, 2021 at 11:34 pm

I tried this with split block Bloom filters and byte-level granularity with blocks of size 256 and lanes of size 32 (that is, setting 1 bit in each 32-bit lane inside an 8-lane block). The difference in false positive probability was negligible – less than switching to 512-bit blocks and 64-bit lanes.

Reply
1. KWillets says:
  
  October 11, 2021 at 5:16 pm
  
  Interesting — thanks for doing that. I suspect it’s equivalent to the original blocked approach.
  
  Reply

Carlo Alberto Ferraris says:

October 4, 2021 at 12:16 am

More than word-aligned buckets, wouldn’t it be a good idea to try with cacheline-aligned buckets? Because the buckets are cacheline-aligned, all of the bit tests would hit the same cacheline. And, as long as the bit tests have no dependency with each other, the processor should still be able to execute them in parallel, hiding the latency of loading multiple words from the cache.

The benefit being that, as each bucket would be 8x or 16x times the size of word, false positives ratio would be lower for a given bloom filter size.

Reply

Carlo Alberto Ferraris says:

October 4, 2021 at 12:17 am

Checked later on https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf and it’s precisely what I was referring to.

Reply
1. Daniel Lemire says:
  
  October 4, 2021 at 12:36 am
  
  Yes a cache-line approach works and can be implemented with SIMD instructions. In the GitHub org I link to, there are such implementations.
  
  Note that it is not a good model to assume that processing a whole cache line is the same as processing a single word. It is more complicated.
  
  Reply

Tim Armstrong says:

October 4, 2021 at 5:26 am

I’d definitely recommend over the classic bloom filter design for most applications where performance matters.

We had a good experience using blocked bloom filters (256-bit blocks) for query processing in Apache Impala – any speedup in bloom filter query time was a big win. They also got ported to Apache Kudu – there’s a battle tested implementation with AVX2 support in that codebase.

The newer bloom filter support in Apache Parquet also uses a similar design – https://github.com/apache/parquet-format/blob/master/BloomFilter.md – which is a big win given the amount of data stored in that format (albeit most without bloom filter indices at this point).

Reply

Tobin Baker says:

October 4, 2021 at 6:29 pm

That last link is very interesting: it’s actually a hybrid of the “blocking” and “partitioning” approaches (blocks are 32 bytes, partitions are 32 bits). It would be interesting to see analytical or simulation results exploring the parameter space of this hybrid approach.

Reply
1. Jim Apple says:
  
  October 9, 2021 at 11:42 pm
  
  Hi Tobin! I did some benchmarking, including false positive probability, here: https://arxiv.org/abs/2101.01719
  
  Reply

KJ says:

October 4, 2021 at 9:20 am

https://github.com/kunzjacq/pybloom

is a python project able to compute the false positive probability of a structure with blocked bloom filters as described in the paper cited above (https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf ). It can also optimize a design based on constraints on the false positive probability or on the storage used.

(full disclosure: I a am the author of said project)

Reply

Thomas Müller Graf says:

October 7, 2021 at 6:32 am

FYI the Java blocked Bloom filter I have written uses a very similar method: two words, and in each word two bits (that might be the same): https://github.com/FastFilter/fastfilter_java/blob/master/fastfilter/src/main/java/org/fastfilter/bloom/BlockedBloom.java#L60

    public boolean mayContain(long key) {
        long hash = Hash.hash64(key, seed);
        int start = Hash.reduce((int) hash, buckets);
        hash = hash ^ Long.rotateLeft(hash, 32);
        long a = data[start];
        long b = data[start + 1 + (int) (hash >>> 60)];
        long m1 = (1L << hash) | (1L << (hash >> 6));
        long m2 = (1L << (hash >> 12)) | (1L << (hash >> 18));
        return ((m1 & a) == m1) && ((m2 & b) == m2);
    }

(I’m not claiming this is new, but I didn’t copy it.)

Reply

Thomas Müller Graf says:

October 7, 2021 at 6:35 am

Hm there are some copy & paste problems in the code I posted above… Probably the editor saw some XML tags.

Reply
1. Daniel Lemire says:
  
  October 7, 2021 at 9:23 pm
  
  I fixed your comment.
  
  Reply

Sean O'Connor says:

October 10, 2021 at 9:34 am

I think there is some paper out there about just storing a hash signature in a hash table, no full key and no value.
If the hash signature was seen before of course you get a positive. It is also possible to get a false positive (collision.)
Maybe it needs too many bits.
Anyway an insert only Robin Hood hash table sounds ideal for that, and easy to implement.
Delete operations for the Robin Hood hash table require maintaining a histogram of displacements and the unwind operation is a little tricky.

Reply

Jim Apple says:

October 10, 2021 at 6:23 pm

Yes, exactly. Cuckoo filters store just a signature, as do quotient filters. The former use cuckoo hashing, while the latter use Robin Hood linear probing.

Reply
1. Tobin Baker says:
  
  October 11, 2021 at 11:22 pm
  
  I think quotient filters use the Cleary algorithm (bidirectional linear probing combined with Knuth’s “quotienting” trick), but they’re still an “ordered hash table” like Robin Hood.
  
  Reply
Tobin Baker says:

October 11, 2021 at 11:27 pm

Yeah there are many fingerprint table schemes around, which are theoretically superior to Bloom filters (by a factor of 1/ln 2). However, you can also store just hash codes without losing any information, if you’re hashing integers! Just apply a permutation (aka block cipher) instead of a hash function. Instead of block ciphers, I prefer to use a high-quality mixing function like the Murmur3 finalizer. You can see this approach here: https://github.com/senderista/hashtable-benchmarks.

Reply

Lily says:

June 2, 2022 at 6:37 am

Thanks for sharing this insightful article.

Reply