
30 thoughts on “Word-aligned Bloom filters”

    1. Here is what I think is the full reference:

      Qiao, Yan, Tao Li, and Shigang Chen. “Fast Bloom filters and their generalization.” IEEE Transactions on Parallel and Distributed Systems 25.1 (2013): 93-103.

      1. I appreciate very much the references, both to the paper and to the software. Please note that, as was clear in the post, I did not claim that this was original. If you’d like to elaborate further on your experience, please do so.

  1. To keep your false positive rate at 1%, aren’t you limited in the number of members of the population you can map to each 64-bit word? As in, one shouldn’t map more than 100 members into each 64-bit word? Something like that?

    It would be nice to clarify the per-word scalability here.

    1. The size of the backing array must grow. In my case, I am allocating 12 bits per entry, so if you have 100 entries, you need 1200 bits or about 19 64-bit words. The more distinct entries you have, the more memory you need to allocate.

      It works this way with conventional Bloom filters. You need an array that is large enough to accommodate your input size.

      So the scalability is simple. It is linear: 12 bits per entry. You give me the number of entries, I give you the number of words you need. It is rather easy, something like 12 * (number of entries) bits. It is no different from conventional Bloom filters.
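
      For concreteness, here is that arithmetic as a tiny Java sketch (the 12-bits-per-entry budget is the one from this discussion, not a universal constant):

          // sizing sketch: 12 bits per entry, rounded up to whole 64-bit words
          static long wordsNeeded(long entries) {
              long bits = 12 * entries;   // 12 bits per entry
              return (bits + 63) / 64;    // round up to whole 64-bit words
          }
          // wordsNeeded(100) == 19, i.e. about 1200 bits for 100 entries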

  2. Tobin Baker says:

    My intuition would be that such small partitions would lead to unacceptable variance. Do you have any analytical results on variance or tail bounds for the FPP?

      1. Tobin Baker says:

        Thanks for the reference. Indeed, section 3 of this paper anticipates all the ideas in this blog post, I think (in particular, section 3.1, which describes picking a random k-subset from the B bits of a block, using a lookup table with a single hash function as input). But they don’t seem to have numbers for block sizes other than 512 (a 64-byte cache line) in the paper. It should be straightforward to compute bounds on the FPP for B=64 from equations (3) and (4) in the paper, though.
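
        For a rough idea, leaving the paper’s exact equations aside, the usual blocked-Bloom approximation treats each B-bit block as a tiny classic Bloom filter whose occupancy is roughly Poisson. A small Java sketch of that estimate (bits assumed chosen with replacement, so this is not equations (3) and (4) themselves):

            // rough FPP estimate for B-bit blocks with k probes per key
            static double blockedFpp(int B, int k, double bitsPerKey) {
                double lambda = B / bitsPerKey;     // expected keys per block
                double fpp = 0.0;
                double poisson = Math.exp(-lambda); // P[block holds 0 keys]
                for (int j = 0; j < 200; j++) {
                    double perBlock = Math.pow(1.0 - Math.pow(1.0 - 1.0 / B, (double) j * k), k);
                    fpp += poisson * perBlock;
                    poisson *= lambda / (j + 1);    // advance to P[block holds j + 1 keys]
                }
                return fpp;
            }
            // e.g. blockedFpp(64, 5, 12.0) for 64-bit blocks at 12 bits per entry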

  3. KWillets says:

    One thought I had here was to not word-align them and use overlapping buckets, either at byte or even bit granularity. Loads and cache lines would be unaligned of course, but still few in number.

    It would create more buckets but possibly even out the variations between them.
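
    As an illustrative sketch (not code from the post), a byte-granularity bucket could simply be a 64-bit load from an arbitrary byte offset:

        // illustrative only: an overlapping, byte-aligned 64-bit bucket read
        // from a byte[] backing array; the bucket may start at any byte
        static long loadBucket(byte[] filter, long hash) {
            int byteOffset = (int) Long.remainderUnsigned(hash, filter.length - 8);
            return java.nio.ByteBuffer.wrap(filter)
                    .order(java.nio.ByteOrder.LITTLE_ENDIAN)
                    .getLong(byteOffset);
        }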

    1. Tobin Baker says:

      Pretty much the original Bloom filter strategy, no? (i.e. overlapping the bitmaps for all k hash functions). What’s now called the “partitioned” approach (distinct from the “blocked” approach because it separates the bitmaps for each hash function) is more obvious than Bloom’s approach, I think, but performs a bit worse in theory and a bit better in practice (if you believe “A Case for Partitioned Bloom Filters”).

      1. KWillets says:

        It’s a restriction on the distance between hashes for the same key — it’s a little more work to analyze (basic Bloom filters are easier due to the independence of each bit position).

    2. Jim Apple says:

      I tried this with split block Bloom filters and byte-level granularity with blocks of size 256 and lanes of size 32 (that is, setting 1 bit in each 32-bit lane inside an 8-lane block). The difference in false positive probability was negligible – less than switching to 512-bit blocks and 64-bit lanes.
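
      For reference, the basic (aligned) split-block membership test might look roughly like this in scalar Java; the way the per-lane bit indices are derived from the hash here is illustrative, not the Impala or Kudu code:

          // split-block check: a 256-bit block is 8 lanes of 32 bits, one bit per lane
          static boolean mayContain(int[] block, long hash) { // block.length == 8
              for (int lane = 0; lane < 8; lane++) {
                  int bit = (int) (hash >>> (5 * lane)) & 31; // 5 hash bits per lane
                  if ((block[lane] & (1 << bit)) == 0) {
                      return false;
                  }
              }
              return true;
          }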

  4. Rather than word-aligned buckets, wouldn’t it be a good idea to try cacheline-aligned buckets? Because the buckets are cacheline-aligned, all of the bit tests would hit the same cacheline. And, as long as the bit tests have no dependency on each other, the processor should still be able to execute them in parallel, hiding the latency of loading multiple words from the cache.

    The benefit would be that, as each bucket would be 8 or 16 times the size of a word, the false positive ratio would be lower for a given Bloom filter size.

      1. Yes a cache-line approach works and can be implemented with SIMD instructions. In the GitHub org I link to, there are such implementations.

        Note that it is not a good model to assume that processing a whole cache line is the same as processing a single word. It is more complicated.

  5. Tim Armstrong says:

    I’d definitely recommend this over the classic Bloom filter design for most applications where performance matters.

    We had a good experience using blocked Bloom filters (256-bit blocks) for query processing in Apache Impala – any speedup in Bloom filter query time was a big win. They also got ported to Apache Kudu – there’s a battle-tested implementation with AVX2 support in that codebase.

    The newer Bloom filter support in Apache Parquet also uses a similar design – https://github.com/apache/parquet-format/blob/master/BloomFilter.md – which is a big win given the amount of data stored in that format (albeit most of it without Bloom filter indices at this point).

    1. Tobin Baker says:

      That last link is very interesting: it’s actually a hybrid of the “blocking” and “partitioning” approaches (blocks are 32 bytes, partitions are 32 bits). It would be interesting to see analytical or simulation results exploring the parameter space of this hybrid approach.

  6. Thomas Müller Graf says:

    FYI the Java blocked Bloom filter I have written uses a very similar method: two words, and in each word two bits (that might be the same): https://github.com/FastFilter/fastfilter_java/blob/master/fastfilter/src/main/java/org/fastfilter/bloom/BlockedBloom.java#L60

        public boolean mayContain(long key) {
            long hash = Hash.hash64(key, seed);
            // pick the first word of the block
            int start = Hash.reduce((int) hash, buckets);
            // mix the hash so the bit indices below are spread out
            hash = hash ^ Long.rotateLeft(hash, 32);
            long a = data[start];
            // the second word is one of the 16 words that follow the first
            long b = data[start + 1 + (int) (hash >>> 60)];
            // two bit positions per word, taken from successive 6-bit slices of the hash
            long m1 = (1L << hash) | (1L << (hash >> 6));
            long m2 = (1L << (hash >> 12)) | (1L << (hash >> 18));
            return ((m1 & a) == m1) && ((m2 & b) == m2);
        }

    (I’m not claiming this is new, but I didn’t copy it.)

    1. Thomas Müller Graf says:

      Hm there are some copy & paste problems in the code I posted above… Probably the editor saw some XML tags.

  7. I think there is some paper out there about just storing a hash signature in a hash table, with no full key and no value.
    If the hash signature was seen before, you of course get a positive. It is also possible to get a false positive (a collision).
    Maybe it needs too many bits.
    Anyway, an insert-only Robin Hood hash table sounds ideal for that, and is easy to implement.
    Delete operations for a Robin Hood hash table require maintaining a histogram of displacements, and the unwind operation is a little tricky.
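
    In case it helps, here is a minimal insert-only sketch of that idea, assuming the full 64-bit hash signature is stored (so a resident’s home slot, and hence its displacement, can be recomputed from the stored value); the names and sizes are illustrative:

        static final int SIZE = 1 << 16;            // power-of-two capacity, no resizing
        static final long[] table = new long[SIZE]; // 0 is reserved to mean "empty slot"

        static int home(long sig) { return (int) (sig >>> 48) & (SIZE - 1); }

        static int displacement(long sig, int slot) { return (slot - home(sig)) & (SIZE - 1); }

        static void insert(long sig) {
            int slot = home(sig), d = 0;
            while (true) {
                long resident = table[slot];
                if (resident == 0) { table[slot] = sig; return; }
                if (resident == sig) { return; }      // signature already stored
                int rd = displacement(resident, slot);
                if (rd < d) {                         // Robin Hood rule: displace the "richer" entry
                    table[slot] = sig;
                    sig = resident;
                    d = rd;
                }
                slot = (slot + 1) & (SIZE - 1);
                d++;
            }
        }

        static boolean mayContain(long sig) {
            int slot = home(sig), d = 0;
            while (true) {
                long resident = table[slot];
                if (resident == sig) { return true; } // hit, or a signature collision
                if (resident == 0 || displacement(resident, slot) < d) { return false; }
                slot = (slot + 1) & (SIZE - 1);
                d++;
            }
        }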

    1. Jim Apple says:

      Yes, exactly. Cuckoo filters store just a signature, as do quotient filters. The former use cuckoo hashing, while the latter use Robin Hood linear probing.

      1. Tobin Baker says:

        I think quotient filters use the Cleary algorithm (bidirectional linear probing combined with Knuth’s “quotienting” trick), but they’re still an “ordered hash table” like Robin Hood.

    2. Tobin Baker says:

      Yeah there are many fingerprint table schemes around, which are theoretically superior to Bloom filters (by a factor of 1/ln 2). However, you can also store just hash codes without losing any information, if you’re hashing integers! Just apply a permutation (aka block cipher) instead of a hash function. Instead of block ciphers, I prefer to use a high-quality mixing function like the Murmur3 finalizer. You can see this approach here: https://github.com/senderista/hashtable-benchmarks.
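
      For reference, the Murmur3 64-bit finalizer mentioned above is just a few invertible steps (xorshifts and multiplications by odd constants), so it is indeed a permutation of the 64-bit integers:

          // Murmur3 fmix64: each step is invertible, so the whole function is a bijection
          static long fmix64(long x) {
              x ^= x >>> 33;
              x *= 0xff51afd7ed558ccdL; // odd constant: multiplication is invertible mod 2^64
              x ^= x >>> 33;
              x *= 0xc4ceb9fe1a85ec53L;
              x ^= x >>> 33;
              return x;
          }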