, 6 min read
Word-aligned Bloom filters
Programmers often need to ‘filter out’ data. Suppose that you are given a database of users where only a small percentage are ‘paying customers’ (say 5% or less). You can write an SQL query to check whether a given user is indeed a paying customer, but it might require a round trip to your database engine. Instead, you would like to hold a small ‘filter’ in memory to quickly check whether the given user is a paying customer. When you are pretty sure it might be a paying customer, then, and only then, do you query the database.
Probabilistic filters are ideally suited for such tasks. They return “true” when the key might be found, and “false” when the key definitively cannot be found. A probabilistic filter can be fast and concise.
The conventional probabilistic filter is the Bloom filter. We construct a Bloom filter as follows. We start with an array of bits. Initially, all of the bits are set to 0. When a new value is added to the filter, we map it to several “random” locations in the array of bit. All of the bits at the matching locations are set to 1. The random mapping is done using “hash functions”. A hash function is a function that returns “random-looking” values: it is nevertheless a genuine function. That is, a given hash function always return the same output given the same input.
When querying a Bloom filter, we hash the received value to several locations in the array of bits. If all of the corresponding bits are set to 1, then we have that value might be in the corresponding set. If only just one of these bit values is 0, then we know that the value was not previous added to the set.
A Bloom filter can be remarkably efficient. But can you do better? There are many other probabilistic filters, but let us say that you want to remain close to Bloom filters.
One way to improve on the Bloom filter is to avoid mapping your values all over the array of bits. Instead, you might first map the value to a small block of bits, and then check the bits within a limited range. It can make processing much faster because you do not need multiple data loads in an array. This approach is known under the name of “blocked Bloom filter”. It is much less popular than conventional Bloom filters. One reason why it might be less popular is that blocked Bloom filters require more storage.
I suspect that another reason people might prefer Bloom filters to alternatives such as blocked Bloom filters has to do with implementation. Bloom filters are mature and it is easy to pick up a library. E.g., there is a mature Bloom filter library in Go.
What is the simplest blocked Bloom filter you could do? Instead of doing something fanciful, let us say that the “block” is just a single machine word. On current hardware, a machine word span 64 bits. You have wider registers for SIMD operations, but let us keep things simple. I am going to take my values and map them to a 64-bit word, and I will set a number of bits to 1 within this word, and only within this word. Such a word might be called a fingerprint. Thus, at query time, I will only need to check a single word and compare it with the fingerprint (a bitwise AND followed by a compare). It should be really fast and easy to implement.
Probabilistic filters, as their name implies, have a small probability of getting it wrong: they can claim that a value is in the set while it is not. We call this error margin the “false-positive rate”. You might think that you want this error to be as small as possible, but there is a tradeoff. For a false-positive rate of 1/2p, you need at least p bits per entries. Conventional Bloom filters have an overhead of 44% in the best case scenario, so you actually need 1.44 p bits per entries.
A reasonable choice might be to pick a false-positive rate of 1%. To achieve such a low rate, you need at least 10 bits per entry with a conventional Bloom filter.
It turns out that I can achieve roughly the same false-positive rate while using 2 extra bits per entry for a total of 12 bits per entry. Assume that you take your input values and hash them to a random-looking 64-bit outputs. From such 64-bit ouputs, you can select a location and generate a 64-bit word with 5 bits set. To do so, I can just select 5 bits locations in [0,64), using 6 of the input bits per location. I can create the word using a function such as this one…
std::function<uint64_t(uint64_t)> fingerprint5 = [](uint64_t x) {
return (uint64_t(1) << (x&63))
| (uint64_t(1) << ((x>>6)&63))
| (uint64_t(1) << ((x>>12)&63))
| (uint64_t(1) << ((x>>18)&63))
| (uint64_t(1) << ((x>>24)&63));
};
Though it is in C++, the concept is portable to most languages. You might be concerned that such code could be inefficient. Yet the LLVM clang compiler has no trouble producing code that looks efficient (x64 assembly):
shr rcx, 6
mov eax, 1
shl rax, cl
bts rax, rdx
mov rcx, rdx
shr rcx, 12
bts rax, rcx
mov rcx, rdx
shr rcx, 18
bts rax, rcx
shr rdx, 24
bts rax, rdx
Other compilers could possibly be less efficient. Checking for a match is as simple as the following pseudocode:
uint64_t print = fingerprint(key);
if( (buffer[location(key,buffer.size())] & print) == print ) {
// match
}
I did not write a benchmark yet, but my expectation is that such a word-based Bloom filter would provide excellent performance.
I have published prototypical code on GitHub. In the future, I will provide some benchmarks. There is also a natural extension of this idea where instead of checking a single word, you pick two or more words.
Important: I do not claim that this is a “new” idea. I merely feel that it is an under-appreciated one.