You had me all the way until immutable. I used a cuckoo filter to speed up checking for existing relationships a while back on https://maxdemarzi.com/2017/07/13/using-a-cuckoo-filter-for-unique-relationships/
Deletions and additions may fail in a Cuckoo filter. Deletions may fail in general and additions will fail probabilistically if you get close to the capacity (say greater than 94%). If you are using it in a dynamic setting, you need a fallback (possibly a reconstruction) and you need tests. This can be abstracted away, but it requires some engineering. And, of course, if the data structure is dynamic, you need concurrency, possibly via locking. That cannot be safely omitted.
David Andersen says:
I’m confused. Deletions work if you know the item is in the set. It’s not safe to delete without that surety, of course.
(Not in response to the thread): Cool stuff! We quite wished there was a practical implementation of Bloomier filters when designing our setsep data structures. It would be fun to see if xor filters could be extended to that setting.
Sure, an xor filter could be used for this: as in the Bloomier filter, you need a second (external) array which is mutable. Then, 2 bits of each xor filter table entry are used to compute the index within that second array (as in the BDZ MPHF). It would be good to better understand the exact use case…
The xor filter currently only stores fingerprints. But other data can be stored as well. In the “known password database” use case, you could reserve one bit per entry to say if it’s a very common password or not.
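If it helps, here is a minimal sketch of that idea (this is not the library’s API; the 16-bit entry width, the bit layout, and the parameters are my assumptions): the low 8 bits of each table entry hold the usual fingerprint and bit 8 carries the per-key flag. Because the assigning step can force the xor of the three probed entries to be any value, the flag is recovered the same way the fingerprint is.

// Hypothetical lookup, not the xorfilter package's API.
// Table entries are widened to 16 bits; bit 8 carries one bit of per-key data
// (e.g. "is this a very common password?").
// h0, h1, h2 and fp are assumed to come from the filter's usual hashing step.
func lookupWithFlag(table []uint16, h0, h1, h2 uint32, fp uint8) (member, flag bool) {
	v := table[h0] ^ table[h1] ^ table[h2]
	return uint8(v) == fp, v&0x100 != 0
}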
Yura says:
“Deletions may fail in general”
Which way?
A Cuckoo filter could even contain duplicates (a small number of them).
Therefore, if you delete an element that was certainly added to the cuckoo filter, the deletion will not fail.
Damien Carol says:
It is not a problem for data structures in file formats like ORC (https://orc.apache.org/), which are generated and can’t be modified.
@David and @Yura
If you know what has been added and what has been removed from the set, then you can delete from the cuckoo filter. But the cuckoo filter does not give you this information. You have to keep track of it. In effect, you must have access, somehow, to the true content of the set. And, of course, if you delete content, you don’t recover the memory until you rebuild the filter.
Note that this is an issue with all probabilistic filters. Deletion somehow requires you to keep track of what is in the set, but the filter cannot do that for you. You need to have this information some other way.
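Here is a minimal sketch of that bookkeeping (my own illustration; the wrapper type is hypothetical, and it assumes the Populate and Contains calls mentioned elsewhere in this thread): keep the true set in an exact structure and rebuild the immutable filter when needed.

// Hypothetical wrapper: the exact set is authoritative, the xor filter is a
// rebuildable, approximate view of it.
type approxSet struct {
	exact  map[uint64]struct{} // the true content of the set
	filter *Xor8               // immutable filter, rebuilt on demand
}

func (s *approxSet) Add(key uint64)    { s.exact[key] = struct{}{} }
func (s *approxSet) Remove(key uint64) { delete(s.exact, key) } // filter memory is only reclaimed on rebuild

func (s *approxSet) Rebuild() {
	keys := make([]uint64, 0, len(s.exact))
	for k := range s.exact {
		keys = append(keys, k)
	}
	s.filter = Populate(keys) // assumes the Populate(keys []uint64) *Xor8 signature quoted below
}

func (s *approxSet) MayContain(key uint64) bool { return s.filter.Contains(key) }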
Maciej Bilas says:
Immutability is fine, however the Go implementation forces the consumer to have len(keys) * 8 bytes of memory available to create the filter:
func Populate(keys []uint64) *Xor8
Something like:
func FromIterator(next func() uint64, size int) *Xor8
might make sense as well.
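For what it is worth, here is a naive sketch of the suggested adapter, built on the Populate signature above (FromIterator itself is hypothetical). Note that it still buffers every key before calling Populate, so the 8 bytes per key of construction memory do not go away, which is the point made in the reply below.

// Hypothetical adapter; assumes the Populate(keys []uint64) *Xor8 function above.
func FromIterator(next func() uint64, size int) *Xor8 {
	keys := make([]uint64, 0, size)
	for i := 0; i < size; i++ {
		keys = append(keys, next()) // still materializes len(keys) * 8 bytes
	}
	return Populate(keys)
}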
@Maciej
If you cannot afford 8 bytes per entry, then you are not going to be able to build the filter…
(Update: once constructed, the filter will use very little memory. It is only during construction that you need several bytes per entry.)
If this is true, how is 8 per entry better?
A filter (once constructed) uses about 8 bits per entry, not 8 bytes.
How big is the memory overhead during filter construction? I couldn’t get a good idea of it from my limited skim of the paper.
The memory overhead during construction is implementation dependent. It takes several bytes per entry. Maybe as high as 64 bytes per entry.
Yura says:
Whoa, quite an interesting algorithm.
Yura says:
A couple of questions: what is the FPP for a “two segment” filter instead of a “three segment” one? I suppose it will take more time to build, though, because a successful run of “algorithm 3” is less likely. The code would be simpler if it followed the algorithm as written, i.e. without explicit queue segmentation. Does it run significantly faster with queue segmentation? Thank you for your attention.
The FPP is significantly worse if you use only two segments. I don’t know the formula (though it is easy to find) but it is not practical.
There are many strategies that one can use to build the filter, and yes, there is a lot of room for optimization.
Yura says:
I could suggest a “two segment – quad place” scheme:
side := hash & 1                          // low bit chooses the spacing within each pair
h1 := reduce(hash>>1, blocklength)        // first probe, in the first segment
h2 := h1 + 1 + side                       // second probe, 1 or 2 slots away (usually the same cache line)
if h2 > blocklength { h2 -= blocklength } // wrap around the segment
h3 := reduce(hash>>33, blocklength)       // third probe, hashed independently
h4 := h3 + (2 - side)                     // fourth probe, adjacent to the third
if h4 > blocklength { h4 -= blocklength }
h3 += blocklength                         // shift the second pair into the second segment
h4 += blocklength
When I’ve played with Cuckoo hashes, such a 2×2 construction gave quite comparable results to 3×1 (in terms of “average achievable load factor”), and most of the time it takes only two memory fetches (in terms of cache lines).
And since there will be four places, the FPP should be higher. (But the fingerprint should be taken from other bits to be independent from h1 and h3.)
Yura says:
Nope. I’ve achieved only a 1.35 oversize, which is quite a bit larger than 1.23 🙁
Another interesting thing: the FPP is close to 0.37–0.4% for any Xor8, whether with 2, 3, or 4 points. Looks like it is always 1/2^k = 1/256. Oops, yeah, it cannot be less than that… shame on me.
Allen Downey says:
Very nice!
One small typo: “fall positive”
Laurent Querel says:
How do XOR filters compare with Golomb-coded sets in terms of memory usage and query speed?
Please see Table 3 in the paper. Xor filters are much faster. The memory usage is not better with GCS.
Frank Rehwinkel says:
Very nice algorithm – so short, and it makes the inspired parts easy to find. Very cool how it starts by finding at least one item that has an unencumbered cell of its own, then peels that item out, hopefully unencumbering cells for other items, until, with a little luck, they can all be pulled out. And then they are finalized in reverse order, because that guarantees they each have an unencumbered hash location that can take on whatever xor value is necessary when it’s their turn to store a fingerprint.
I haven’t run a lot of simulations yet but I was surprised that each random set I tried was solved with the very first seed that was tried. The code is there to keep trying with more seeds but I wonder what it would take for that to be necessary. If there were a need to try many seeds – the algorithm could easily be modified to make the lengths slightly longer each time too. For that matter, for a list that was going to be preserved for a very long time, there could be a derivation that tries with successively shorter sets.
Maybe add a comment in the code around the splitmix64 function noting that it has a period of 2^64, which is good since you don’t want the hash to create duplicates – something the algorithm doesn’t detect and doesn’t resolve (see the splitmix64 sketch after this comment).
Timings are very nice too. Just a few cycles for running Contains, regardless of the number of keys. And tens of cycles per key when building until the number of keys got to 10M at which point it was still just about 150ns per key. I could see rebuilding the filter very often when the cost is so low.
Thanks for providing a golang version. Made the algorithm very easy to follow.
It’s also interesting that doubling the size of the fingerprint from a byte to a half word doesn’t increase the number of items the set would hold – it just reduces the chance of a false positive from 1 in 256 to 1 in 64K.
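For reference, here is the splitmix64 step mentioned above, written as a sketch from the commonly published constants (the repository’s exact code may differ in details). Each call advances the 64-bit state by a fixed odd constant and applies an invertible finalizer, so no output repeats within a period of 2^64.

func splitmix64(state *uint64) uint64 {
	*state += 0x9E3779B97F4A7C15 // golden-ratio increment; steps through all 2^64 states
	z := *state
	z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9
	z = (z ^ (z >> 27)) * 0x94D049BB133111EB
	return z ^ (z >> 31) // invertible finalizer: distinct states give distinct outputs
}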
Jurniss says:
In the third paragraph, the following sentence seems incomplete:
“For example, what if you had a tiny data structure that could reliably tell you that either your password is not in the list of compromised password for sure?”
It looks like you meant an “either-or” statement but only covered the case where none of the hash function bits are set. I guess it should be something like: “… either your password might be in the list, or your password is not in the list…”
prataprc says:
The golang version is quite concise and easy to port to rust-lang.
I am using the croaring bitmap for my indexing library, which already works well. A quick comparison between xorfilter and croaring, on a MacBook Pro, gives a 2-3x CPU-performance improvement over croaring.
Ref: https://github.com/bnclabs/xorfilter
Also did a rust port from the go version, and tried to make it a bit more idiomatic – but you beat me to it 😀 https://github.com/Polochon-street/rustxorfilter
In the paper, the membership test algorithm defined in section 3.1 evaluates h_0(x) given an x from U, but in the previous section, h_0 was defined as a function on S, which is only a subset of U. What am I overlooking? 🙂
Good catch.
I’ve found a variant of the Xor8+ algorithm that puts the first two lookups in a single cache line. BitsPerValue is almost the same as with the original Xor8+ variant. (But for the non-plus variant it is larger, since it requires 1.28x fingerprints vs 1.23x fingerprints.)
Idea: put the first two points into the same cache line in a single segment, and the third point in a second segment with rank9.
I’ve tried “2 equal segments” and “first segment is twice as large”, and “equal segments” seems to give a lower BitsPerValue.
The changes are demonstrated in a Go variant (without the actual Rank9 construction, just with statistics calculation): https://github.com/FastFilter/xorfilter/pull/11/commits/c7b541fe86297e75073e0d193173feb7a2dd5180
https://github.com/FastFilter/xorfilter/pull/11
Also, https://github.com/FastFilter/xorfilter/pull/10 fixes the Go implementation to be closer to the intended Xor8+ implementation.
The rust implementation has a failure in release mode due to the following assertion:
let fpp = matches * 100.0 / (falsesize as f64);
println!("false positive rate {}%", fpp);
assert!(fpp < 0.40, "fpp({}) >= 0.40", fpp);
Below is the output.
---- tests::test_basic2 stdout ----
seed 5965087604022038976
bits per entry 9.864 bits
false positive rate 0.4027%
thread 'tests::test_basic2' panicked at 'fpp(0.4027) >= 0.40', src\lib.rs:454:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Should I relax the “false-positive-rate” to < 0.50?
Ref: https://github.com/bnclabs/xorfilter/pull/7
Thanks,
Have you tried increasing the size of the test? I suspect that you might have a poor estimation of the false-positive rate.
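As a rough illustration of why a small test gives a noisy estimate (my own numbers; the test’s actual sample size is not shown here): with an expected rate of about 1/256 ≈ 0.39%, the standard error of an estimate from n random non-keys is sqrt(p(1-p)/n).

package main

import (
	"fmt"
	"math"
)

func main() {
	p := 1.0 / 256.0 // expected false-positive probability of an 8-bit fingerprint
	for _, n := range []float64{1e4, 1e6, 1e8} {
		stderr := math.Sqrt(p * (1 - p) / n)
		fmt.Printf("n=%.0e: expect %.4f%% +/- %.4f%%\n", n, 100*p, 100*stderr)
	}
}

At ten thousand test queries, a reading of 0.4027% is well within one standard error of the expected value.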
prataprc says:
I increased the test size and the false-positive rate consistently stays below 0.40%. Thanks,
prataprc says:
I tried to document the sizing requirements for the rust implementation of xorfilter here. I hope it is correct. In that case, to index 1 billion keys we may need as much as a 50GB memory footprint? Any suggestions on how to handle DGM (Disk-Greater-than-Memory) scenarios?
I think you are computing an upper bound on the memory usage, not a lower bound.
prataprc says:
Yes, I want to cover the worst-case scenario that can lead to an OOM or a memory crunch. Even without the working array, the footprint of the finger_print array alone can go beyond 9GB when we try to index 1 billion items.
Using mmap for the xor filter’s memory requirements is one idea. I would like to know of better alternatives.
I will take the discussion over to your issue.
But I think it is important to establish the context. Just storing 1 billion integers in a hash table in C++ or Java can easily use 32 GB or more.
Once you start having a billion of anything in memory, gigabytes of memory are required.
But I take your point.
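For context, a back-of-the-envelope calculation using the figures quoted above (about 8 bits per entry once constructed, 1.23x slots, and up to roughly 64 bytes per entry of scratch space during construction); these are my own round numbers, not measurements of any particular implementation:

package main

import "fmt"

func main() {
	const keys = 1e9
	finished := 1.23 * 1 * keys // bytes: one 8-bit fingerprint per slot, 1.23 slots per key
	buildLow := 8 * keys        // bytes: just the input key array
	buildHigh := 64 * keys      // bytes: pessimistic construction overhead
	fmt.Printf("finished Xor8 filter: about %.1f GB\n", finished/1e9)
	fmt.Printf("construction scratch: about %.0f to %.0f GB\n", buildLow/1e9, buildHigh/1e9)
}

This prints roughly 1.2 GB for the finished filter and 8 to 64 GB of temporary space while building it.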
Ole-Hjalmar Kristensen says:
This is actually where a standard Bloom filter is very good, since you need no extra storage for construction, and you can construct it incrementally. My main use is for tracking which hashes are definitely NOT already in content-addressable storage. With a fixed filter size, the ratio of false positives goes up as you insert hashes, but that is not critical in an application like this. A gigabyte of Bloom filter storage goes a very long way.
That said, it looks like the Xor filter would be great for smaller data sets.
I wonder whether, for your use case, you also need to remove entries at some point (e.g. garbage collection)? For Apache Jackrabbit Oak we use content-addressable storage as well, with garbage collection. We currently don’t use an approximate data structure.
Frank says:
I just read through the paper and was a bit confused by the Assigning Step. Is B supposed to be initialized with zeros or random values at that point? In the Construction Step, it seems to still be uninitialized when assign() is called with it.
I have not checked the paper recently, but it actually does not matter. You can start with a table that has garbage in it since all that matters is that XOR be the value you desire and each time you just change one slot. The construction does not need to be deterministic. You have a lot of freedom.
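A minimal sketch of that assigning step (my own illustration, not the library’s code; the peeled record is hypothetical). Processing the peeled keys in reverse order guarantees that each key’s free slot has not been written since the key was peeled, so that slot can always be set to make the xor of the three locations equal the fingerprint, no matter what the table held initially.

type peeled struct {
	fingerprint uint8
	h0, h1, h2  uint32 // the key's three table locations
	freeSlot    uint32 // one of h0, h1, h2: a slot no other remaining key touches
}

func assign(stack []peeled, B []uint8) {
	for i := len(stack) - 1; i >= 0; i-- {
		p := stack[i]
		B[p.freeSlot] = 0 // neutralize whatever was there (zero or garbage)
		B[p.freeSlot] = p.fingerprint ^ B[p.h0] ^ B[p.h1] ^ B[p.h2]
	}
}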
Bob Harris says:
(Having read the paper)
I’m curious, what is the reason for h_0, h_1 and h_2 mapping to non-overlapping intervals of |B|?
Is it simply to make sure h_0(x), h_1(x), h_2(x) have three distinct values? Or is there some property of the random graph that it wouldn’t have if each hash function had the full range of |B| but otherwise avoided duplicates?
(Probably I can find the answer if I follow the Botelho cite).
There is obviously the problem that you are going to get a very small number of collisions if you don’t restrict the hash values, but I do not think that it is the primary concern. My expectation is that you are going to get a higher failure probability without the constraints.
Bob Harris says:
I was thinking of a collision-free hash scheme. E.g. with h_0′(x) in 0..|B|-3, h_1′(x) in 0..|B|-2 and h_2′(x) in 0..|B|-1. h_1′(x) and h_0′(x)+h_1′(x) mod (|B|-1) would be two distinct values in 0..|B|-2. Similarly, adding h_2′(x) to those, mod |B|, gives a hashed triple with no duplicates.
Having now skimmed the Botelho paper, I think their proof depends on the graph being 3-partite. It’s not clear to me, yet, whether the same property (whatever it is that makes the failure probability small) would hold without it being 3-partite, but still collision-free.
And if that were true, I wonder if it would lead to a reduction of the 1.23 factor.
I expect that the answer is negative. You will need a higher (worse) ratio.
But it is not hard to test it out!
Bob Harris says:
Some initial experiments suggest a collision-free triplet hash may be a very, very slight improvement over the tripartite hash.
It seems to reduce the average number of attempts needed to find a suitable graph. But the reduction (if it really exists) is so slight as to not be worth the effort, maybe 1%. And it doesn’t extrapolate to a similar reduction in the capacity expansion factor.
Specifically, I ran five experiments, each with 10K trials on 1K keys. In 4 of the 5 the triplet hash reduced avg attempts (e.g. from 1.116 to 1.105). In the fifth experiment it increased avg by about 0.1%.
Obviously that’s not a thorough experiment. For one thing, the reduction is so small that it could be within the variance for 10K trials. Still, getting a “win” on 4 of 5 attempts seems promising.
I then tried to see if it would allow reduction of that 1.23. But I quickly discovered (as you probably already know) how sensitive the process is to changes in that factor. Reducing that will obviously raise the avg number of attempts. My hope was that changing that to, say, 1.22 might make the new hash scheme comparable to the tripartite hash with 1.23. But that small change raises the avg attempts by about 10% (for both hash schemes). At that point I inferred that I might only get a reduction to something like 1.229 at best, and I gave up.
P.S. my earlier description of the collision-free triplet hash was wrong in some details, but probably demonstrates the concept.