The array backed FastBitSet will work great when n is small enough, in particular when it happens to fit in the L2 cache (or better). Here you use n = 1,000,000 which only requires ~122 KiB of memory (assuming the JavaScript FastBitSet works like the Java one – i.e., doesn’t have any optimization for sparse sets).
What happens for n = 800,000,000 and some m values (that don’t trigger the m * 1024 fallback condition? This should no longer fit in L2 (or even L3 unless you have some extreme type of CPU), and here the FastBitSet approach will probably fall behind (or at least the gap will narrow).
The naive solutions (certainly in Java, and it’s probably not better in JavaScript) are really hurt by the fact that HashSet is frankly a terrible set for integers, memory-wise: each Integer probably has an overhead of 2000% or so (say 80 bytes per 4 byte integer). A simple int[] based hash set would do much better (or there are plenty of good third party libraries too).
Yes, JavaScript FastBitSet works much like Java’s BitSet.
If you look at my code, in the post, you will notice that I do fall back on the regular, hash-based algorithm when a bitset is obviously inefficient. Look again:
See the “1024” factor? Well… 1024 is kind of arbitrary, you can tweak it.
A simple int[] based hash set would do much better (or there are plenty of good third party libraries too).
Using the following script I estimate the memory usage of the JavaScript Set to be around 40 bytes per entry:
for (let key in process.memoryUsage()) {
console.log(`${key} ${Math.round(process.memoryUsage()[key] / 1024 / 1024 * 100) / 100} MB`);
}
var s = new Set();
for (let i = 0; i < 1000000 ; i ++) {
s.add(i);
}
for (let key in process.memoryUsage()) {
console.log(`${key} ${Math.round(process.memoryUsage()[key] / 1024 / 1024 * 100) / 100} MB`);
}
Obviously, 40 bytes is something like an order of magnitude more than needed.
It could be tricky to do better (though not impossible). JavaScript does have programmer-specified typed arrays, but they do not come for free performance-wise (which is odd but there you go). However, JavaScript is smart enough to recognize typed arrays by their usage. So it should be possible to do better…
Travis Downssays:
Right, I saw the fallback code which is why I mentioned “[without triggering] the m * 1024 fallback condition” (let’s call n/m the “density factor”). If we say each integer takes 40 bytes as you found, then the cutover point for memory use would be around a density factor of 320 (since each integer uses 320 bits), not far off of 1024. The “performance” cutover point will generally be higher than the memory one (since even when bitset uses more memory it is simpler and will generally be faster), so perhaps 1024 is a good choice.
My points are more along the lines of:
The BitSet in Java and this one for JavaScipt happen to have a representation that is compact and “hardware sympathetic” compared to the ideal representation (i.e., very close to 1 actual bit used per bit), whereas the use of generic Set for an explicit set of integers tends to have quite a terrible representation, using 10x more memory than needed. So the BitSet approaches have a giant built-in lead in this comparison due to differences gap between actual and ideal implementation. It would be interesting to compare against Set implementations that didn’t suck so much for this purpose.
The memory use of the BitSet approach grows with n, while the Set approach grows with m. So I expect there to be some range where Set is better, but this range is much smaller than it has to be due to (1). In particular, there will be “plateaus” in the performance of each implementation, at different locations, based on the memory-intensity of the algorithm and the cache sizes. These will mostly become interesting when the working set moves out of L2 into L3 since I suspect out-of-order will have a generally easy time hiding even the L2 latency since both algorithms have “unlimited” MLP. As it happens n = 1,000,000 fits nicely into L2 for both algorithms, so it’s a point where expect the BitSet advantage to be relatively higher (i.e., the plateaus overlap here).
A few other notes:
There are really two separate things being tested here: the use of BitSet to store an integer set instead of Set, and the use of the negate() approach to make the problem almost symmetric around m == n/2, rather than very slow when m > n/2. The negate() thing I think is a common approach and also one you can/should apply to any algorithm (indeed, you might take it a step further and remove the brute-force “negate” entirely and propagate the “is negated” state to your consuming code which could often consume the negated set for free or nearly so).
The set representation question, on the other hand, I find a bit deeper and more interesting from a performance point of view. For example, one thing not mentioned is that these algorithms don’t return the same type of object – some return a Set and some a BitSet. Which is more convenient depends on the consumer: if you want to consumer the values in sorted order, BitSet does that “for free” as a consequence of how it is represented. If any old order is fine, however, then the Set representation is probably more efficient (but this is also subject to issue (1) above) since you can just iterate though the already-materialized values in underlying hash versus iterating through the BitSet (here there is probably also a cutover point depending on the density factor).
However, JavaScript is smart enough to recognize typed arrays by their usage. So it should be possible to do better…
Indeed – doing better is just a matter of using an array of integers as the basis for your set – I’m quite sure that under the covers this is exactly what FastBitSet does. I’m not actually 100% sure how typed-arrays and bitwise math works in JavaScipt: internally number are using 8-bit floats – but does this allow you to use all 64-bits of a value as if it were a 64-bit integer to implement BitSet?
In Java it is clear how it works, mostly because you have the explicit distinction between primitive and boxed values. BitSet uses a long[] and so is pretty much the ideal representation (and bulk operations can even be vectorized) – and you have zero objects per “entry”. Set uses an Object[] pointing to Integer objects, so you have 2 objects and about ~54 bytes per entry. Fast integer set implementations use an int[] and have 0 objects and 4 bytes per entry (all numbers are the asymptotic limits – obviously the array itself counts as 1 object, and there are resizing considerations, but these go quickly to 0 on a per-element basis as the size increases).
Don’t take any of this as bashing BitSet – they are a great representation when sets are relative dense. I just want to make the tradeoff clear and point out that the fully optimized cutover point would be different than the “optimized vs not” cutover you might calculate from this post.
Finally, if you really want to do this quickly, I think a much faster approach for most density is as follows:
From your unbiased random bitstream (i.e., your source of randomness like random() or whatever), generate a biased bitstream where each bit is set with probability p’ ~= p where p is m/n, i.e., the probability that any given value will appear in the final set. I.e., you generate the BitSet representation directly, rather than starting with all zeros and setting bits one-by-one. This lets you, for example, generate the BitSet 64 bits (or 128 or 256 with SIMD) at a time, and the generating operation is much simpler and free of branches.
Now you’d have to be pretty lucky (like, really lucky for n = 1,000,000) for this to give you exactly the right m in the end, but if you don’t need exactly m values – but just something close you can stop here: the resultant number of set values (m’) will generally be very close to m (for example, with m=500,000 and n=1,000,000 we have m’ in the range m +/- 1000 about 95% of the time based on my rusty math and an online calculator – so usually within about 0.2% of the expected value).
Now, we should be fair and do apples to apples and get an exact m – so all you have to do is to apply any algorithm discussed in Daniel’s post to correct the selected values (either adding or removing) until we reach the correct m.
For density factors near 0.5, I’m will to bet this is an order of magnitude faster than the above approaches: and it only uses ~1 bit of entropy per value, rather than 32 (or whatever Random.getInt() uses per invocation). It is also very amenable to vectorization and other techniques as well. Note also that it is symmetric: you don’t really need the concept of negation because the behavior is naturally the same at m and (n – m), since we aren’t really distinguishing between chosen and unchosen values.
The performance at other density values is tougher to characterize: you need to generate the biased bit sequence, which is harder when p != 0.5, but still generally fast. In general you can pick a p’ different than p to optimize the overall performance since some p’ values are faster to generate (especially those that have the form 1/(2^x) for small x) and you will ultimately do a corrective step anyways. Generating the biases bit streams is an interesting topic in itself!
I think this approach is bias-free: every value is selected independently and the correction step seems bias-free also. If you are willing to accept small biases you can probably do it even faster.
This approach will be so fast that things like “how is your entropy generated” become important – if you use a slow RNG it will be the limiting factor! It’s also worth noting that the amount of entropy consumed, for the simple/fastest biased bitstream generation methods, depend on p in a “backwards” way: it is smallest near p = 0.5 and gets larger at the extremes, even though at the extremes the amount of information-theoretic entropy in the returned set is much lower (indeed, at m = 1 the other methods need only 1 random number, while this technique would need much more). So again there will be cutover points where this doesn’t make sense, which will depend on how expensive entropy is to generate.
(…) internally number are using 8-bit floats – but does this allow you to use all 64-bits of a value as if it were a 64-bit integer to implement BitSet?
JavaScript supports 32-bit integers in practice.
Generating the biases bit streams is an interesting topic in itself! (…) I think this approach is bias-free: every value is selected independently and the correction step seems bias-free also.
Agreed.
Travis Downssays:
JavaScript supports 32-bit integers in practice.
Right, I think since it represents everything as 64-bit floats, 32-bit integers work because all 32-bit ints are easily representable exactly as a 64-bit float (53-bit mantissa) – but that would imply that bitwise operations, which are conceptually defined on the 32-bit ints which don’t necessarily exist “in memory” would need to unpack from a float representation, do the bitwise math, and then pack back?
So it’s a problem in theory, and perhaps in practice. The optimization mentioned in the comments probably applies mostly to non-escaping variables with local operations, not to arrays which I think usually have to use the 64-bit representation. So I’m curious if/how FastBitSet works around this (or if it does). Not curious enough to test it though since my JavaScript is … weak.
You have to take into account that modern JavaScript engines have a JIT compiler… and this compiler can tell apart floats from ints (from profiling).
My mental model is that current JavaScript engines can do heroic optimizations and JIT compile the code assuming that the numbers are 32-bit integers.
That is, when you create an array and only store 32-bit integers in it… you get roughly the same performance as if you’d used a typed array. (JavaScript has typed array which really are arrays of ints like in Java). That part of the story is something I know for sure!
It is easy enough to have access to the JIT code being generated… but such code is not easy to read.
It might be faster (maybe 4 ns/key) not to use a bit set, but split the range into sub-ranges of 2^n, and then hash integer entries of a sequence (“bit mixing“), with a hash function that is guaranteed to have no collisions. For a 2^64 range, that is:
static long[] createRandomUniqueListFast(int len, int seed) {
long[] list = new long[len];
for (int i = 0; i < len; i++) {
list[i] = hash64(seed + i);
}
return list;
}
static long hash64(long x) {
x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
x = x ^ (x >>> 31);
return x;
}
Interesting approach: so these generators for a given n have a period of n and every value appears exactly once? Do they give up some randomness quality to achieve this behavior (clearly they aren’t uniformly random in the usual sense since once you’ve generated n-1 values, for example, the last value is exactly known)?
You mention “A similar algorithm can be used for smaller n” – how does it work, and can you generate it efficiently at runtime (i.e., can you take an arbitrary n, or do the possible n values need to be known at runtime?).
Yes, they should have a period of n, and every value appears once. The randomness quality is not all that great in my view. Quality could be improved by adding more iterations (one iteration is multiplication + shift), which would make it slower of course. “A similar algorithm can be used for smaller n”: I managed to implement that, but didn’t test the randomness quality (it’s probably quite bad). My pull request is merged in the example code. To improve randomness quality, a different algorithm could be used (eg. XTEA, or some other Feistel cipher).
My point is, it’s not needed to have a set (bit set or hash set). Some might not know this. (I think Daniel Lemire already knows this, and for him it was more important to show how bit sets / hash sets can be made faster.)
Travis Downssays:
Bleh, the blog ate my numbered list (despite the preview showing it OK). When I refer to point (1) and point (2) above, I’m referring to the paragraphs starting “The BitSet in Java and this one…” and “The memory use of the BitSet ” respectively.
The array backed FastBitSet will work great when n is small enough, in particular when it happens to fit in the L2 cache (or better). Here you use n = 1,000,000 which only requires ~122 KiB of memory (assuming the JavaScript FastBitSet works like the Java one – i.e., doesn’t have any optimization for sparse sets).
What happens for n = 800,000,000 and some m values (that don’t trigger the m * 1024 fallback condition? This should no longer fit in L2 (or even L3 unless you have some extreme type of CPU), and here the FastBitSet approach will probably fall behind (or at least the gap will narrow).
The naive solutions (certainly in Java, and it’s probably not better in JavaScript) are really hurt by the fact that HashSet is frankly a terrible set for integers, memory-wise: each Integer probably has an overhead of 2000% or so (say 80 bytes per 4 byte integer). A simple int[] based hash set would do much better (or there are plenty of good third party libraries too).
Yes, JavaScript FastBitSet works much like Java’s BitSet.
If you look at my code, in the post, you will notice that I do fall back on the regular, hash-based algorithm when a bitset is obviously inefficient. Look again:
See the “1024” factor? Well… 1024 is kind of arbitrary, you can tweak it.
Using the following script I estimate the memory usage of the JavaScript Set to be around 40 bytes per entry:
Obviously, 40 bytes is something like an order of magnitude more than needed.
It could be tricky to do better (though not impossible). JavaScript does have programmer-specified typed arrays, but they do not come for free performance-wise (which is odd but there you go). However, JavaScript is smart enough to recognize typed arrays by their usage. So it should be possible to do better…
Right, I saw the fallback code which is why I mentioned “[without triggering] the m * 1024 fallback condition” (let’s call n/m the “density factor”). If we say each integer takes 40 bytes as you found, then the cutover point for memory use would be around a density factor of 320 (since each integer uses 320 bits), not far off of 1024. The “performance” cutover point will generally be higher than the memory one (since even when bitset uses more memory it is simpler and will generally be faster), so perhaps 1024 is a good choice.
My points are more along the lines of:
The BitSet in Java and this one for JavaScipt happen to have a representation that is compact and “hardware sympathetic” compared to the ideal representation (i.e., very close to 1 actual bit used per bit), whereas the use of generic Set for an explicit set of integers tends to have quite a terrible representation, using 10x more memory than needed. So the BitSet approaches have a giant built-in lead in this comparison due to differences gap between actual and ideal implementation. It would be interesting to compare against Set implementations that didn’t suck so much for this purpose.
The memory use of the BitSet approach grows with n, while the Set approach grows with m. So I expect there to be some range where Set is better, but this range is much smaller than it has to be due to (1). In particular, there will be “plateaus” in the performance of each implementation, at different locations, based on the memory-intensity of the algorithm and the cache sizes. These will mostly become interesting when the working set moves out of L2 into L3 since I suspect out-of-order will have a generally easy time hiding even the L2 latency since both algorithms have “unlimited” MLP. As it happens n = 1,000,000 fits nicely into L2 for both algorithms, so it’s a point where expect the BitSet advantage to be relatively higher (i.e., the plateaus overlap here).
A few other notes:
There are really two separate things being tested here: the use of BitSet to store an integer set instead of Set, and the use of the negate() approach to make the problem almost symmetric around m == n/2, rather than very slow when m > n/2. The negate() thing I think is a common approach and also one you can/should apply to any algorithm (indeed, you might take it a step further and remove the brute-force “negate” entirely and propagate the “is negated” state to your consuming code which could often consume the negated set for free or nearly so).
The set representation question, on the other hand, I find a bit deeper and more interesting from a performance point of view. For example, one thing not mentioned is that these algorithms don’t return the same type of object – some return a Set and some a BitSet. Which is more convenient depends on the consumer: if you want to consumer the values in sorted order, BitSet does that “for free” as a consequence of how it is represented. If any old order is fine, however, then the Set representation is probably more efficient (but this is also subject to issue (1) above) since you can just iterate though the already-materialized values in underlying hash versus iterating through the BitSet (here there is probably also a cutover point depending on the density factor).
Indeed – doing better is just a matter of using an array of integers as the basis for your set – I’m quite sure that under the covers this is exactly what FastBitSet does. I’m not actually 100% sure how typed-arrays and bitwise math works in JavaScipt: internally number are using 8-bit floats – but does this allow you to use all 64-bits of a value as if it were a 64-bit integer to implement BitSet?
In Java it is clear how it works, mostly because you have the explicit distinction between primitive and boxed values. BitSet uses a long[] and so is pretty much the ideal representation (and bulk operations can even be vectorized) – and you have zero objects per “entry”. Set uses an Object[] pointing to Integer objects, so you have 2 objects and about ~54 bytes per entry. Fast integer set implementations use an int[] and have 0 objects and 4 bytes per entry (all numbers are the asymptotic limits – obviously the array itself counts as 1 object, and there are resizing considerations, but these go quickly to 0 on a per-element basis as the size increases).
Don’t take any of this as bashing BitSet – they are a great representation when sets are relative dense. I just want to make the tradeoff clear and point out that the fully optimized cutover point would be different than the “optimized vs not” cutover you might calculate from this post.
Finally, if you really want to do this quickly, I think a much faster approach for most density is as follows:
From your unbiased random bitstream (i.e., your source of randomness like random() or whatever), generate a biased bitstream where each bit is set with probability p’ ~= p where p is m/n, i.e., the probability that any given value will appear in the final set. I.e., you generate the BitSet representation directly, rather than starting with all zeros and setting bits one-by-one. This lets you, for example, generate the BitSet 64 bits (or 128 or 256 with SIMD) at a time, and the generating operation is much simpler and free of branches.
Now you’d have to be pretty lucky (like, really lucky for n = 1,000,000) for this to give you exactly the right m in the end, but if you don’t need exactly m values – but just something close you can stop here: the resultant number of set values (m’) will generally be very close to m (for example, with m=500,000 and n=1,000,000 we have m’ in the range m +/- 1000 about 95% of the time based on my rusty math and an online calculator – so usually within about 0.2% of the expected value).
Now, we should be fair and do apples to apples and get an exact m – so all you have to do is to apply any algorithm discussed in Daniel’s post to correct the selected values (either adding or removing) until we reach the correct m.
For density factors near 0.5, I’m will to bet this is an order of magnitude faster than the above approaches: and it only uses ~1 bit of entropy per value, rather than 32 (or whatever Random.getInt() uses per invocation). It is also very amenable to vectorization and other techniques as well. Note also that it is symmetric: you don’t really need the concept of negation because the behavior is naturally the same at m and (n – m), since we aren’t really distinguishing between chosen and unchosen values.
The performance at other density values is tougher to characterize: you need to generate the biased bit sequence, which is harder when p != 0.5, but still generally fast. In general you can pick a p’ different than p to optimize the overall performance since some p’ values are faster to generate (especially those that have the form 1/(2^x) for small x) and you will ultimately do a corrective step anyways. Generating the biases bit streams is an interesting topic in itself!
I think this approach is bias-free: every value is selected independently and the correction step seems bias-free also. If you are willing to accept small biases you can probably do it even faster.
This approach will be so fast that things like “how is your entropy generated” become important – if you use a slow RNG it will be the limiting factor! It’s also worth noting that the amount of entropy consumed, for the simple/fastest biased bitstream generation methods, depend on p in a “backwards” way: it is smallest near p = 0.5 and gets larger at the extremes, even though at the extremes the amount of information-theoretic entropy in the returned set is much lower (indeed, at m = 1 the other methods need only 1 random number, while this technique would need much more). So again there will be cutover points where this doesn’t make sense, which will depend on how expensive entropy is to generate.
(…) internally number are using 8-bit floats – but does this allow you to use all 64-bits of a value as if it were a 64-bit integer to implement BitSet?
JavaScript supports 32-bit integers in practice.
Generating the biases bit streams is an interesting topic in itself! (…) I think this approach is bias-free: every value is selected independently and the correction step seems bias-free also.
Agreed.
Right, I think since it represents everything as 64-bit floats, 32-bit integers work because all 32-bit ints are easily representable exactly as a 64-bit float (53-bit mantissa) – but that would imply that bitwise operations, which are conceptually defined on the 32-bit ints which don’t necessarily exist “in memory” would need to unpack from a float representation, do the bitwise math, and then pack back?
See this answer and the comments, for example:
https://stackoverflow.com/a/11159148/149138
So it’s a problem in theory, and perhaps in practice. The optimization mentioned in the comments probably applies mostly to non-escaping variables with local operations, not to arrays which I think usually have to use the 64-bit representation. So I’m curious if/how FastBitSet works around this (or if it does). Not curious enough to test it though since my JavaScript is … weak.
You have to take into account that modern JavaScript engines have a JIT compiler… and this compiler can tell apart floats from ints (from profiling).
My mental model is that current JavaScript engines can do heroic optimizations and JIT compile the code assuming that the numbers are 32-bit integers.
That is, when you create an array and only store 32-bit integers in it… you get roughly the same performance as if you’d used a typed array. (JavaScript has typed array which really are arrays of ints like in Java). That part of the story is something I know for sure!
It is easy enough to have access to the JIT code being generated… but such code is not easy to read.
JavaScript is fast!
It might be faster (maybe 4 ns/key) not to use a bit set, but split the range into sub-ranges of 2^n, and then hash integer entries of a sequence (“bit mixing“), with a hash function that is guaranteed to have no collisions. For a 2^64 range, that is:
static long[] createRandomUniqueListFast(int len, int seed) {
long[] list = new long[len];
for (int i = 0; i < len; i++) {
list[i] = hash64(seed + i);
}
return list;
}
static long hash64(long x) {
x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
x = x ^ (x >>> 31);
return x;
}
A similar algorithm can be used for smaller n. I used that recently to generate lots of unique random numbers in somewhat sorted order (also split the range into randomly sized sub-ranges, and used the bit mixing trick for small ranges).
Interesting approach: so these generators for a given n have a period of n and every value appears exactly once? Do they give up some randomness quality to achieve this behavior (clearly they aren’t uniformly random in the usual sense since once you’ve generated n-1 values, for example, the last value is exactly known)?
You mention “A similar algorithm can be used for smaller n” – how does it work, and can you generate it efficiently at runtime (i.e., can you take an arbitrary n, or do the possible n values need to be known at runtime?).
Yes, they should have a period of n, and every value appears once. The randomness quality is not all that great in my view. Quality could be improved by adding more iterations (one iteration is multiplication + shift), which would make it slower of course. “A similar algorithm can be used for smaller n”: I managed to implement that, but didn’t test the randomness quality (it’s probably quite bad). My pull request is merged in the example code. To improve randomness quality, a different algorithm could be used (eg. XTEA, or some other Feistel cipher).
My point is, it’s not needed to have a set (bit set or hash set). Some might not know this. (I think Daniel Lemire already knows this, and for him it was more important to show how bit sets / hash sets can be made faster.)
Bleh, the blog ate my numbered list (despite the preview showing it OK). When I refer to point (1) and point (2) above, I’m referring to the paragraphs starting “The BitSet in Java and this one…” and “The memory use of the BitSet ” respectively.