Let me give Josh Bleecher Snyder credit for the idea that int_log2 splits the input into ranges where log10 can increase by at most 1. Within each such range, the problem reduces to comparing against a table-derived threshold and incrementing if greater.
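For concreteness, here is a minimal sketch of that scheme, along the lines of the function discussed in the post (the table holds the largest d-digit values; treat this as my reconstruction rather than the canonical code):

#include <stdint.h>

// int_log2 buckets x into ranges where the digit count increases at most once.
static int int_log2(uint32_t x) { return 31 - __builtin_clz(x | 1); }

static int digit_count(uint32_t x) {
  // table[d] is the largest (d+1)-digit value.
  static const uint32_t table[] = {9,      99,      999,      9999,     99999,
                                   999999, 9999999, 99999999, 999999999};
  int y = (9 * int_log2(x)) >> 5; // safe underestimate of floor(log10(x))
  y += x > table[y];              // threshold compare: adds at most 1
  return y + 1;
}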
foobar says:
Just a quick idea: wouldn’t zeroing the three bottom bits of the int_log2 result and indexing the array with it (without the multiplier 8) also make a smaller table possible? Sure, this would increase latency by one clock cycle, but just in case the size of the table annoys somebody…
foobar says:
Err, not quite. It would probably be sensible to try out those ideas before writing a comment…
Relatedly, see this suggestion by Pete Cawley: https://mobile.twitter.com/i/timeline
The comment in question is: “As a tiny optimisation to make the tables three entries smaller: within int_log2, replace |1 with |8 – although then relying on the C compiler to rewrite table[int_log2(x)-3] as (table-3)[int_log2(x)] and form table-3 as a constant.”
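Here is a sketch of how I read that trick, applied to a table indexed directly by int_log2. The table construction below is my own (built at startup, so call init_table() first); the post uses a precomputed table:

#include <stdint.h>

static uint64_t table[32]; // indexed by int_log2(x); entries 0..2 become dead

static void init_table(void) {
  for (int k = 0; k < 32; k++) {
    // d = digit count of 2^k; p = smallest power of 10 above 2^k
    int d = 1;
    uint64_t p = 10;
    while (p <= ((uint64_t)1 << k)) { p *= 10; d++; }
    // choose table[k] so that (x + table[k]) >> 32 is the digit count
    // for every x in [2^k, 2^(k+1))
    table[k] = (p <= ((uint64_t)1 << 32)) ? (((uint64_t)(d + 1) << 32) - p)
                                          : ((uint64_t)d << 32);
  }
}

static int int_log2(uint32_t x) {
  return 31 - __builtin_clz(x | 8); // | 8 instead of | 1: result is always >= 3
}

static int digit_count(uint32_t x) {
  // Since the index is now always >= 3, entries 0..2 are never read: a
  // 29-entry table accessed as small_table[int_log2(x) - 3] works as well,
  // and a compiler can fold it into (small_table - 3)[int_log2(x)].
  return (int)((x + table[int_log2(x)]) >> 32);
}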
Josh Bleecher Snyder says:
Very nice! Three cheers for collaboration. 🙂
Thomas Müller Graf says:
I wonder if this very nice trick could be used for other problems, e.g. fast square root, especially the integer square root.
Tushar Bali says:
exciting stuff
George Spelvin says:
I still prefer the first solution because it extends to 64 bits naturally. The second solution can be considered a variant of the first where:
* The high half of the 9*bits/32 approximation is replaced by a table lookup,
* The threshold table lookup is indexed by number of bits, not number of digits (making it larger by a factor of log2(10) = 3.32), and
* The threshold comparison is done in the (semantically equivalent) form of an add with carry to the low half of the table.
Bloating the table by a factor of 6.64 (for 32 bits, from 11×4 = 44 bytes to 32×8 = 256 bytes, or +3 cache lines) doesn’t seem like a good use of L1 cache, especially as any digit-counting function is going to be part of some much larger output-formatting code and not an inner loop by itself.
There’s one more trick that hasn’t shown up yet and might be useful. Given a safe underestimate such as
y = 9*int_log2(x) / 32;
then x >> y can be used to compute the number of digits without losing accuracy; the low digit_count(x) bits of x don’t affect the result (because 10 is a multiple of 2).
This might enable people to fit the necessary low and high parts of each table entry into a single word.
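To make the divisibility argument concrete, here is a tiny self-contained check (my example, not code from the thread): since 2^y divides 10^k whenever y ≤ k, shifting both sides right by y preserves the threshold comparison.

#include <assert.h>
#include <stdint.h>

int main(void) {
  uint32_t x = 99999;          // int_log2(99999) = 16
  int y = (9 * 16) >> 5;       // safe underestimate: y = 4
  uint32_t threshold = 100000; // 10^5, divisible by 2^4
  // the low y bits of x cannot flip the comparison
  assert((x >= threshold) == ((x >> y) >= (threshold >> y)));
  return 0;
}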
jk-jeon says:
Okay, I actually tried George Spelvin’s idea and got this: https://godbolt.org/z/3h971c8K8.
So the difference between the two implementations is that for the “fused” table lookup, we have
shrx eax, edi, eax
add eax, DWORD PTR table[0+rdx*4]
shr eax, 26
and for the normal method explained in Prof. Lemire’s previous blog post, we have
cmp DWORD PTR table[0+rdx*4], edi
adc eax, 1
I believe the first one might be very slightly faster, as the adc instruction is usually a bit slower than add. (But really, there should be no real difference.) I tested through quick-bench.com, and it seems that GCC slightly favors the first one and Clang slightly favors the second, but I think this is unfair, since quick-bench.com does not let me specify the architecture flag; thus it generates shr eax, cl instead of shrx eax, edi, eax, which I believe is usually a bit slower.
KWillets says:
I was able to get the table down to 32-bit entries by using lg2/4 as the shift — table[lg2] << lg2/4 gives the original 64-bit entry.
The arithmetic after that still has to be wider than the argument, i.e., 64-bit would need uint128_t. I’m trying to shift the argument right instead, but I have a few edge cases.
FYI, here is an application of the idea presented in this post to the 64-bit case: https://godbolt.org/z/e76eEKKj3
// excerpt: the tail of generate_digit_count_table(); the declarations of
// table, pow10 and ceil_log10_pow2 are elided here (see the godbolt link)
for (int i = 0; i < 64; ++i) {
    auto const ub = std::uint64_t(ceil_log10_pow2(i));
    assert(ub <= 19);
    table.entry[i] = ((ub + 1) << 52) - (pow10[ub] >> (i / 4));
}
return table;
}

constexpr inline auto digit_count_table = generate_digit_count_table();

int floor_log2(std::uint64_t n) noexcept {
    return 63 ^ __builtin_clzll(n);
}

int count_digit(std::uint64_t n) noexcept {
    return int((digit_count_table.entry[floor_log2(n)]
                + (n >> (floor_log2(n) / 4))) >> 52);
}
Thomas Müller Graf says:
Just a crazy idea… As far as I understand, Intel CPUs have (limited) BCD support. I guess using it just to calculate the number of decimal digits is not all that fast (would need to be tested). But would “itoa” be faster that way?
Thomas Müller Graf says:
I just tested this approach, and on my machine it is terribly slow. But it works 🙂
int slow_digit_count(uint32_t x) {
    double d = x;
    uint64_t y = 0;
    asm("fbstp %0" : "=m" (y) : "t" (d) : "st");
    return (67 - __builtin_clzll(y | 1)) >> 2;
}
I doubt that such an “itoa” method would be fast.
Just FYI, I implemented a simple version of this “use BCD opcode” idea in the itoa-benchmark, and this seems to be about 10 times slower than the fastest implementation of itoa. It probably is one of the shortest (in number of CPU instructions), but the slowest.
t_inline void uint32toa_bcd(uint64_t x, char* out) {
    double d = x;
    asm("fbstp %0" : "=m" (out[0]) : "t" (d) : "st");
    uint64_t y;
    memcpy(&y, out, sizeof(y));
    int len = (67 - __builtin_clzll(y | 1)) >> 2;
    out[len] = 0;
    while (len > 0) {
        out[--len] = '0' + (y & 0xf);
        y >>= 4;
    }
}
Another fun way to get an approximate log10 is to put the increments in a bitmap (1 means log10 goes up in that range), and use CLZ and a shift to discard the bits above lg2. The popcount of the remaining bits is the approximate log10, similar to 9/32*lg2.
I don’t think this is faster than multiply and shift; the only possible time difference is multiply vs. popcount.
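Here is one way the bitmap variant could look for 32-bit inputs (a sketch under my reading of the description; the constant marks the positions where floor(log10) steps up):

#include <stdint.h>

// Bit i is set when a power of 10 lands in (2^(i-1), 2^i], i.e. where
// floor(log10) increases: i = 4, 7, 10, 14, 17, 20, 24, 27, 30.
static const uint32_t LOG10_STEPS =
    (1u << 4) | (1u << 7) | (1u << 10) | (1u << 14) | (1u << 17) |
    (1u << 20) | (1u << 24) | (1u << 27) | (1u << 30);

// Safe underestimate of floor(log10(x)), playing the same role as 9*lg2/32;
// the usual one-step threshold fix-up still applies afterwards.
static int approx_log10(uint32_t x) {
  int lg2 = 31 - __builtin_clz(x | 1);   // floor(log2(x))
  return __builtin_popcount(LOG10_STEPS & (0xFFFFFFFFu >> (31 - lg2)));
}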
JB says:
I would observe that the definition of the lookup table is missing a “const” qualifier.
I would also want to look at where the table is placed in memory by the loader: if it’s in a DATA section, it is necessarily going to live in a separate Read-Only page, and “table[idx]” is a memory reference that might go through a page fault.
Since for 32 bits the table has only a few entries, changing the lookup table to a switch statement might in fact be faster (if only because the search in the table would in fact only reference memory that is already in I-cache). This would need to be benchmarked, obviously.
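For illustration, a hedged sketch of what that switch variant might look like (my construction, reusing the 9*lg2/32 estimate; whether the compiler keeps it out of data pages would need to be verified):

#include <stdint.h>

static int digit_count_switch(uint32_t x) {
  int lg2 = 31 - __builtin_clz(x | 1);
  switch ((9 * lg2) >> 5) {   // underestimate of floor(log10(x)), 0..8
    case 0: return 1 + (x > 9);
    case 1: return 2 + (x > 99);
    case 2: return 3 + (x > 999);
    case 3: return 4 + (x > 9999);
    case 4: return 5 + (x > 99999);
    case 5: return 6 + (x > 999999);
    case 6: return 7 + (x > 9999999);
    case 7: return 8 + (x > 99999999);
    default: return 9 + (x > 999999999);
  }
}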
“I would also want to look at where the table is placed in memory by the loader: if it’s in a DATA section, it is necessarily going to live in a separate Read-Only page, and ‘table[idx]’ is a memory reference that might go through a page fault.”
Do you know of an optimizing C compiler that would build the code in such a manner? That would seem incredibly wasteful.
“Since for 32 bits the table has only a few entries, changing the lookup table to a switch statement might in fact be faster (if only because the search in the table would in fact only reference memory that is already in I-cache).”
I expect most optimizing compilers to turn a 32-entry switch/case into a table lookup. There might be exceptions, evidently, but my experience has been that compilers are not shy about using tables to implement switch/case code.
camel-cdr says:
I found that a naive approach (without a lookup table) actually beats all of the more advanced ones presented here, see https://godbolt.org/z/5W4xfqjGj:
static inline int
bh_ilog10_32_naive(uint32_t x)
{
    return (x < 10 ? 1 :
           (x < 100 ? 2 :
           (x < 1000 ? 3 :
           (x < 10000 ? 4 :
           (x < 100000 ? 5 :
           (x < 1000000 ? 6 :
           (x < 10000000 ? 7 :
           (x < 100000000 ? 8 :
           (x < 1000000000 ? 9 :
           10)))))))));
}
Result from local runs:
bh_ilog10_32_naive: 1.000000e+00 +/- 1e-05
bh_ilog10_32d: 1.452496e+00 +/- 5e-06
bh_ilog10_32b: 1.691123e+00 +/- 1e-05
bh_ilog10_32c: 2.195664e+00 +/- 3e-05
bh_ilog10_32a: 2.212707e+00 +/- 3e-05
I’m guessing that this is due to memory accesses being expensive. Maybe I’ve done something wrong with the benchmark; please correct me if something is wrong.
camel-cdr says:
Actually, I’ve put the relevant benchmark into a quick-bench version, which should be easier to understand than my benchmarking library:
https://www.quick-bench.com/q/QZh-vs3H77D7hDvJL8Vn3FJNPj4
This still has the same result.
Ok, I think I get it now.
I generate high entropy random numbers and thus numbers are likely very large, which means that the branch predictor can be quite accurate.
If I change the random number generation to
1 << (bench_hash64(i)&63);
the naive approach is now much slower.
I assume that the usual use case doesn’t call the function repeatedly, so the branch predictor will have a harder time, although this should probably be benchmarked with a real world use case.
There are definitely cases where the number of digits is predictable, and in such cases a branchy approach will be faster. Your scenario is one of those.
CB says:
A classic case of benchmarking your benchmark. The majority of “benchmarks” I see online fall into that category.
Kas says:
I have this idea but can’t figure it out yet. I am trying to halve the table, or remove it completely. This may be possible because, by shifting the index of the highest bit to the right by 2 (division by 4), we end up with a value less than 16. The problem is the repeated values (0, 4, 9, 14, …): if there is a mathematical trick to differentiate these repeated values, we can either shrink the table to 15-16 items, or remove the table completely by building a comparison value from our new index. If a cheap trick can be found (something like shifting the index by one and then subtracting again), the algorithm would be both table-less and branchless.
Found a cheap one, but with multiplication; the last column is
Log2(x) * 37 >> 7
and here is the result (each row: x : digit count, Log2(x), a shifted copy of Log2(x), and Log2(x) * 37 >> 7):
1 : 1 0 0 0
10 : 2 3 1 0
100 : 3 6 3 1
1000 : 4 9 4 2
10000 : 5 13 6 3
100000 : 6 16 8 4
1000000 : 7 19 9 5
10000000 : 8 23 11 6
100000000 : 9 26 13 7
1000000000 : 10 29 14 8
10000000000 : 11 33 16 9
100000000000 : 12 36 18 10
1000000000000 : 13 39 19 11
10000000000000 : 14 43 21 12
100000000000000 : 15 46 23 13
1000000000000000 : 16 49 24 14
10000000000000000 : 17 53 26 15
100000000000000000 : 18 56 28 16
1000000000000000000 : 19 59 29 17
9 : 1 3 0 0
99 : 2 6 1 1
999 : 3 9 2 2
9999 : 4 13 3 3
99999 : 5 16 4 4
999999 : 6 19 4 5
9999999 : 7 23 5 6
99999999 : 8 26 6 7
999999999 : 9 29 7 8
9999999999 : 10 33 8 9
99999999999 : 11 36 9 10
999999999999 : 12 39 9 11
9999999999999 : 13 43 10 12
99999999999999 : 14 46 11 13
999999999999999 : 15 49 12 14
9999999999999999 : 16 53 13 15
99999999999999999 : 17 56 14 16
999999999999999999 : 18 59 14 17
Now the table can be only 18 entries. The multiplication above can, I think, be replaced with a better approach, but that is not my main target; the target is to remove the table entirely, by one of the following:
1) Can we calculate 10^n (for 100… or 999…) in a few cycles? Would it be worth it to remove the memory access?
2) The fact that decimal 10 occupies 4 bits means 10^n is always 3 or 4 bits away from 10^(n+1) and 10^(n-1), which means we have 2 bits to maneuver in: x is greater than or equal to 2^(Log10(x)) yet smaller than 2^(Log10(x)+1) (this comes from x/10 < x/8). So by changing the simple compare-against-a-value into a compare against a rank of neighbors, we can guess the rank (length) of x.
3) We don’t need a fully proven mathematical theory for every n; all we need is magic numbers, just like the multiplication by 37. So these 18 values only have to be calculated or approximated to fit one single usage profile. Looking at the bits of these values (10, 100, … and 9, 99, 999, …), we see a nice pattern where, in the bits below the highest one, each value is bigger than the previous one, except for one single case: between entries 15-16 in the 99xx… table (or 16-17 in the 10xx… table). Trying to apply the same logic as in case (2) above, the following holds for all but that one case: x - 2^(Log2(x)) > 10^(Log10(x)-1). Is there a way to use this?
I was able to simplify the int_log2 function to just counting leading zero bits, by reversing the table and adding an additional table value for 0:
https://github.com/plokhotnyuk/jsoniter-scala/pull/749/files#diff-4e32fbdaa814b6b0b8e22a1b1440bb2928e50a6e1482e0d56c08d606812b625bR1982
Thanks for the link.
Here is a table of length 65 that is used to calculate the digit count for numbers of up to 58 bits, using the following function:
def f(x: Long): Int = (offsets(java.lang.Long.numberOfLeadingZeros(x)) + x >> 58).toInt