I have arrived at n = floor( log10(integer) + 1 ).
Finding a proof of this formula would be interesting. I guess the derivation of it is the proof itself.
The Python version is, of course, len(str(integer)).
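A direct C rendering of that formula might look like this (a sketch for x > 0; see the caveats in the reply below):

#include <math.h>
#include <stdint.h>

// Naive floating-point digit count: correct only while the double can
// represent x exactly enough, and much slower than the integer tricks below.
static int digit_count_float(uint64_t x) {
    return (int)floor(log10((double)x) + 1.0);
}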
Note that this cannot work for integers more than 2^53 or so, which cannot be represented exactly by IEEE double. The IEEE double-precision format can’t distinguish between 10^17 – 1 = 0x16345785d89ffff and 10^17 = 0x16345785d8a0000. (In fact, it doesn’t work for me for 10^15 – 1, but that depends on the vagaries of the log10() implementation.)
(Also, floating-point logarithms are not quick.)
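A quick demonstration of that rounding (my snippet, not from the comment):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t a = 99999999999999999ULL;  // 10^17 - 1
    uint64_t b = 100000000000000000ULL; // 10^17
    // Both convert to the same IEEE double, so log10 cannot tell them apart.
    printf("%d\n", (double)a == (double)b); // prints 1
    return 0;
}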
Please note that this may give a wrong answer for 0 because the result of __builtin_clz for 0 is undefined. A common fix is to OR __builtin_clz’s argument with 1.
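For instance (an illustrative helper, not from the original comment):

// ORing in the low bit never moves the highest set bit, so the result is
// unchanged for x >= 1, and __builtin_clz now gets a nonzero argument for x == 0.
static int int_log2_safe(uint32_t x) { return 31 - __builtin_clz(x | 1); }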
I found that the assembly of
int digit_count(uint32_t x) {
static uint32_t table[] = {9, 99, 999, 9999, 99999, 999999, 9999999, 99999999, 999999999};
int y = (9 * (31 - __builtin_clz(x|1))) >> 5;
y += x > table[y];
return y + 1;
}
looks slightly faster, because the multiply by 9 can be done using lea instead of imul: https://godbolt.org/z/rGx43MzTs
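The pattern in question is roughly this (a sketch, not the exact godbolt output): 9*x = x + 8*x, which fits in a single address calculation:

lea eax, [rdi + 8*rdi]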
You are correct; it should reduce the latency. I think it works on x64 as well as on other ISAs (including ARM).
Thanks for ruining my Saturday morning. 😛
Here’s a digit count that takes a slightly different approach: Fix up the input to int_log2 so that the rounding doesn’t matter. When compiled with clang 12.0.0, it shaves an extra instruction off. Note that to get good compiled code, I had to add -O3 and change the definition of int_log2 a bit. (Two changes: Move |1 to the caller for more control, and use 31^ instead of 31- so that the compiler could do a better elimination job.)

int int_log2(uint32_t x) { return 31 ^ __builtin_clz(x); }
int digit_count(uint32_t x) {
static uint32_t table[] = {
0, 0, 0, (1<<4) - 10,
0, 0, (1<<7) - 100,
0, 0, (1<<10) - 1000,
0, 0, 0, (1<<14) - 10000,
0, 0, (1<<17) - 100000,
0, 0, (1<<20) - 1000000,
0, 0, 0, (1<<24) - 10000000,
0, 0, (1<<27) - 100000000,
0, 0, (1<<30) - 1000000000,
0, 0,
};
int l2 = int_log2(x | 1); // |1 guards x == 0 (see the note above) and never changes the result otherwise
x += table[l2];
l2 = int_log2(x|1);
int ans = (77*l2)>>8;
return ans + 1;
}
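To make the trick concrete (my own worked example, not part of the original comment), consider the bucket x in [8, 15], where int_log2(x) = 3:

// table[3] = (1 << 4) - 10 = 6, so x + 6 crosses 16 exactly when x crosses 10:
//   x =  9 ->  9 + 6 = 15 -> int_log2 = 3 -> (77 * 3) >> 8 = 0 -> 1 digit
//   x = 10 -> 10 + 6 = 16 -> int_log2 = 4 -> (77 * 4) >> 8 = 1 -> 2 digits
// 77/256 ~= 0.3008 ~= log10(2), mapping the adjusted log2 to digits - 1.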
Perhaps it can be improved further.
I also have a reasonably optimized Go implementation that I’ll probably publish soon. (I’ll leave a note here if/when I do.)
https://godbolt.org/z/W8Wc1bK3M
This is, unfortunately, silly. You’re calling ilog2 twice and using a larger table (one entry per bit rather than one per digit), all to save one conditional increment (e.g. compare, SETcc, add; or subtract, shift-right, add).
It may be many things, but I do not believe it is silly.
On my M1 laptop, it performs incrementally better than the post’s code. To run the entire 32 bit range, the 9*x>>5 approach takes 2.75s; this takes 2.67s.
On my 2016 Intel laptop, it performs worse, but I believe that is because of a compiler shortcoming. When I use inline assembly, thus:
int in_asm(uint32_t x) {
static uint64_t table[] = {16, 14, 12, 246, 240, 224, 3996, 3968, 3840, 64536, 64512, 63488, 61440, 1038576, 1032192, 1015808, 16677216, 16646144, 16515072, 267435456, 267386880, 266338304, 264241152, 4284967296, 4278190080, 4261412864, 68619476736, 68585259008, 68451041280, 1098511627776, 1098437885952, 1097364144128};
int ret;
__asm__ (
"movl %%eax, %%edx;"
"orl $1, %%edx;"
"bsrl %%edx, %%edx;"
"addq (%%rbx,%%rdx,8), %%rax;"
"bsrq %%rax, %%rax;"
"sarl $2, %%eax;"
: "=a" (ret)
: "a" (x), "b" (table)
: "dx"
);
return ret;
}
I get the original code takes 4.09s to process all 32 bit integers, and the asm takes 3.03s. (This isn’t entirely fair, as I made no attempt to hand-optimize the original code. But I think it is enough to establish that this isn’t silly.)
It is unfortunate that BSR has such a significant latency.
Even if this weren’t faster at all, I think it is interesting, because it is a fairly different kind of approach. It stemmed from the observation that the lookup table isn’t doing that much work. I was curious whether the lookup table could eliminate much of the extra calculations, which it turns out it can (no increments, and it lets us use x / 4 instead of 9x / 32).
It looks like AMD processors (Zen+, Zen2 and Zen3) have the lzcnt instruction with a short (1 cycle) latency.
I was surprised at the 3 cycle latency for bsr. I somehow assumed in my head that it was a 1-cycle instruction.
Go package: https://github.com/josharian/log10
There’s more to do on it, but I’ve already wasted too much of the day…
There are a relatively large number of GitHub repositories that fork https://github.com/miloyip/itoa-benchmark. I also spent some time on this challenge. I used a very similar approach: https://github.com/thomasmueller/itoa-benchmark/blob/master/src/tmueller.cpp#L108
int zeros = 64 - __builtin_clzl(v);
int len = (1233 * zeros) >> 12;
uint64_t p10 = POW_10[len];
if (v >= p10) {
len++;
}
Just for fun I would try this approach… use log10(x) = log2(x) / log2(10):
const float Recip_Log2_10 = 1f / log2( 10f );
digit_count( x ) = 1 + (int)( log2(x) * Recip_Log2_10 )
But instead of floating-point math, do it in fixed point?
const int Recip_Log2_10 = (int)( ( 1f / log2( 10f ) ) * ( 1 << 24 ) )
digit_count( x ) = 1 + ( ( log2(x) * Recip_Log2_10 ) >> 24 )
It’s 99% likely total bullshit, or at least two orders of magnitude slower… but that’s me 🙂
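A minimal runnable C sketch of that fixed-point idea (the constant and names here are mine, not the commenter’s; note it still needs the usual power-of-10 fix-up, because truncating log2 to an integer loses information):

#include <stdint.h>

// Q24 fixed-point version of log10(x) ~= ilog2(x) * log10(2), followed by
// one power-of-10 comparison to resolve the remaining off-by-one bucket.
static const uint64_t pow10_table[11] = {
    1, 10, 100, 1000, 10000, 100000, 1000000,
    10000000, 100000000, 1000000000, 10000000000ULL
};

static int digit_count_fixed(uint32_t x) {
    const uint32_t recip_log2_10_q24 = 5050445;  // floor(2^24 * log10(2))
    int lg2 = 31 - __builtin_clz(x | 1);         // integer log2, safe for x = 0
    int d = (int)(((uint64_t)lg2 * recip_log2_10_q24) >> 24); // ~floor(log10(x))
    return d + 1 + (x >= pow10_table[d + 1]);    // exact after one compare
}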
Congrats, you ruined Sunday, too. 🙂
I got it down to 5 instructions! (Well, it should be five, but clang and gcc each emit 6, because each sees an optimization the other missed.)
bsr eax, edi
mov ecx, edi
add rcx, qword ptr [8*rax + j_digit_count.table]
bsr rax, rcx
sar eax, 2
(gcc emits two movs after the first bsr; clang emits xor shr xor instead of sar.)
Code is:
int int_log2(uint32_t x) { return 31 ^ __builtin_clz(x); } // per the earlier comment
int int_log2_64(uint64_t x) { return 63 ^ __builtin_clzll(x); }
int digit_count(uint32_t x) {
static uint64_t table[] = {16, 14, 12, 246, 240, 224, 3996, 3968, 3840, 64536, 64512, 63488, 61440, 1038576, 1032192, 1015808, 16677216, 16646144, 16515072, 267435456, 267386880, 266338304, 264241152, 4284967296, 4278190080, 4261412864, 68619476736, 68585259008, 68451041280, 1098511627776, 1098437885952, 1097364144128};
int lg2 = int_log2(x); // note: assumes x != 0, matching the unguarded bsr in the listing above
uint64_t n = (uint64_t)(x) + table[lg2];
return int_log2_64(n) >> 2;
}
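For intuition (my annotation, not the commenter’s): each entry shifts its log2 bucket into the right base-16 range; where a power of 10 falls inside the bucket, the entry is 16^d - 10^(d-1), so the smallest d-digit number lands exactly on 16^d (e.g. 100 + (4096 - 100) = 16^3), and the base-16 logarithm, int_log2_64(n) >> 2, reads off the digit count directly:

// x =  99: int_log2 = 6,  99 + 3996 = 4095 -> int_log2_64 = 11 -> 11 >> 2 = 2
// x = 100: int_log2 = 6, 100 + 3996 = 4096 -> int_log2_64 = 12 -> 12 >> 2 = 3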
I plan to write up an explanation of how this works on my own blog post soon(ish).
mov ecx, edi is redundant there, and you could probably get rid of it by casting to 32 bits in a strategic place. It’s just a quirk of the ABI that the high bits might contain garbage, but in this case it’s harmless because it’s used as a shift count, which has an implicit mask anyway.

Nevermind, it’s not redundant; it is just a quirk of the x86-64 ABI for passing 32-bit values in a 64-bit register. You could expect it to go away after inlining, or if you made the parameter x a 64-bit value (but the top 32 bits must be zero).
This is really clever, BTW.
Please either ping me back or just add a comment here… so I can edit the blog post to refer to your blog post. (We want people who arrive at my blog post to find yours.)
I see – you’re moving the step in the decimal log to where the step in the binary (actually base-16) log is. This technique should work for any set of ranges where log10 changes by at most 1 in each, i.e. where the end is less than 10 times the start.
If we work with log base 8 for example, we can have one entry for 1-7, the next for 8-63, 64-511, and so on.
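A minimal sketch of that base-8 variant (my own illustration, not KWillets’s code): each bucket [8^k, 8^(k+1)) spans less than a factor of 10, so log10 steps at most once inside it and one threshold per bucket is enough.

#include <stdint.h>

// Digit count via base-8 buckets: bucket k holds ilog2(x)/3, and each bucket
// needs only a base digit count plus one threshold comparison to be exact.
static int digit_count_base8(uint32_t x) {
    static const uint8_t digits[11] = {1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    static const uint64_t threshold[11] = {
        10, 10, 100, 1000, 10000, 100000, 1000000,
        10000000, 100000000, 1000000000, UINT64_MAX // no step in the top bucket
    };
    int k = (31 - __builtin_clz(x | 1)) / 3; // octal bucket index
    return digits[k] + (x >= threshold[k]);
}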
Where the “if (v >= p10)” can result in a slow branch, (9 * l2) >>> 5 might not be best. The following results in fewer cases of “ans++” over the whole integer range (116 million instead of 1.6 billion).
int l2 = 64 - Long.numberOfLeadingZeros(num | 1);
int ans = (21845 * l2) >>> 16;
long p10 = POW_10[ans];
if (num >= p10) {
ans++;
}
return ans;
I expect that the branch gets compiled to a conditional move… so that there are no mispredicted branches… don’t you have the same expectation?
Yes, I also hope a conditional move is possible here. For this case, I assume it’s more important to have a multiply by 5 (or 9, 17, …). Only if a conditional move isn’t possible does reducing the probability of a branch make sense.
Well, if the probability of needing the fixup is low enough, then it would be better to use a branch instead and move this part off of the critical path (a control dependency vs. a data dependency).
This would provide a nice latency improvement at the cost of the occasional mispredict.
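A sketch of the two shapes (illustrative helpers, not code from the thread):

#include <stdint.h>

// Branchless: the compare always executes and sits on the data (latency) path.
static int fixup_branchless(uint32_t v, int len, uint64_t p10) {
    return len + (v >= p10); // typically compiles to setcc/cmov-style code
}

// Branchy: a control dependency instead; free when predicted correctly,
// a misprediction penalty when not, but off the critical path otherwise.
static int fixup_branchy(uint32_t v, int len, uint64_t p10) {
    if (v >= p10) len++;
    return len;
}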
Yes. That’s a good point, but I am not convinced that Thomas’ code has this effect. Maybe I just misunderstand it.
No, I don’t think it does either; I was just sort of mentioning this for completeness, since it did come up in some other approaches I considered.
I don’t think any approach which is purely a function of lzcnt(num) can achieve low correction probability because too much information has been lost in the truncation.
I am sure that there must be a clever way to get this sort of result.
Last night I realized that Josh’s approach can be modified to use a simpler function after the table mapping. More specifically, the table entry has the integer log10 in the upper 32 bits, and the lower 32 generate a carry when the appropriate threshold is reached; e.g. the entry around 100 is (3<<32) - 100. We only need a >>32 to pull out the integer log after the add:
mov eax, edi
bsr rcx, rax
add rax, qword ptr [8*rcx + digit_count(unsigned int)::table]
shr rax, 32
(clang)
Here is the source — pardon my macro:
int int_log2_64(uint64_t x) { return 63 ^ __builtin_clzll(x); }
// this increments the upper 32 bits (log10(T) - 1) when >= T is added
#define K(T) (((sizeof(#T)-1)<<32) - T)
int digit_count(uint32_t x) {
static uint64_t table[] = {
K(0),K(0),K(0),
K(10),K(10),K(10), // 64
K(100),K(100),K(100), // 512
K(1000),K(1000),K(1000), // 4096
K(10000),K(10000),K(10000), // 32k
K(100000),K(100000),K(100000), // 256k
K(1000000),K(1000000),K(1000000), // 2048k
K(10000000),K(10000000),K(10000000), // 16M
K(100000000),K(100000000),K(100000000), // 128M
K(1000000000),K(1000000000),K(1000000000), // 1024M
K(1000000000),K(1000000000) // 4B
};
int lg2 = int_log2_64(x); // note: x == 0 would hit __builtin_clzll(0), which is undefined
uint64_t n = (uint64_t)(x) + table[lg2];
return n >> 32;
}
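A worked check of one bucket (my annotation): for x in [64, 127], lg2 is 6 and the entry is K(100) = (3<<32) - 100, so:

//   x =  99:  99 + (3<<32) - 100 = (3<<32) - 1 -> upper word 2 -> 2 digits
//   x = 100: 100 + (3<<32) - 100 = (3<<32)     -> upper word 3 -> 3 digits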
Oh. I had already reimplemented it. The idea is simple enough (though brilliant).
Blog post coming up.
Blog post at https://lemire.me/blog/2021/06/03/computing-the-number-of-digits-of-an-integer-even-faster/
My code differs slightly.
I am wondering if a fast solution exists for uint64_t numbers.
I am also interested in whether there is a fast solution that counts the digits with over-estimation. What I mean is: say the input is 12345; then the output can be 5, 6, 7, or even 20.
This is very handy if you have a number that you want to convert to a string, but you need to know in advance how many bytes of memory to allocate for the string. Think about thousands of strings.
I thought I could calculate this using octal numbers; however, it is not as easy as it may sound.
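A hedged sketch of such an over-estimate for uint64_t (my own; 1233/4096 is slightly below log10(2), but close enough that the bound still holds for bit lengths up to 64):

#include <stdint.h>

// Upper bound on the decimal digit count of x: never too small, and at most
// one too large, which is exactly what pre-sizing string buffers needs.
static int digit_count_upper_bound(uint64_t x) {
    int bits = 64 - __builtin_clzll(x | 1); // bit length, safe for x = 0
    return ((bits * 1233) >> 12) + 1;       // floor(bits * log10(2)) + 1
}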