The Linux code also implements division-by-constant using multiplication manually. While most compilers these days know how to optimize divide by constant to a multiply and shift, they usually can’t infer the limited ranges of the inputs which allows smaller multipliers and no fixups.
aqritsays:
16-bit numbers need only one multiply per digit?
void lulz_atoi(char* str, uint16_t val) {
uint64_t lo = val;
uint64_t hi;
__uint128_t x = (__uint128_t)lo * ((0xFFFFFFFFFFFFFFFFULL / 10000) + 1);
hi = x >> 64;
lo = (uint64_t)x;
str[0] = hi + 0x30;
for (int i = 1; i > 64;
lo = (uint64_t)x;
Updated gist to 64-bits. I’ve not checked the generated assembly. Not benchmarked against the other implementations because a uint64_t should have 20 decimal digits…
Lance Richardsonsays:
ICX AVX2 numbers look pretty nice, although “-march=native” was needed to get all three SIMD versions.
# make
c++ -O3 -march=native -Wall -Wextra -std=c++17 -o convert convert.cpp
# ./convert
khuong 7.15067
backlinear 30.8381
linear 21.6466
tree 14.2078
treetst 10.0964
treest 6.32523
treebt 2.30575
sse2 4.81692
sse2(2) 4.70681
avx2 2.00375
khuong 7.15235
backlinear 30.8286
linear 21.6496
tree 14.2603
treetst 10.0969
treest 6.32516
treebt 2.30584
sse2 4.81617
sse2(2) 4.70639
avx2 2.00354
khuong 7.15085
backlinear 30.8319
linear 21.6403
tree 14.2665
treetst 10.0935
treest 6.32525
treebt 2.30579
sse2 4.81627
sse2(2) 4.70653
avx2 2.00359
Lance Richardsonsays:
ICX AVX2 numbers look pretty nice, although “-march=native” was needed to get all three SIMD versions.
# make
c++ -O3 -march=native -Wall -Wextra -std=c++17 -o convert convert.cpp
# ./convert
khuong 7.15067
backlinear 30.8381
linear 21.6466
tree 14.2078
treetst 10.0964
treest 6.32523
treebt 2.30575
sse2 4.81692
sse2(2) 4.70681
avx2 2.00375
khuong 7.15235
backlinear 30.8286
linear 21.6496
tree 14.2603
treetst 10.0969
treest 6.32516
treebt 2.30584
sse2 4.81617
sse2(2) 4.70639
avx2 2.00354
khuong 7.15085
backlinear 30.8319
linear 21.6403
tree 14.2665
treetst 10.0935
treest 6.32525
treebt 2.30579
sse2 4.81627
sse2(2) 4.70653
avx2 2.00359
Are approaches that only work for streams of many integers of interest? I’m calculating a base-10 checksum of all 10 digits of a u32 in <1ns per u32 with AVX2. This includes an intermediate step in which the length of the encoded string (without leading zeros) is calculated, but of course it does not include outputting the string.
The layout of the digits within the vectors is pretty bad for conversion-to-string purposes: the digits for a single integer are spread out between 2 adjacent bytes of 5 different vectors. It seems like it would be not so expensive to fix the layout to be friendlier for outputting, especially if the task was just to output u32 of at most 8 digits or u64 of at most 16 digits. I don't know a clever way to output u64 of 20 digits though, I would just have to add another scalar modmul at the beginning to split the u64 into groups of (4,8,8) digit.
sasuke420says:
To compute the length, if you imagine you have the groups of 4 digits in 4 vectors, you can take max(andnot(digits == 0, digitcount))+1 to get the length where digitcount is a constant:
[3,2,1,0]
[7,6,5,4]
[11,10,9,8]
[15,14,13,12]
and max() is actually a sequence of 3 max, shuffle, max, shuffle, max. After that you can use arithmetic to calculate the proper shuffle to move the digits into the front of the vector so you can output a single formatted integer. You end up with a bunch of data dependencies but no branches. The data dependencies will surely rain on your parade though.
uint64_t bottombottom = top % 10000;
it’s supposed to be
uint64_t bottombottom = bottom % 10000;
Fixed.
Another nifty technique which allows 64-bit conversion without 64-bit arithmetic is Douglas W. Jones’ technique described at http://homepage.divms.uiowa.edu/~jones/bcd/decimal.html and implemented in e.g. https://elixir.bootlin.com/linux/latest/source/lib/vsprintf.c#L325.
The Linux code also implements division-by-constant using multiplication manually. While most compilers these days know how to optimize divide by constant to a multiply and shift, they usually can’t infer the limited ranges of the inputs which allows smaller multipliers and no fixups.
16-bit numbers need only one multiply per digit?
void lulz_atoi(char* str, uint16_t val) {
uint64_t lo = val;
uint64_t hi;
__uint128_t x = (__uint128_t)lo * ((0xFFFFFFFFFFFFFFFFULL / 10000) + 1);
hi = x >> 64;
lo = (uint64_t)x;
str[0] = hi + 0x30;
for (int i = 1; i > 64;
lo = (uint64_t)x;
str[i] = hi + 0x30;
}
str[5] = 0;
}
https://gist.github.com/aqrit/2997713a00ad043b4bac42e342294259
Updated gist to 64-bits. I’ve not checked the generated assembly. Not benchmarked against the other implementations because a uint64_t should have 20 decimal digits…
ICX AVX2 numbers look pretty nice, although “-march=native” was needed to get all three SIMD versions.
# make
c++ -O3 -march=native -Wall -Wextra -std=c++17 -o convert convert.cpp
# ./convert
khuong 7.15067
backlinear 30.8381
linear 21.6466
tree 14.2078
treetst 10.0964
treest 6.32523
treebt 2.30575
sse2 4.81692
sse2(2) 4.70681
avx2 2.00375
khuong 7.15235
backlinear 30.8286
linear 21.6496
tree 14.2603
treetst 10.0969
treest 6.32516
treebt 2.30584
sse2 4.81617
sse2(2) 4.70639
avx2 2.00354
khuong 7.15085
backlinear 30.8319
linear 21.6403
tree 14.2665
treetst 10.0935
treest 6.32525
treebt 2.30579
sse2 4.81627
sse2(2) 4.70653
avx2 2.00359
ICX AVX2 numbers look pretty nice, although “-march=native” was needed to get all three SIMD versions.
# make
c++ -O3 -march=native -Wall -Wextra -std=c++17 -o convert convert.cpp
# ./convert
khuong 7.15067
backlinear 30.8381
linear 21.6466
tree 14.2078
treetst 10.0964
treest 6.32523
treebt 2.30575
sse2 4.81692
sse2(2) 4.70681
avx2 2.00375
khuong 7.15235
backlinear 30.8286
linear 21.6496
tree 14.2603
treetst 10.0969
treest 6.32516
treebt 2.30584
sse2 4.81617
sse2(2) 4.70639
avx2 2.00354
khuong 7.15085
backlinear 30.8319
linear 21.6403
tree 14.2665
treetst 10.0935
treest 6.32525
treebt 2.30579
sse2 4.81627
sse2(2) 4.70653
avx2 2.00359
Can we run this type of coding syntax on an online compiler like this https://www.interviewbit.com/online-java-compiler/
Are approaches that only work for streams of many integers of interest? I’m calculating a base-10 checksum of all 10 digits of a u32 in <1ns per u32 with AVX2. This includes an intermediate step in which the length of the encoded string (without leading zeros) is calculated, but of course it does not include outputting the string.
The layout of the digits within the vectors is pretty bad for conversion-to-string purposes: the digits for a single integer are spread out between 2 adjacent bytes of 5 different vectors. It seems like it would be not so expensive to fix the layout to be friendlier for outputting, especially if the task was just to output u32 of at most 8 digits or u64 of at most 16 digits. I don't know a clever way to output u64 of 20 digits though, I would just have to add another scalar modmul at the beginning to split the u64 into groups of (4,8,8) digit.
To compute the length, if you imagine you have the groups of 4 digits in 4 vectors, you can take max(andnot(digits == 0, digitcount))+1 to get the length where digitcount is a constant:
[3,2,1,0]
[7,6,5,4]
[11,10,9,8]
[15,14,13,12]
and max() is actually a sequence of 3 max, shuffle, max, shuffle, max. After that you can use arithmetic to calculate the proper shuffle to move the digits into the front of the vector so you can output a single formatted integer. You end up with a bunch of data dependencies but no branches. The data dependencies will surely rain on your parade though.