How about a super lazy and wasteful method where you make a 4GB lookup table? Of those 4GB, only 64KB (the 16^4 possible digits) will ever need to be read, and the chunks should be close enough together that they should get cached. It’s silly, but your metric was instruction count 🙂
uint32_t hex_to_u32_lookup(const uint32_t src) {
  return digittoval[src];
}
My metric is not instruction count! However, the mathematical approach generates many more instructions than you’d like.
Might it be possible to pseudo-vectorize this via bit twiddling, thus better amortizing the math operations?
Yes, it is, see my code. Unfortunately, the SWAR/vector approach I have (borrowed from Mula) has two downsides. One is that it uses a fancy instruction (pext) to gather the nibbles… this instruction is famously slow on AMD processors and does not exist on non-x64 processors. The other problem is that it does not check for bad input.
So I get a 15% performance gain, but I trade portability and error checking.
We need to do better.
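For reference, a pext-based SWAR conversion along these lines might look like the following sketch (not necessarily the benchmark’s exact code; it requires BMI2 and, like the original, does no input validation):

#include <stdint.h>
#include <string.h>
#include <x86intrin.h> // _pext_u32 (BMI2)

uint32_t hex4_swar_pext(const uint8_t *src) {
  uint32_t v;
  memcpy(&v, src, 4);                                 // load 4 ASCII characters (little-endian)
  v = (v & 0x0F0F0F0F) + 9 * ((v >> 6) & 0x01010101); // per byte: (c & 0xF) + 9 * (c >> 6)
  v = __builtin_bswap32(v);                           // make src[0] the most significant byte
  return _pext_u32(v, 0x0F0F0F0F);                    // gather the four nibbles
}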
some dude says:
maybe (i & 0xF) + (i >> 6 << 3) + (i >> 6) will pipeline better than (c & 0xF) + 9 * (c >> 6), swapping out a multiplication for a shift and an addition (assuming i >> 6 is pipelined well)
I’d be curious about the performance with a 16-bit lookup table, for parsing pairs of hex digits at a time, since nearly all uses of hex lookups are looking them up in pairs anyway. (So, tab[0x3131] = 17)
216 says:
no pext or bswap needed if you do this:
val = (val & 0xf0f0f0f) + 9 * (val >> 6 & 0x1010101);
val = (val | val << 12) & 0xff00ff00;
return (val>>24 | val) & 0xffff;
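(For context, a hedged wrapper for the snippet above: load the four ASCII characters little-endian into val and apply the three steps; there is no input validation.)

#include <stdint.h>
#include <string.h>

uint32_t hex4_swar_216(const uint8_t *src) {
  uint32_t val;
  memcpy(&val, src, 4); // assumes a little-endian target
  val = (val & 0xf0f0f0f) + 9 * (val >> 6 & 0x1010101);
  val = (val | val << 12) & 0xff00ff00;
  return (val >> 24 | val) & 0xffff;
}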
216 says:
Well ok, seems that’d be essentially the same as Lee’s fast algorithm… I guess modern x86s are pretty crazy good at optimizing cached memory accesses if it’s still losing (despite the lack of error checking).
@216 Your approach is more portable but slower than Mula’s according to my updated benchmarks.
I know this is for “short” strings, but since you’re interested in SIMD, there are a few published hex SIMD encoders/decoders that you might enjoy:
My version at https://github.com/zbjornson/fast-hex
Peter Cordes’ methods at https://stackoverflow.com/questions/53823756/how-to-convert-a-number-to-hex/53823757#53823757
Agner Fog’s, https://github.com/darealshinji/vectorclass/blob/master/special/decimal.h#L828
It looks like Cordes’ and Fog’s methods convert numbers to hex, whereas we are going the other way.
Your link seems more relevant, and it is certainly interesting, but you appear to rely critically on having sizeable blocks of hexadecimal characters (like 32 hex characters in a row). That certainly occurs in cryptography, for example.
Do you check that the input is valid?
Zach B says:
You’re right, I misremembered there being string->number procedures in the two later links. My version doesn’t check the input characters, no; it was just an exercise in encoding/decoding without much practical consideration :).
Christopher Chang says:
No love for _mm_packs_epi16() and _mm256_packs_epi16()? It only applies to length >= 16 and really wants length >= 32, but if you care about the speed of this operation, it’s pretty likely you’re dealing with such lengths.
Basic idea: there are a bunch of ways to use SIMD operations to replace all '0'..'9' bytes in a vector with 0..9 byte values, and 'A'..'F'/'a'..'f' bytes with 10..15 byte values. From there,
(“m8” is assumed to have even bytes initialized to 0, and odd bytes to 0xff. “base16_vec” is the result of the aforementioned preprocessing step.)
// Copy the low bits of byte 1, 3, 5, ... to the high bits of byte 0, 2, 4, ... so that the even positions have the final values of interest.
base16_vec = _mm_or_si128(base16_vec, _mm_srli_epi64(base16_vec, 4));
// Mask out odd bytes.
__m128i prepack0 = _mm_and_si128(base16_vec, m8);
// Gather and pack even bytes. This can also be done with _mm_shuffle_epi8.
// prepack1 corresponds to another 16 bytes of the input. If they don't exist, you can put an arbitrary second argument, and then store just the bottom 8 bytes of the result.
__m128i packed = _mm_packs_epi16(prepack0, prepack1);
_mm_storeu_si128(target, packed);
Throw a _mm256_permute4x64_epi64(…, 0xd8) in the middle to make the AVX2 version of this work, since both _mm256_packs_epi16() and _mm256_shuffle_epi8() handle each lane separately.
Christopher Chang says:
Oops, just noticed the “short” part of the title, so I guess the code above isn’t so relevant here.
there are a bunch of ways to use SIMD operations to replace all '0'..'9' bytes in a vector with 0..9 byte values, and 'A'..'F'/'a'..'f' bytes with 10..15 byte values
I think that the efficiency of this part in particular is important… and, for extra points, whether you can also detect bad inputs in the process.
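One possible answer (a sketch, not code from the post or from the comment above): with SSE2/SSE4.1, the '0'..'9'/'a'..'f' mapping and the bad-input check can be done with a handful of compares, a blend and a movemask.

#include <stdint.h>
#include <smmintrin.h> // SSE4.1 (for _mm_blendv_epi8)

// Returns 1 and writes per-byte values 0..15 to *out if all 16 bytes of "in"
// are ASCII hex digits; returns 0 otherwise.
static int ascii_hex_to_nibbles(__m128i in, __m128i *out) {
  __m128i lower = _mm_or_si128(in, _mm_set1_epi8(0x20)); // fold letters to lowercase
  __m128i is_digit = _mm_and_si128(_mm_cmpgt_epi8(in, _mm_set1_epi8('0' - 1)),
                                   _mm_cmplt_epi8(in, _mm_set1_epi8('9' + 1)));
  __m128i is_alpha = _mm_and_si128(_mm_cmpgt_epi8(lower, _mm_set1_epi8('a' - 1)),
                                   _mm_cmplt_epi8(lower, _mm_set1_epi8('f' + 1)));
  if (_mm_movemask_epi8(_mm_or_si128(is_digit, is_alpha)) != 0xFFFF)
    return 0; // at least one byte is not a hex digit
  __m128i digit_val = _mm_sub_epi8(in, _mm_set1_epi8('0'));
  __m128i alpha_val = _mm_sub_epi8(lower, _mm_set1_epi8('a' - 10));
  *out = _mm_blendv_epi8(alpha_val, digit_val, is_digit); // pick per byte
  return 1;
}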
aqrit says:
uint32_t hex_to_u32_test(const uint8_t *src) {
  uint32_t in;
  uint64_t v, x;
  const int64_t magic = INT64_C(0x1001001000000000);
  memcpy(&in, src, 4);
  v = in;
  x = (((0x00404040 & v) >> 6) * 9) + (v & 0x000F0F0F); // do 3
  x = (((uint64_t)((int64_t)x * magic)) >> 48) & ~15;   // bswap and pack
  v = ((v >> 30) * 9) + ((v >> 24) & 0x0F);             // do the 4th
  return (x | v);
}
Experiment: replace pext with multiply/shift
The benchmark is returning strange numbers on my PC…
It is nice because you get rid of pext. The net result is slower but more portable.
According to your benchmarking code, looking up 2 digits at a time is faster without affecting cache misses on average.
uint32_t lookup2[65536];

void init_lookup2() {
  for (int i = 0; i < 0x10000; i++) {
    lookup2[i] = (uint32_t) -1;
  }
  for (int i = 0; i < 256; i++) {
    char digits[3];
    sprintf(digits, "%02x", i);
    uint16_t lvalue;
    memcpy(&lvalue, digits, 2);
    lookup2[lvalue] = i;
    char digits_upper[3];
    digits_upper[0] = toupper(digits[0]);
    digits_upper[1] = toupper(digits[1]);
    digits_upper[2] = 0;
    if ((digits_upper[0] != digits[0]) ||
        (digits_upper[1] != digits[1])) {
      memcpy(&lvalue, digits_upper, 2);
      lookup2[lvalue] = i;
    }
  }
}

uint32_t hex_2bytes_lookup(const uint8_t *src) {
  uint32_t v1 = lookup2[((uint16_t*) src)[0]];
  uint32_t v2 = lookup2[((uint16_t*) src)[1]];
  return v1 << 8 | v2;
}
I have added it to my benchmark and you are correct. It is really fast.
Philipp says:
Is there a specific reason for not using -O3?
Compared to -O2, -O3 leads to “math” being the most efficient approach with 2.63 cycles per 4-character hex string (“lookup”: 3.51) and an instruction count of 13.3 (lookup: 14.3) on a Skylake CPU.
Interestingly, modifying the line (“lookup” approach):
return static_cast<uint32_t>(v1 << 12 | v2 << 8 | v3 << 4 | v4);
to
return static_cast<uint32_t>(v1 *16*16*16 + v2 *16*16 + v3 *16 + v4);
while using -O2 + -fpredictive-commoning (part of -O3) leads to this customized “lookup” being the best approach, with ~10% reduced cycle count [and ~25% reduced cache misses with -O3]. I wasn’t expecting that!
Is there a specific reason for not using -O3? Compared to -O2, -O3 leads to “math” being the most efficient approach with 2.63 cycles per 4-character hex string (“lookup”: 3.51) and an instruction count of 13.3 (lookup: 14.3) on a Skylake CPU.
On GCC, -O3 enables autovectorization, which is likely able to totally defeat the benchmark. I have switched the flags to -O3, but I have added function attributes to prevent inlining (to ensure that each 4-character string is actually processed) and there is no change.
degski says:
I cannot run your benchmark code (on Windows), but this is an option (maybe):
#include <frozen/unordered_map.h>
#include <cstdlib>
#include <iostream>
int main ( ) {
constexpr frozen::unordered_map<char, char, 22> lookup {
{ '0', 0 },
{ '1', 1 },
{ '2', 2 },
{ '3', 3 },
{ '4', 4 },
{ '5', 5 },
{ '6', 6 },
{ '7', 7 },
{ '8', 8 },
{ '9', 9 },
{ 'a', 10 },
{ 'b', 11 },
{ 'c', 12 },
{ 'd', 13 },
{ 'e', 14 },
{ 'f', 15 },
{ 'A', 10 },
{ 'B', 11 },
{ 'C', 12 },
{ 'D', 13 },
{ 'E', 14 },
{ 'F', 15 }
};
std::cout << ( int ) lookup.at ( 'A' ) << '\n';
return EXIT_SUCCESS;
}
https://github.com/serge-sans-paille/frozen
Swapping out the frozen::unordered_map (size 1088 bytes) for a frozen::map (size 45 bytes) possibly improves things due to better locality.
#include <frozen/map.h>
#include <cstdlib>
#include <iostream>
int main ( ) {
alignas ( 64 ) constexpr frozen::map<char, char, 22> lookup {
{ '0', 0 },
{ '1', 1 },
{ '2', 2 },
{ '3', 3 },
{ '4', 4 },
{ '5', 5 },
{ '6', 6 },
{ '7', 7 },
{ '8', 8 },
{ '9', 9 },
{ 'a', 10 },
{ 'b', 11 },
{ 'c', 12 },
{ 'd', 13 },
{ 'e', 14 },
{ 'f', 15 },
{ 'A', 10 },
{ 'B', 11 },
{ 'C', 12 },
{ 'D', 13 },
{ 'E', 14 },
{ 'F', 15 }
};
std::cout << ( int ) lookup.at ( 'A' ) << '\n';
std::cout << sizeof ( frozen::map<char, char, 22> ) << '\n';
return EXIT_SUCCESS;
}
I have implemented and benchmarked your ‘frozen’ proposal. See my README.
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2019/04/17
The numbers are not good.
Thanks for adding those two!
It does indeed not look good at all [I wasn’t expecting anything stellar, but still]. Interesting and food for thought! The unordered_map is faster; branch misses seem to be the culprit of it all. What’s going on with a 45-byte map?
Another method, used in some base64 decoders, is four 256-entry uint32 LUTs (4 kB), such as in:
Nick Galbreath: client9/stringencoders
Alfred Klomp: aklomp/base64
That allows getting rid of all the shifts and having them precomputed. It still has error checking:
__attribute__ ((noinline))
uint32_t hex_to_u32_lookup4x32(const uint8_t *src) {
  uint32_t v1 = digittoval1[src[0]];
  uint32_t v2 = digittoval2[src[1]];
  uint32_t v3 = digittoval3[src[2]];
  uint32_t v4 = digittoval4[src[3]];
  return static_cast<uint32_t>(v1 | v2 | v3 | v4);
}
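For completeness, the four tables could be filled along these lines (a sketch; the exact table contents used here aren’t shown, and the use of -1 as the “invalid” marker so that it poisons the OR is an assumption):

#include <stdint.h>

uint32_t digittoval1[256], digittoval2[256], digittoval3[256], digittoval4[256];

void init_lookup4x32(void) {
  for (int c = 0; c < 256; c++) {
    int v = -1;
    if (c >= '0' && c <= '9') v = c - '0';
    else if (c >= 'a' && c <= 'f') v = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') v = c - 'A' + 10;
    if (v < 0) {
      // invalid characters map to all ones, so any bad digit makes the OR exceed 0xFFFF
      digittoval1[c] = digittoval2[c] = digittoval3[c] = digittoval4[c] = (uint32_t)-1;
    } else {
      digittoval1[c] = (uint32_t)v << 12; // first hex digit
      digittoval2[c] = (uint32_t)v << 8;  // second
      digittoval3[c] = (uint32_t)v << 4;  // third
      digittoval4[c] = (uint32_t)v;       // fourth
    }
  }
}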
The 4 LUTs can even be “compressed” a bit by getting rid of the redundant “-1” entries between LUTs (the compiler/linker might already do this merging, but this way one can make sure, and it also works for an assembly implementation that uses the same LUT, where we can’t rely on the compiler/linker to do the job).
__attribute__ ((noinline))
uint32_t hex_to_u32_lookup_mayeut(const uint8_t *src) {
  uint32_t v1 = static_cast<uint32_t>(digittoval32[0 + src[0]]);
  uint32_t v2 = static_cast<uint32_t>(digittoval32[210 + src[1]]);
  uint32_t v3 = static_cast<uint32_t>(digittoval32[420 + src[2]]);
  uint32_t v4 = static_cast<uint32_t>(digittoval32[630 + src[3]]);
  return v1 | v2 | v3 | v4;
}
This is as fast as Mula’s method on my Haswell while still having checked input.
It’s more portable than both Mula’s method and the big LUT implementation.
Portability of Mula’s method was already discussed.
Portability of the big LUT: it casts the input uint8_t pointer to a uint16_t pointer, which requires stricter alignment (UBSan will complain about unaligned accesses). That does not matter on x86/x64, where unaligned access is supported and fast, but it might crash or be dead slow on other architectures when the input is not aligned.
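If alignment is a concern, the same 2-digit table can be used with memcpy loads instead of the uint16_t* cast; a small sketch (behavior otherwise unchanged, including the dependence on the host byte order):

#include <stdint.h>
#include <string.h>

extern uint32_t lookup2[65536]; // the 2-digit table defined earlier in the thread

uint32_t hex_2bytes_lookup_portable(const uint8_t *src) {
  uint16_t k1, k2;
  memcpy(&k1, src, 2); // no alignment requirement
  memcpy(&k2, src + 2, 2);
  return lookup2[k1] << 8 | lookup2[k2];
}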
foobar says:
I wonder how rewriting convertone using a conditional move instruction (for example, lea+lea+cmp+cmov) would compare.
More interestingly, I think this could be performed in a vectored fashion even with pre-AVX instructions: a couple of 8-bit-wide add, cmp* and blend operations. With an 8-bit-wide shuffle and a 16-bit-wide madd (all of these are relatively lightweight instructions), one could construct the 16-bit parts of the decoded integer in a couple of extra instructions…(?)
I might be horribly mistaken as I didn’t really try to implement it, but maybe this (four hexadecimal digits to an int) could be accomplished in six vectorized computing instructions plus a couple of register moves! Eight hexadecimal digits could maybe be accomplished with an additional packssdw.
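A branchless scalar convertone along the lines of the cmov idea above might look like this (hypothetical sketch: it assumes convertone maps one ASCII character to its value and it ignores invalid input, which the real function would still have to reject):

#include <stdint.h>

// A per-character conversion that a compiler can typically lower to lea/cmp/cmov.
static inline uint32_t convertone_cmov(uint8_t c) {
  uint32_t digit  = (uint32_t)(c - '0');               // value if '0'..'9'
  uint32_t letter = (uint32_t)((c | 0x20) - 'a' + 10); // value if 'a'..'f' or 'A'..'F'
  return (c <= '9') ? digit : letter;                  // branchless select
}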
foobar says:
OK, case insensitivity probably requires one more AND instruction. Nonetheless, a vectored version operating essentially on 64-bit-wide registers should be roughly comparable in performance to the non-vectored variants… (in theory.)
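As a rough illustration of the madd step (a sketch that assumes the bytes already hold 0..15, for instance after a mapping step like the one sketched earlier in the thread):

#include <tmmintrin.h> // SSSE3: _mm_maddubs_epi16

// Combine adjacent nibble values into bytes: each 16-bit lane becomes hi*16 + lo.
static __m128i combine_nibble_pairs(__m128i nibbles) {
  const __m128i weights = _mm_set1_epi16(0x0110); // byte pattern {16, 1} repeated
  __m128i bytes16 = _mm_maddubs_epi16(nibbles, weights);
  // A second madd (e.g. _mm_madd_epi16 with weights {256, 1}) or a pack
  // can then build the 16-bit values described above.
  return _mm_packus_epi16(bytes16, bytes16);
}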
Wilco says:
The use of -O3 combined with a loop increment of 1 (rather than 4) causes the benchmark to report incorrect data. GCC unrolls the loop by 3x and removes many of the loads (since the same lookups are done 4 times for each byte!) so that there are only 9 loads per 3 iterations instead of the expected 3 * 8 = 24. This significantly overestimates performance, particularly of the inlined variants.
Changing the increment to 4 generates the expected code. E.g. for AArch64:
.L281:
ldrb w5, [x0, 1]
add x6, x6, 4
ldrb w1, [x0]
cmp x19, x6
ldrb w4, [x0, 2]
add x0, x0, 4
ldrb w7, [x0, -1]
ldrsb w5, [x3, w5, sxtw]
ldrsb w1, [x3, w1, sxtw]
ldrsb w4, [x3, w4, sxtw]
ldrsb w7, [x3, w7, sxtw]
lsl w5, w5, 8
orr w1, w5, w1, lsl 12
orr w4, w7, w4, lsl 4
orr w1, w1, w4
add x23, x23, x1
bhi .L281
It confirms that the parser loop is load-limited. We can optimize this by reading a word at a time, extracting the bytes using masking, and then doing the lookups. This way you need at most 5 loads every 4 bytes rather than 8.
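A sketch of that suggestion (hexval is a hypothetical 256-entry per-character table mapping ASCII to 0..15): one 4-byte load plus four table loads, instead of four byte loads plus four table loads.

#include <stdint.h>
#include <string.h>

extern const uint32_t hexval[256]; // hypothetical per-character table

uint32_t hex_to_u32_word_lookup(const uint8_t *src) {
  uint32_t w;
  memcpy(&w, src, 4); // one load; assumes little-endian byte order
  return hexval[w & 0xFF] << 12 | hexval[(w >> 8) & 0xFF] << 8 |
         hexval[(w >> 16) & 0xFF] << 4 | hexval[w >> 24];
}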
Thank you. I will modify the benchmark code.
It starts to make more sense, with less difference between the inlined and out-of-line versions (as you’d expect). However, the inlined big-lookup result at 2.6 instructions per 4 bytes is obviously wrong – it should be 11. I didn’t check the rest, but there are more odd results, for example the empty function instruction count/cycles (presumably it was optimized since the function has no side effects).
To be sure the code is as expected you always need to disassemble, especially when it’s a benchmark. Compilers are far smarter than most people think, it’s typical for benchmark loops to be optimized away!
(Comment edited.)
(…) for example the empty function instruction count/cycles (presumably it was optimized since the function has no side effects)
My belief is that GCC will never optimize away a noinline function, even if it has no side effect.
(Update: this is wrong under the -O3 flag but not under the -O2 flag.)
it's typical for benchmark loops to be optimized away
I am explicitly using noinline attributes. I have the following comment throughout my source code…
> __attribute__((noinline)) // we do not want the compiler to rewrite the problem
My expectation is that with GCC, such functions do not get inlined and thus we genuinely benchmark the cost of the function (with additional overhead due to the function call).
I left a couple of inlineable functions, marking them as such for reference. Maybe this is a source of confusion?
I will delete them right away.
Interesting. -O3 under GCC seems to allow cheating on the noinline attribute.
I have reverted to -O2 and kept the noinline attributes. And I now check that the compiler is GCC.
I have removed all inlineable functions.
Wilco says:
GCC will always honor the noinline attribute, but that doesn’t stop it from optimizing the call out of the loop if it is pure or const. Inline functions are fine but they need to be given real work to do, otherwise the compiler will just remove the loop.
Looking at the assembly, what seems to be happening is that, under -O3, the compiler figures out that the function is const; while it still calls the function, it can optimize away redundant calls.
You can get around the problem, even under -O3, by actually returning the input itself.
That is, under -O3, it seems that it can optimize away const noinline functions but not all pure functions. Under -O2, it will not optimize away the calls at all.
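A minimal hypothetical illustration of the effect (not the benchmark code): if GCC proves a noinline function const and its argument is loop-invariant, it may keep the call but no longer execute it on every iteration.

#include <stdint.h>

__attribute__((noinline))
uint32_t square(uint32_t x) { // no side effects: GCC can infer "const"
  return x * x;
}

uint64_t sum_squares(uint32_t x, int n) {
  uint64_t sum = 0;
  for (int i = 0; i < n; i++)
    sum += square(x); // loop-invariant argument: the call may be hoisted out
  return sum;         // of the loop even though square is noinline
}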
Inline functions are fine…
It is a typo, right? You meant “noinline functions”?
Note that this does not affect my actual benchmarks… the “bogus” function call was added as a reference but never actually used for anything. I should even remove it.
Wilco says:
No, I did mean inline. On AArch64 I can build with -O3 -Dnoinline=inline and everything except for the “bogus” function executes as expected. I haven’t checked what went wrong with the big lookup previously, but the disassembly for the inlined big lookup seems fine (we could save another 2 instructions if GCC did the address computations more efficiently):
.L593:
add x4, x19, x0
ldrh w1, [x19, x0]
add x0, x0, 4
cmp x21, x0
ldrh w4, [x4, 2]
ldr w1, [x3, x1, lsl 2]
ldr w4, [x3, x4, lsl 2]
orr w1, w4, w1, lsl 8
add x20, x20, x1
bhi .L593