We want to store the content to a string. The string is not (in my mind) always aligned on a two-byte boundary. Now, if it is, your code is indeed simpler. But the code from the blog post won’t trigger undefined behaviour.
Is your code faster? You seem to think so, but you do not report any benchmark. I remind you that I publish my code, so you can modify it and see whether you can make it go faster.
Of course, which compiler you use matters, and which settings.
May be worth tossing in a comparison with the ‘traditional’ approach, that is roughly:
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void encode_avx2(const uint8_t *restrict source, size_t len, char *restrict target) {
  __m256i lomask = _mm256_set1_epi8(0x0F);
  __m256i himask = _mm256_set1_epi8(0xF0);
  __m256i letter = _mm256_set1_epi8(9);
  __m256i numberdiff = _mm256_set1_epi8('0');
  __m256i letterdiff = _mm256_set1_epi8('a' - 10);
  for (size_t i = 0; i + 32 <= len; i += 32) {
    __m256i input = _mm256_loadu_si256((const __m256i *)(source + i));
    __m256i sinput = _mm256_permute4x64_epi64(input, 0b11011000);
    __m256i hs = _mm256_and_si256(sinput, himask);
    __m256i hi = _mm256_srli_epi16(hs, 4);
    __m256i lo = _mm256_and_si256(sinput, lomask);
    __m256i loletter = _mm256_cmpgt_epi8(lo, letter);
    __m256i hiletter = _mm256_cmpgt_epi8(hi, letter);
    __m256i looffset = _mm256_blendv_epi8(numberdiff, letterdiff, loletter);
    __m256i hioffset = _mm256_blendv_epi8(numberdiff, letterdiff, hiletter);
    __m256i loout = _mm256_add_epi8(lo, looffset);
    __m256i hiout = _mm256_add_epi8(hi, hioffset);
    __m256i firstt = _mm256_unpacklo_epi8(loout, hiout);
    __m256i second = _mm256_unpackhi_epi8(loout, hiout);
    _mm256_storeu_si256((__m256i *)(target), firstt);
    target += 32;
    _mm256_storeu_si256((__m256i *)(target), second);
    target += 32;
  }
}
(Warning: completely untested.)
This is just a fairly rote translation of:
uint8_t lo = input & 0xF;
uint8_t hi = (input >> 4) & 0xF;
lo += (lo > 9) ? ('a' - 10) : '0';
hi += (hi > 9) ? ('a' - 10) : '0';
The shuffle approach should be faster. 2x vpshufb versus 2x vpcmpgtb and 2x vpblendvb and 2x vpaddb.
The comment in the K4os.Text.BaseX README is slightly out-of-date; there is now an optimised Base16 encoding method in .NET: Convert.ToHexString.
The source code is vectorised: https://github.com/dotnet/runtime/blob/main/src/libraries/Common/src/System/HexConverter.cs
I did not profile the built-in .NET implementation against the one you linked to.
(Turns out I wrote the Ascii85 test code that Milosz linked to in the README, though!)
Man, that memcpy looks criminal there. Why not make your table a table of shorts?
Then you could write:
for (; len; len--) {
  *target++ = table[*source++];
}
Don’t be shy with pointer arithmetic, either. Most likely the compiler would do the right thing for you, but you can never know.
And then people doubtful of compilers go further and unroll the loop. Sometimes it makes a difference, sometimes it doesn’t.
And still this is not half of what you could do.
Compare the SIMD code with the best non SIMD code you can write.
Otherwise, what’s the point?
Using GCC with -O3, your code compiles to essentially the same loop as the code from the blog post; the two listings look very similar to me.
It also looks like what you’d expect. Load the byte value, look up the short, store the short, increment, check if we are at the end of the loop.
Unrolling might reduce the instruction count, but GCC is not eager to unroll, even with -O3. I would also not be tempted to unroll.
So I think that the code from the blog post is decent and high performance.
In this case, the memcpy was nicely optimized away. When that does not happen, all bets are off.
On my system (i5-6200U CPU @ 2.30GHz from 2015), this:
size_t i = len & 3;
for (len >>= 2; len; len--) {
  *target++ = table[*source++];
  *target++ = table[*source++];
  *target++ = table[*source++];
  *target++ = table[*source++];
}
for (; i; i--) {
  *target++ = table[*source++];
}
runs in .45s.
Admittedly, one never knows when loop unrolling pays off.
I don’t know what happened with my earlier reply.
The site made a mess of it.
Hmm. Same thing again: some text is removed in the middle. Strange bug; maybe there’s some character it doesn’t like. (The code is the same as above, with target being a short *.)
Right. I can’t do this.
What I was saying is that the original code takes .62s.
The unrolled code is .45s.