// encode a 64-byte input chunk into 72 output bytes
uint64_t high = 0;
for (size_t i = 0; i < 8; i++)
result[0] = encode(input[0], high);
result[8] = high;
By my count this is fewer operations and all basic ops: no potentially expensive multiply. Also, it processes 8 bytes at a time, not 7, which keeps all accesses aligned (this matters much more on non-x86).
On x86, you could take advantage of LEA, which can do (bits << 1) + x in a single instruction, but because you’re shifting left, not right, I think you can only process 7 bytes at a time (like your suggested algorithm), losing some benefit. Still, even at 7 bytes i calculate this as fewer total ops.
randomPostersays:
You must stop your loop one step before, and encode 56bytes in 64bytes, otherwise result[8] will not fit in ascii.
Travis Downssays:
Yes, that’s right: it should only be 7 iterations of the inner loop before storing the upper bits.
Travis Downssays:
The formatter wrecks the indentation in the code samples, but that should be:
for (size_t i = 0; i < 8; i++) result[i] = encode(input[i], high);
result[8] = high;
Steven Stewart-Gallussays:
Less instructions is not always faster.
Not sure what the best way to exploit ILP is though.
Is the input range actually up to 256, rather than 248 as the comment currently suggests?
Yes. You are correct.
To encode in four instructions:
uint64_t convert_to_ascii(uint64_t x) {
return ((0x2040810204081 * (x & 0x80808080808080)) & 0xff80808080808080) ^ x;
}
If you are willing to use a bigger chunk size, you can probably do it faster with something like:
uint64_t encode(uint64_t bits, uint64_t& high) {
high = (high >> 1) | (bits & 0x8080808080808080) ;
return bits & 0x7F7F7F7F7F7F7F7F;
}
// encode a 64-byte input chunk into 72 output bytes
uint64_t high = 0;
for (size_t i = 0; i < 8; i++)
result[0] = encode(input[0], high);
result[8] = high;
By my count this is fewer operations and all basic ops: no potentially expensive multiply. Also, it processes 8 bytes at a time, not 7, which keeps all accesses aligned (this matters much more on non-x86).
On x86, you could take advantage of LEA, which can do (bits << 1) + x in a single instruction, but because you’re shifting left, not right, I think you can only process 7 bytes at a time (like your suggested algorithm), losing some benefit. Still, even at 7 bytes i calculate this as fewer total ops.
You must stop your loop one step before, and encode 56bytes in 64bytes, otherwise result[8] will not fit in ascii.
Yes, that’s right: it should only be 7 iterations of the inner loop before storing the upper bits.
The formatter wrecks the indentation in the code samples, but that should be:
for (size_t i = 0; i < 8; i++) result[i] = encode(input[i], high);
result[8] = high;
Less instructions is not always faster.
Not sure what the best way to exploit ILP is though.