Fast base16 encoding
Given binary data, we often need to encode it as ASCII text. Email and much of the web effectively work in this manner.
A popular format for this purpose is base64. With Muła, we showed that we could achieve excellent speed using vector instructions on commodity processors (2018, 2020). However, base64 is a bit tricky.
A much simpler format is base16: you transcribe each byte into two bytes representing its value in hexadecimal notation. Thus the byte value 1 becomes the two bytes ‘01’, the byte value 255 becomes ‘FF’, and so forth. In other words, you use one byte (one character) per ‘nibble’: a byte is made of two nibbles, the most-significant 4 bits and the least-significant 4 bits.
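To make the nibble decomposition concrete, here is a one-byte sketch (my own illustration, not part of the actual routines):

    #include <cstdint>
    #include <cstdio>

    int main() {
      const char *digits = "0123456789abcdef";
      uint8_t byte = 0xab;
      char hi = digits[byte >> 4];  // most-significant nibble: 0xa -> 'a'
      char lo = digits[byte & 0xf]; // least-significant nibble: 0xb -> 'b'
      printf("%c%c\n", hi, lo);     // prints "ab"
    }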
How could we encode base16 quickly? A reasonable approach might be to use a table: grab one byte from the input and directly look up the two output bytes, which you immediately write out:
#include <cstddef>
#include <cstdint>
#include <cstring>

void encode_scalar(const uint8_t *source, size_t len, char *target) {
  // 256 entries: each byte value maps to two ASCII characters, stored so
  // that a little-endian memcpy writes the character for the
  // most-significant nibble first
  const uint16_t table[] = {
      0x3030, 0x3130, 0x3230, 0x3330, 0x3430, ...
      0x6366, 0x6466, 0x6566, 0x6666};
  for (size_t i = 0; i < len; i++) {
    uint16_t code = table[source[i]];
    ::memcpy(target, &code, 2);
    target += 2;
  }
}
It requires a 512-byte table but that is not concerning.
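For illustration, here is one way the full 256-entry table could be generated at startup (a sketch; the init_table helper is hypothetical, and the byte layout assumes a little-endian machine, matching the memcpy above):

    #include <cstddef>
    #include <cstdint>

    static uint16_t hex_table[256];

    void init_table() {
      const char *digits = "0123456789abcdef";
      for (size_t i = 0; i < 256; i++) {
        // the low-order byte holds the character for the most-significant
        // nibble, so that a little-endian memcpy writes it out first
        hex_table[i] = (uint16_t)((uint8_t)digits[i >> 4] |
                                  ((uint16_t)(uint8_t)digits[i & 0xf] << 8));
      }
    }

On a big-endian system, the two bytes in each entry would need to be swapped.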
Could we do better?
Milosz Krajewski wrote some good-looking C# code using vector instructions. I wrote what should be the equivalent in x64 C++. We both have routines for 128-bit and 256-bit vectors. My code is for demonstration purposes, but it is essentially correct.
The core idea is not complicated. You grab a vector of bytes. Then you expand it out: each nibble must go into its own byte. And then the magic is this: we use a fast vectorized lookup (e.g., the pshufb instruction) to look up each nibble in a 16-byte table containing the characters ‘0’, ‘1’… ‘9’, ‘a’… ‘f’.
Here is the 128-bit code using Intel intrinsics:
// requires <immintrin.h>; uses SSSE3 (pshufb) via _mm_shuffle_epi8
__m128i shuf = _mm_set_epi8('f', 'e', 'd', 'c', 'b', 'a', '9', '8', '7', '6',
                            '5', '4', '3', '2', '1', '0');
__m128i maskf = _mm_set1_epi8(0xf);
size_t i = 0;
for (; i + 16 <= len; i += 16) {
  __m128i input = _mm_loadu_si128((const __m128i *)(source + i));
  // least-significant nibble of each byte
  __m128i inputbase = _mm_and_si128(maskf, input);
  // most-significant nibble of each byte
  __m128i inputs4 = _mm_and_si128(maskf, _mm_srli_epi16(input, 4));
  // interleave so the high-nibble value precedes the low-nibble value
  __m128i firstpart = _mm_unpacklo_epi8(inputs4, inputbase);
  // nibble -> ASCII character via the 16-byte table lookup (pshufb)
  __m128i output1 = _mm_shuffle_epi8(shuf, firstpart);
  __m128i secondpart = _mm_unpackhi_epi8(inputs4, inputbase);
  __m128i output2 = _mm_shuffle_epi8(shuf, secondpart);
  _mm_storeu_si128((__m128i *)(target), output1);
  target += 16;
  _mm_storeu_si128((__m128i *)(target), output2);
  target += 16;
}
// the tail (len % 16 bytes) can be handled with the scalar routine
The 256-bit code is roughly the same, with one extra instruction to shuffle the bytes to compensate for the fact that 256-bit instructions work ‘per lane’ (in units of 128-bit words). My source code is available.
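For illustration, the 256-bit loop might look like the following (a sketch of my reconstruction, assuming AVX2; the actual code in my repository may differ in details). The permute is the extra instruction mentioned above: it fixes the per-lane behaviour of the 256-bit unpack instructions.

    __m256i shuf = _mm256_set_epi8(
        'f', 'e', 'd', 'c', 'b', 'a', '9', '8', '7', '6', '5', '4', '3',
        '2', '1', '0', 'f', 'e', 'd', 'c', 'b', 'a', '9', '8', '7', '6',
        '5', '4', '3', '2', '1', '0');
    __m256i maskf = _mm256_set1_epi8(0xf);
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {
      __m256i input = _mm256_loadu_si256((const __m256i *)(source + i));
      // reorder the 64-bit words [a,b,c,d] -> [a,c,b,d] so that the
      // per-lane unpacks below produce contiguous output
      input = _mm256_permute4x64_epi64(input, 0b11011000);
      __m256i inputbase = _mm256_and_si256(maskf, input);
      __m256i inputs4 = _mm256_and_si256(maskf, _mm256_srli_epi16(input, 4));
      __m256i firstpart = _mm256_unpacklo_epi8(inputs4, inputbase);
      __m256i output1 = _mm256_shuffle_epi8(shuf, firstpart);
      __m256i secondpart = _mm256_unpackhi_epi8(inputs4, inputbase);
      __m256i output2 = _mm256_shuffle_epi8(shuf, secondpart);
      _mm256_storeu_si256((__m256i *)(target), output1);
      target += 32;
      _mm256_storeu_si256((__m256i *)(target), output2);
      target += 32;
    }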
We might therefore expect the 256-bit code to be roughly twice as fast. Here are my results on an Ice Lake processor with GCC 11:
| method | speed |
|---|---|
| table lookup | 0.9 GB/s |
| 128-bit vectors | 6.4 GB/s |
| 256-bit vectors | 11 GB/s |
We are not quite twice as fast, but close enough. I do not find these speeds very satisfying: I expect that less naive code could be faster.
Milosz gets much poorer results in C#: the 256-bit code is barely faster than the 128-bit code, but he does some relatively complicated computation in the 256-bit code instead of just calling the 256-bit shuffle instruction (vpshufb). (I suspect that he will soon fix this issue if he can.)
Our code would work on ARM as well after minor changes. For AVX-512 or SVE, we may need different approaches. We could add both encoding and decoding.
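For instance, on 64-bit ARM, the minor changes might look like the following (a NEON sketch I have not benchmarked; the encode_neon name is my own, vqtbl1q_u8 plays the role of pshufb, and vst2q_u8 interleaves the two outputs):

    #include <arm_neon.h>
    #include <cstddef>
    #include <cstdint>

    void encode_neon(const uint8_t *source, size_t len, char *target) {
      const uint8_t digits[16] = {'0', '1', '2', '3', '4', '5', '6', '7',
                                  '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
      uint8x16_t shuf = vld1q_u8(digits);
      uint8x16_t maskf = vdupq_n_u8(0xf);
      size_t i = 0;
      for (; i + 16 <= len; i += 16) {
        uint8x16_t input = vld1q_u8(source + i);
        uint8x16_t lo = vandq_u8(input, maskf); // least-significant nibbles
        uint8x16_t hi = vshrq_n_u8(input, 4);   // most-significant nibbles
        uint8x16x2_t out;
        out.val[0] = vqtbl1q_u8(shuf, hi); // ASCII for the high nibbles
        out.val[1] = vqtbl1q_u8(shuf, lo); // ASCII for the low nibbles
        // interleaved store: high-nibble character, then low-nibble character
        vst2q_u8((uint8_t *)target, out);
        target += 32;
      }
      // the tail (len % 16 bytes) can be handled with the scalar routine
    }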