27th July 2023, 3 min read
Decoding base16 sequences quickly

4 thoughts on “Decoding base16 sequences quickly”

aqrit says:
July 27, 2023 at 7:43 pm
Geoff Langdale’s implementation was likely meant to be SSE2 compatible, whereas vectorized table lookups require SSSE3.

Daniel Lemire says:
July 27, 2023 at 9:28 pm
You can find the implementation there:
https://github.com/WojciechMula/toys/blob/master/simd-parse-hex/geoff_algorithm.cpp

sasuke420 says:
August 5, 2023 at 6:10 pm
For my current solution to this sort of problem at https://highload.fun/ I am using this sequence:
const u8x32 pack_odd = _mm256_setr_epi8(
15, 13, 11, 9, 7, 5, 3, 1, 15, 13, 11, 9, 7, 5, 3, 1,
15, 13, 11, 9, 7, 5, 3, 1, 15, 13, 11, 9, 7, 5, 3, 1);
....
const u8x32 f_0 = _mm256_slli_epi16(e_0, 12);
const u8x32 g_0 = _mm256_or_si256(f_0, e_0);
const u8x32 h_0 = _mm256_shuffle_epi8(g_0, pack_odd);
rather than something like
__m128i t3 = _mm_maddubs_epi16(v, _mm_set1_epi16(0x0110));
__m128i t5 = _mm_packus_epi16(t3, t3);
I’ll have to try that out. The docs say I’ll suffer some latency loss, but it could still be a win.

sasuke420 says:
August 5, 2023 at 6:12 pm
Well, now that I look at what I’ve posted it looks like I am packing and bswapping at the same time, so I would need the shuffle anyway.
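To make the comparison in the last two comments concrete, here is a minimal scalar sketch in C of what the two sequences compute for one 128-bit lane. It assumes each input byte already holds a nibble value in 0..15, which is the role e_0 and v appear to play above. The function names are made up for illustration, and the loops only model the intrinsics named in the comments; this is not the commenters' actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (hypothetical helper) of
 *   _mm_maddubs_epi16(v, _mm_set1_epi16(0x0110)) followed by _mm_packus_epi16.
 * In each 16-bit lane the even byte is multiplied by 0x10 and the odd byte
 * by 0x01, and the two products are added. */
static void pack_pairs_madd_model(const uint8_t nib[16], uint8_t out[8]) {
    for (size_t i = 0; i < 8; i++) {
        assert(nib[2 * i] <= 15 && nib[2 * i + 1] <= 15);
        unsigned lane = nib[2 * i] * 0x10u + nib[2 * i + 1] * 0x01u;
        /* packus narrows with unsigned saturation; lane never exceeds 255 here */
        out[i] = (uint8_t)(lane > 255u ? 255u : lane);
    }
}

/* Scalar model (hypothetical helper) of the shift / or / shuffle sequence,
 * restricted to one 128-bit lane. */
static void pack_pairs_shift_shuffle_model(const uint8_t nib[16], uint8_t out[8]) {
    uint8_t g[16];
    for (size_t i = 0; i < 8; i++) {
        assert(nib[2 * i] <= 15 && nib[2 * i + 1] <= 15);
        /* little-endian 16-bit lane: even byte is the low half */
        uint16_t e = (uint16_t)(nib[2 * i] | (nib[2 * i + 1] << 8));
        uint16_t f = (uint16_t)(e << 12);   /* models _mm256_slli_epi16(e_0, 12) */
        uint16_t lane = (uint16_t)(f | e);  /* models _mm256_or_si256(f_0, e_0)  */
        g[2 * i] = (uint8_t)(lane & 0xFF);
        g[2 * i + 1] = (uint8_t)(lane >> 8); /* odd byte: first * 16 + second */
    }
    /* shuffle indices 15, 13, ..., 1: keep the odd bytes, in reverse order */
    for (size_t i = 0; i < 8; i++) {
        out[i] = g[15 - 2 * i];
    }
}
```

Feeding both functions the nibble values 0, 1, ..., 15 (the digits of "0123456789abcdef") gives 01 23 45 67 89 ab cd ef from the first model and the same eight bytes in reverse order from the second, which matches the observation that the shuffle packs and byte-swaps in one step.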