Daniel Lemire's blog


Decoding base16 sequences quickly

4 thoughts on “Decoding base16 sequences quickly”

  1. aqrit says:

    Geoff Langdale’s implementation was likely meant to be SSE2 compatible, whereas vectorized table lookups require SSSE3.
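    To make the SSE2 versus SSSE3 distinction concrete: the usual vectorized table lookup is the byte shuffle `_mm_shuffle_epi8` (`pshufb`), which was introduced with SSSE3. Below is a minimal scalar sketch of what that instruction does on one 128-bit register. The function name `shuffle_epi8_model` is hypothetical and used only for illustration; this models the instruction, it is not code from the post.

```c
#include <stdint.h>

/* Scalar model of _mm_shuffle_epi8(table, idx) on one 128-bit register.
   shuffle_epi8_model is a hypothetical name used only for this sketch. */
static void shuffle_epi8_model(const uint8_t table[16],
                               const uint8_t idx[16],
                               uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        if (idx[i] & 0x80) {
            out[i] = 0;                    /* high bit set: output byte is zeroed */
        } else {
            out[i] = table[idx[i] & 0x0F]; /* otherwise only the low 4 bits select */
        }
    }
}
```

    Each output byte is an independent lookup into a 16-entry table, which is why one instruction can map 16 nibbles at once. SSE2 has no instruction with a per-byte variable index, so an SSE2-only implementation has to rely on comparisons and arithmetic instead.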

  2. sasuke420 says:

    For my current solution to this sort of problem at https://highload.fun/, I am using this sequence:

    const u8x32 pack_odd = _mm256_setr_epi8(
    15, 13, 11, 9, 7, 5, 3, 1, 15, 13, 11, 9, 7, 5, 3, 1,
    15, 13, 11, 9, 7, 5, 3, 1, 15, 13, 11, 9, 7, 5, 3, 1);
    ....
    const u8x32 f_0 = _mm256_slli_epi16(e_0, 12);
    const u8x32 g_0 = _mm256_or_si256(f_0, e_0);
    const u8x32 h_0 = _mm256_shuffle_epi8(g_0, pack_odd);

    rather than something like:

    __m128i t3 = _mm_maddubs_epi16(v, _mm_set1_epi16(0x0110));
    __m128i t5 = _mm_packus_epi16(t3, t3);

    I’ll have to try that out. The docs say I’ll suffer some latency loss, but it could still be a win.
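    For readers following along, here is a rough scalar sketch of what the two-instruction alternative computes, assuming `v` already holds one decoded nibble (0 to 15) per byte. The helper names (`maddubs_lane_0x0110`, `packus_lane`, `pack_nibbles`) are hypothetical; the code models the lane arithmetic of `_mm_maddubs_epi16` and `_mm_packus_epi16` rather than calling the intrinsics.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of
   _mm_maddubs_epi16(v, _mm_set1_epi16(0x0110)).
   In little-endian memory the constant 0x0110 is the byte pair {0x10, 0x01},
   so the even byte of v is multiplied by 16 and the odd byte by 1.
   The bytes of v are treated as unsigned, the constant's bytes as signed. */
static int16_t maddubs_lane_0x0110(uint8_t even, uint8_t odd) {
    int32_t sum = (int32_t)even * 0x10 + (int32_t)odd * 0x01;
    if (sum > INT16_MAX) sum = INT16_MAX; /* signed saturation; never hit here */
    return (int16_t)sum;
}

/* Scalar model of one lane of _mm_packus_epi16: clamp to [0, 255]. */
static uint8_t packus_lane(int16_t x) {
    if (x < 0) return 0;
    if (x > 255) return 255;
    return (uint8_t)x;
}

/* Hypothetical helper: combine 2 * n_out nibbles into n_out bytes,
   with the first nibble of each pair becoming the high half. */
static void pack_nibbles(const uint8_t *nibbles, size_t n_out, uint8_t *out) {
    for (size_t i = 0; i < n_out; i++) {
        out[i] = packus_lane(
            maddubs_lane_0x0110(nibbles[2 * i], nibbles[2 * i + 1]));
    }
}
```

    With nibble inputs the lane value is at most 15 * 16 + 15 = 255, so neither saturation step changes the result, and the output bytes stay in input order.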

    1. sasuke420 says:

      Well, now that I look at what I’ve posted, it looks like I am packing and bswapping at the same time, so I would need the shuffle anyway.
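      To illustrate the point about packing and byte-swapping in one step, here is a scalar sketch of the shift, OR, and shuffle sequence restricted to a single 128-bit lane. It assumes `e_0` holds one nibble (0 to 15) per byte, which the elided part of the snippet does not confirm; the function name `pack_and_reverse_lane` is hypothetical.

```c
#include <stdint.h>

/* Scalar model of one 128-bit lane of:
     f = _mm256_slli_epi16(e, 12);
     g = _mm256_or_si256(f, e);
     h = _mm256_shuffle_epi8(g, pack_odd);
   Assumption: every byte of e is a nibble value in 0..15.
   pack_and_reverse_lane is a hypothetical name for this sketch. */
static void pack_and_reverse_lane(const uint8_t e[16], uint8_t h[16]) {
    static const uint8_t pack_odd[16] = {15, 13, 11, 9, 7, 5, 3, 1,
                                         15, 13, 11, 9, 7, 5, 3, 1};
    uint8_t g[16];
    for (int i = 0; i < 8; i++) {
        /* Little-endian 16-bit lane: low byte is e[2i], high byte is e[2i+1]. */
        uint16_t lane = (uint16_t)(e[2 * i] | (e[2 * i + 1] << 8));
        /* Shifting by 12 moves the low byte's nibble into bits 12..15. */
        uint16_t f = (uint16_t)(lane << 12);
        uint16_t merged = (uint16_t)(f | lane);
        g[2 * i] = (uint8_t)(merged & 0xFF);
        g[2 * i + 1] = (uint8_t)(merged >> 8); /* (e[2i] << 4) | e[2i+1] */
    }
    /* All shuffle indices have the high bit clear, so this is a plain gather.
       Reading the odd bytes from index 15 down to 1 reverses their order. */
    for (int i = 0; i < 16; i++) {
        h[i] = g[pack_odd[i]];
    }
}
```

      The packed bytes end up in the odd positions of `g`, and the shuffle both gathers them and reverses their order, which a `maddubs` plus `packus` pair alone would not do.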