Daniel Lemire's blog


Transcoding UTF-8 strings to Latin 1 strings at 18 GB/s using AVX-512

12 thoughts on “Transcoding UTF-8 strings to Latin 1 strings at 18 GB/s using AVX-512”

  1. camel-cdr says:

    I wrote a RVV version:

    #include <stddef.h>
    #include <stdint.h>
    #include <riscv_vector.h>

    size_t convert_rvv(uint8_t *utf8, size_t len, uint8_t *latin1)
    {
        uint8_t *beg = latin1;
        uint8_t last = 0; /* last byte of the previously processed chunk */

        vuint8m4_t v, s;
        vbool2_t ascii, cont, leading, sleading;

        for (size_t vl, VL; len > 1; ) {
            VL = vl = __riscv_vsetvl_e8m4(len);

            v = __riscv_vle8_v_u8m4(utf8, vl);
            /* mask of bytes >= 0x80 (non-ASCII) */
            ascii = __riscv_vmsgtu_vx_u8m4_b2(v, 0x80 - 1, vl);
            if (__riscv_vfirst_m_b2(ascii, vl) < 0)
                goto skip; /* all-ASCII chunk: store it unchanged */

            /* shift elements up by one, pulling in the previous chunk's last byte */
            s = __riscv_vslide1up_vx_u8m4(v, last, vl);

            /* flag the two-byte lead positions in v and in the shifted copy s */
            leading = __riscv_vmsltu_vx_u8m4_b2(__riscv_vadd_vx_u8m4(v, 0b11000010, vl), 2, vl);
            sleading = __riscv_vmsltu_vx_u8m4_b2(__riscv_vadd_vx_u8m4(s, 0b11000010, vl), 2, vl);
            cont = __riscv_vmsne_vx_u8m4_b2(__riscv_vsrl_vx_u8m4(v, 6, vl), 0b10, vl);
            /* bail out if the lead/continuation structure is inconsistent */
            if (__riscv_vcpop_m_b2(__riscv_vmand_mm_b2(sleading, cont, vl), vl) != __riscv_vcpop_m_b2(sleading, vl) ||
                __riscv_vfirst_m_b2(__riscv_vmnor_mm_b2(ascii, __riscv_vmor_mm_b2(leading, cont, vl), vl), vl) >= 0)
                return 0;

            /* merge bit 0 of each byte, shifted up to bit 6, into its neighbour */
            s = __riscv_vor_vv_u8m4(__riscv_vsll_vx_u8m4(__riscv_vand_vx_u8m4(v, 1, vl), 6, vl), s, vl);
            s = __riscv_vmerge_vvm_u8m4(v, s, ascii, vl);

            /* compress out the unwanted bytes and count how many remain */
            v = __riscv_vcompress_vm_u8m4(s, cont, vl);
            vl = __riscv_vcpop_m_b2(cont, vl);
    skip:
            __riscv_vse8_v_u8m4(latin1, v, vl);
            latin1 += vl; utf8 += VL; len -= VL;
            last = utf8[-1];
        }

        return (size_t)(latin1 - beg);
    }
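
    Called on a whole buffer it would be used roughly like this (just a sketch: the sample string, the driver and the expected count are illustrative, and it needs an RVV-capable target):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "é" is 0xC3 0xA9 in UTF-8 and 0xE9 in Latin-1 */
        uint8_t utf8[] = "caf\xC3\xA9";
        uint8_t latin1[sizeof utf8];
        size_t n = convert_rvv(utf8, strlen((char *)utf8), latin1);
        if (n == 0) {
            puts("invalid input");
            return 1;
        }
        printf("%zu Latin-1 bytes\n", n); /* 4 bytes expected: 'c' 'a' 'f' 0xE9 */
        return 0;
    }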

    Results from a 2 GHz C920:

    SWAR: 0.419896 GiB/s
    RVV: 3.707396 GiB/s

    (I hope this isn’t a duplicate; I couldn’t tell whether the previous posting got through.)

    1. camel-cdr says:

      Looks like the formatting got a bit messed up; here is a Godbolt link: https://godbolt.org/z/6M7T938aE

  2. Thanks for sharing!

  3. -.- says:

    Isn’t this just a case of moving the bottom bit of the leading byte into bit 6 of the following byte, then stripping out all the leading bytes?

    __m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
    __mmask64 leading = _mm512_cmpge_epu8_mask(input, _mm512_set1_epi8(-64)); // bytes >= 0xC0, i.e. lead bytes
    __mmask64 bit6 = _mm512_mask_test_epi8_mask(leading, input, _mm512_set1_epi8(1)); // leads with bit 0 set (0xC3 on valid input)
    input = _mm512_mask_sub_epi8(input, (bit6<<1) | next_bit6, input, _mm512_set1_epi8(-64)); // add 0x40 to the byte after such a lead
    next_bit6 = bit6 >> 63; // carry the flag across the 64-byte block boundary
    _mm512_mask_compressstoreu_epi8((__m512i*)latin_output, ~leading, input); // drop the lead bytes; WARNING: bad on Zen4

    I tried putting it into the full code, and it appears to work: https://pastebin.com/Jbzm16pF
    I’m not sure whether the test cases can pick up all possible errors, though.
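
    In scalar terms the same idea is roughly the following (only a sketch of the bit manipulation, with a made-up function name and no validation at all):

    #include <stddef.h>
    #include <stdint.h>

    /* Move the lead byte's low bit into bit 6 of the following byte
       (equivalent to conditionally adding 0x40), then drop the lead byte.
       Anything below 0xC0 passes through unchanged; nothing is validated. */
    size_t utf8_to_latin1_scalar(const uint8_t *utf8, size_t len, uint8_t *latin1)
    {
        size_t out = 0;
        for (size_t i = 0; i < len; i++) {
            uint8_t b = utf8[i];
            if (b >= 0xC0 && i + 1 < len) {
                latin1[out++] = (uint8_t)(utf8[i + 1] | ((b & 1) << 6));
                i++; /* skip the continuation byte we just consumed */
            } else {
                latin1[out++] = b;
            }
        }
        return out;
    }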

    1. Except that my full code validates the input. I don’t think your code does…
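
      Roughly, validation here means that every byte must be plain ASCII or a 0xC2/0xC3 lead followed by a continuation byte; a scalar sketch of that check (not the code from the post, and the name is made up):

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Sketch of the acceptance condition for UTF-8 that fits in Latin-1. */
      bool is_latin1_convertible(const uint8_t *utf8, size_t len)
      {
          for (size_t i = 0; i < len; i++) {
              if (utf8[i] < 0x80)
                  continue; /* ASCII */
              if ((utf8[i] == 0xC2 || utf8[i] == 0xC3) &&
                  i + 1 < len && (utf8[i + 1] & 0xC0) == 0x80) {
                  i++; /* lead + continuation pair */
                  continue;
              }
              return false; /* anything else cannot be Latin-1 */
          }
          return true;
      }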

      1. Oh. I see that you have completed it. I will run benchmarks.

        1. Blog post updated. Great results.

          1. -.- says:

            Thanks for trying it out!

            1. I was estimating 0.5 instructions per byte for an optimized routine, but your approach is a tad better, which is amazing.

  4. Eggz says:

    Okay, Intel fanboy. Are you done? Can we shelve the complicated AVX-512 indoctrination completely now? Good heavens…

    1. AMD Zen 4 has superb AVX-512 support.

    2. -.- says:

      If you have a competitive alternative, you’re welcome to post it.