Daniel Lemire's blog


Transcoding UTF-8 strings to Latin 1 strings at 18 GB/s using AVX-512

12 thoughts on “Transcoding UTF-8 strings to Latin 1 strings at 18 GB/s using AVX-512”

  1. camel-cdr says:

    I wrote a RVV version:

    #include <stddef.h>
    #include <stdint.h>
    #include <riscv_vector.h>

    size_t convert_rvv(uint8_t *utf8, size_t len, uint8_t *latin1)
    {
        uint8_t *beg = latin1;
        uint8_t last = 0; /* last byte of the previously processed chunk */

        vuint8m4_t v, s;
        vbool2_t ascii, cont, leading, sleading;

        for (size_t vl, VL; len > 1; ) {
            VL = vl = __riscv_vsetvl_e8m4(len);

            v = __riscv_vle8_v_u8m4(utf8, vl);
            /* mask of bytes >= 0x80 (non-ASCII) */
            ascii = __riscv_vmsgtu_vx_u8m4_b2(v, 0x80 - 1, vl);
            if (__riscv_vfirst_m_b2(ascii, vl) < 0)
                goto skip; /* all-ASCII chunk: store it unchanged */

            /* shift elements up by one, pulling in the previous chunk's last byte */
            s = __riscv_vslide1up_vx_u8m4(v, last, vl);

            /* flag the two-byte lead positions in v and in the shifted copy s */
            leading = __riscv_vmsltu_vx_u8m4_b2(__riscv_vadd_vx_u8m4(v, 0b11000010, vl), 2, vl);
            sleading = __riscv_vmsltu_vx_u8m4_b2(__riscv_vadd_vx_u8m4(s, 0b11000010, vl), 2, vl);
            cont = __riscv_vmsne_vx_u8m4_b2(__riscv_vsrl_vx_u8m4(v, 6, vl), 0b10, vl);
            /* bail out if the lead/continuation structure is inconsistent */
            if (__riscv_vcpop_m_b2(__riscv_vmand_mm_b2(sleading, cont, vl), vl) != __riscv_vcpop_m_b2(sleading, vl) ||
                __riscv_vfirst_m_b2(__riscv_vmnor_mm_b2(ascii, __riscv_vmor_mm_b2(leading, cont, vl), vl), vl) >= 0)
                return 0;

            /* merge bit 0 of each byte, shifted up to bit 6, into its neighbour */
            s = __riscv_vor_vv_u8m4(__riscv_vsll_vx_u8m4(__riscv_vand_vx_u8m4(v, 1, vl), 6, vl), s, vl);
            s = __riscv_vmerge_vvm_u8m4(v, s, ascii, vl);

            /* compress out the unwanted bytes and count how many remain */
            v = __riscv_vcompress_vm_u8m4(s, cont, vl);
            vl = __riscv_vcpop_m_b2(cont, vl);
    skip:
            __riscv_vse8_v_u8m4(latin1, v, vl);
            latin1 += vl; utf8 += VL; len -= VL;
            last = utf8[-1];
        }

        return (size_t)(latin1 - beg);
    }
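
    Called on a whole buffer it would be used roughly like this (just a sketch: the sample string, the driver and the expected count are illustrative, and it needs an RVV-capable target):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "é" is 0xC3 0xA9 in UTF-8 and 0xE9 in Latin-1 */
        uint8_t utf8[] = "caf\xC3\xA9";
        uint8_t latin1[sizeof utf8];
        size_t n = convert_rvv(utf8, strlen((char *)utf8), latin1);
        if (n == 0) {
            puts("invalid input");
            return 1;
        }
        printf("%zu Latin-1 bytes\n", n); /* 4 bytes expected: 'c' 'a' 'f' 0xE9 */
        return 0;
    }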

    Results from a 2 GHz C920:

    SWAR: 0.419896 GiB/s
    RVV: 3.707396 GiB/s

    (I hope this isn’t a duplicate; I couldn’t tell whether the previous posting got through.)

    1. camel-cdr says:

      Looks like the formatting got a bit messed up; here is a Godbolt link: https://godbolt.org/z/6M7T938aE

  2. Thanks for sharing!

  3. -.- says:

    Isn’t this just a case of moving the bottom bit of the leading byte into bit 6 of the following byte, then stripping out all the leading bytes?

    __m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
    __mmask64 leading = _mm512_cmpge_epu8_mask(input, _mm512_set1_epi8(-64)); // bytes >= 0xC0, i.e. lead bytes
    __mmask64 bit6 = _mm512_mask_test_epi8_mask(leading, input, _mm512_set1_epi8(1)); // leads with bit 0 set (0xC3 on valid input)
    input = _mm512_mask_sub_epi8(input, (bit6<<1) | next_bit6, input, _mm512_set1_epi8(-64)); // add 0x40 to the byte after such a lead
    next_bit6 = bit6 >> 63; // carry the flag across the 64-byte block boundary
    _mm512_mask_compressstoreu_epi8((__m512i*)latin_output, ~leading, input); // drop the lead bytes; WARNING: bad on Zen4

    I tried putting it into the full code, and it appears to work: https://pastebin.com/Jbzm16pF
    I’m not sure whether the test cases can pick up all possible errors, though.
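
    In scalar terms the same idea is roughly the following (only a sketch of the bit manipulation, with a made-up function name and no validation at all):

    #include <stddef.h>
    #include <stdint.h>

    /* Move the lead byte's low bit into bit 6 of the following byte
       (equivalent to conditionally adding 0x40), then drop the lead byte.
       Anything below 0xC0 passes through unchanged; nothing is validated. */
    size_t utf8_to_latin1_scalar(const uint8_t *utf8, size_t len, uint8_t *latin1)
    {
        size_t out = 0;
        for (size_t i = 0; i < len; i++) {
            uint8_t b = utf8[i];
            if (b >= 0xC0 && i + 1 < len) {
                latin1[out++] = (uint8_t)(utf8[i + 1] | ((b & 1) << 6));
                i++; /* skip the continuation byte we just consumed */
            } else {
                latin1[out++] = b;
            }
        }
        return out;
    }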

    1. Except that my full code validates the input. I don’t think your code does…
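
      Roughly, validation here means that every byte must be plain ASCII or a 0xC2/0xC3 lead followed by a continuation byte; a scalar sketch of that check (not the code from the post, and the name is made up):

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Sketch of the acceptance condition for UTF-8 that fits in Latin-1. */
      bool is_latin1_convertible(const uint8_t *utf8, size_t len)
      {
          for (size_t i = 0; i < len; i++) {
              if (utf8[i] < 0x80)
                  continue; /* ASCII */
              if ((utf8[i] == 0xC2 || utf8[i] == 0xC3) &&
                  i + 1 < len && (utf8[i + 1] & 0xC0) == 0x80) {
                  i++; /* lead + continuation pair */
                  continue;
              }
              return false; /* anything else cannot be Latin-1 */
          }
          return true;
      }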

      1. Oh. I see that you have completed it. I will run benchmarks.

        1. Blog post updated. Great results.

          1. -.- says:

            Thanks for trying it out!

            1. I was estimating 0.5 instructions per byte for an optimized routine, but your approach is a tad better, which is amazing.

  4. Eggz says:

    Okay, Intel fanboy. Are you done? Can we shelve the complicated AVX-512 indoctrination completely now? Good heavens…

    1. AMD Zen 4 has superb AVX-512 support.

    2. -.- says:

      If you have a competitive alternative, you’re welcome to post it.