Daniel Lemire's blog


Validating UTF-8 strings using as little as 0.7 cycles per byte

16 thoughts on “Validating UTF-8 strings using as little as 0.7 cycles per byte”

  1. KWillets says:

    The counter-rolling can actually be done logarithmically by shifting 1, 2, 4, etc., e.g.:

    [4,0,0,0] + ([0,4,0,0]-[1,1,1,1]) = [4,3,0,0]

    [4,3,0,0] + ([0,0,4,3]-[2,2,2,2]) = [4,3,2,1]

    but in this case the distances didn’t seem big enough to beat the linear method.

    The distances can even be larger than the register size, I believe, if the last value in the register is carried over to the first element of the next. It’s a good way to delineate inline variable-length encodings. A sketch of the doubling step is below.
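
    In SSE intrinsics, a minimal version of the idea might look like the following (my own illustration using saturating subtraction and an unsigned max, not the code from the repository). Two doubling steps suffice because UTF-8 counters never exceed 4:

    #include <emmintrin.h> // SSE2

    // Turn, e.g., [4,0,0,0,...] into [4,3,2,1,...]: each step shifts the
    // counters over by 1 and then 2 byte positions, decrements them by the
    // shift distance (saturating at zero), and keeps the per-byte maximum.
    static __m128i roll_counters(__m128i counts) {
        counts = _mm_max_epu8(counts,
            _mm_subs_epu8(_mm_slli_si128(counts, 1), _mm_set1_epi8(1)));
        counts = _mm_max_epu8(counts,
            _mm_subs_epu8(_mm_slli_si128(counts, 2), _mm_set1_epi8(2)));
        return counts;
    }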

    1. KWillets says:

      We just merged a PR with this, and it is indeed faster.

  2. Christopher Chang says:

    For more fun, combine unaligned loads with the _mm{256}_movemask_epi8() intrinsic, which lets you rapidly seek to the next non-ASCII character when there is one, or validate that there aren’t any. A sketch is below.
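
    A minimal sketch of that seek (my illustration with the 128-bit variant; __builtin_ctz is the GCC/Clang builtin for counting trailing zero bits):

    #include <emmintrin.h> // SSE2
    #include <stddef.h>

    // Returns the index of the first byte with its high bit set,
    // or len if the whole string is ASCII.
    static size_t first_non_ascii(const unsigned char *src, size_t len) {
        size_t i = 0;
        for (; i + 16 <= len; i += 16) {
            __m128i block = _mm_loadu_si128((const __m128i *)(src + i));
            int mask = _mm_movemask_epi8(block); // one bit per byte's sign bit
            if (mask != 0) // locate the first offending byte
                return i + (size_t)__builtin_ctz((unsigned)mask);
        }
        for (; i < len; i++) // scalar tail
            if (src[i] >= 0x80)
                return i;
        return len;
    }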

    1. Christopher Chang says:

      (On second thought, it might be better to stick to aligned loads; having less special-case code for the end of the string may be a bigger deal than the special-case code for a UTF-8 code point that crosses a vector boundary.)

    2. KWillets says:

      I have something similar to that in my version: I look for counters that go off the end, and do the next unaligned load at the start of that character instead of carrying intermediate state over. Lemire found an elegant way of shifting in the needed bytes from the previous block with _mm_alignr_epi8 (sketched below), which is likely faster and keeps the 16-byte stride.

      There’s also a slight rewrite I did to examine all 5 bits of the initial bytes; it turns out that splitting into ASCII/non-ASCII on the high bit, and then doing the mapping on the next 4 bits of the non-ASCII characters, lets us cover all 5 bits correctly and take the all-ASCII shortcut based on movemask. I’ll see if I can check this into the repo.
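
      The alignr trick looks roughly like this (my own illustration, not the exact code from the repository):

      #include <tmmintrin.h> // SSSE3

      // Concatenates prev:current and extracts 16 bytes starting at offset 15:
      // the last byte of the previous block followed by the first 15 bytes of
      // the current one, i.e. the current block "shifted back" by one byte.
      static __m128i previous1(__m128i prev, __m128i current) {
          return _mm_alignr_epi8(current, prev, 15);
      }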

  3. Claude says:

    and why not:

    #include <smmintrin.h> // SSE4.1, for _mm_testz_si128
    #include <string.h>

    size_t fast_validate_ascii(unsigned char *src, long len) {
        __m128i minusbyte = _mm_set1_epi8(0x80);
        __m128i current_bytes = _mm_setzero_si128();

        size_t i = 0;
        for (; i + 15 < len; i += 16) {
            // load a 16-byte section; the loop guard guarantees 16 bytes remain
            current_bytes = _mm_loadu_si128((const __m128i *)(src + i));
            if (!_mm_testz_si128(minusbyte, current_bytes))
                return i;
        }

        // last partial block, zero-padded
        if (i < len) {
            char buffer[16];
            memset(buffer, 0, 16);
            memcpy(buffer, src + i, len - i);
            current_bytes = _mm_loadu_si128((const __m128i *)buffer);
            if (!_mm_testz_si128(minusbyte, current_bytes))
                return i;
        }

        return -1;
    }
    We test the sign bit of each byte directly. If all the characters are ASCII, we return -1; otherwise we return the position of the 16-byte zone in which a UTF-8 byte may be hiding.

  4. Claude says:

    Oops… I forgot that size_t is unsigned, so returning -1 does not do what I intended…

    #include <smmintrin.h> // SSE4.1, for _mm_testz_si128
    #include <string.h>

    long fast_validate_ascii(unsigned char *src, long len) {
        __m128i minusbyte = _mm_set1_epi8(0x80);
        __m128i current_bytes = _mm_setzero_si128();

        long i = 0;
        for (; i + 15 < len; i += 16) {
            // load a 16-byte section; the loop guard guarantees 16 bytes remain
            current_bytes = _mm_loadu_si128((const __m128i *)(src + i));
            if (!_mm_testz_si128(minusbyte, current_bytes))
                return i;
        }

        // last partial block, zero-padded
        if (i < len) {
            char buffer[16];
            memset(buffer, 0, 16);
            memcpy(buffer, src + i, len - i);
            current_bytes = _mm_loadu_si128((const __m128i *)buffer);
            if (!_mm_testz_si128(minusbyte, current_bytes))
                return i;
        }

        return -1;
    }

    1. Your code is fine, but it is slower if you expect the content to be valid ASCII…

      validate_ascii_fast(data, N)                                    :  0.086 cycles per operation (best)    0.087 cycles per operation (avg)
      clauderoux_validate_ascii(data, N)                              :  0.106 cycles per operation (best)    0.106 cycles per operation (avg)
      

      So it becomes a data-dependent engineering issue.
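
      For reference, here is a sketch of the accumulate-then-test pattern in the spirit of validate_ascii_fast (my illustration here, not its exact code): the bitwise ORs keep any branch, and the test itself, out of the hot loop, and the sign bits are checked once at the end.

      #include <emmintrin.h> // SSE2
      #include <stddef.h>

      static int is_ascii_sse2(const unsigned char *src, size_t len) {
          __m128i acc = _mm_setzero_si128();
          size_t i = 0;
          for (; i + 16 <= len; i += 16) // no data-dependent branch here
              acc = _mm_or_si128(acc,
                                 _mm_loadu_si128((const __m128i *)(src + i)));
          int mask = _mm_movemask_epi8(acc); // accumulated sign bits
          for (; i < len; i++) // scalar tail
              mask |= src[i] >> 7;
          return mask == 0;
      }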

      1. Claude says:

        I see… That’s quite fascinating. It means that _mm_testz_si128 is much slower than doing a “gt” comparison and a bitwise OR…

        Which is quite surprising, since all this instruction does is a bitwise AND and then a check against zero.

        Still, more than 10 times slower is very weird…

        However, I implemented this routine in my own code, and I get a 25% improvement when checking whether a string is UTF-8 and when converting it from UTF-8 (or various Latin code tables) to Unicode…
        It also works on Mac OS, my main development platform. I haven’t tested on Windows yet, but I expect the same results…

        1. According to Agner Fog, the relevant instruction (ptest) counts for two μops and has a latency of 3 cycles. Note that you are also adding an extra branch (one that depends on a high-latency instruction).

          Still, more than 10 times slower is very weird…

          What is 10 times slower than what?

          1. Claude says:

            Ahem… I misread the numbers in a rather ridiculous way…

  5. Visitor says:

    This performs pretty badly if the first character of a large string is invalid UTF-8, because it still scans the whole string to the end.

    1. This performs pretty badly if the first character of a large string is invalid UTF-8, because it still scans the whole string to the end.

      The blog post assumes that most of your data is valid UTF-8 and that invalid input is the exceptional case.

      If you expect the data to be frequently invalid UTF-8, then you need to proceed differently: for example, validate in chunks and stop at the first bad one, as sketched below.
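
      A minimal sketch of that chunked early exit, assuming a block validator with the signature bool validate_utf8_fast(const char *, size_t) (named after the post’s routine; any whole-buffer validator works). A real implementation must not split a multi-byte code point across chunks, which the sketch handles by cutting each chunk back to a code-point boundary:

      #include <stdbool.h>
      #include <stddef.h>

      bool validate_utf8_fast(const char *src, size_t len); // block validator

      static bool validate_utf8_chunked(const char *src, size_t len) {
          const size_t chunk = 4096;
          size_t i = 0;
          while (i < len) {
              size_t n = (len - i < chunk) ? (len - i) : chunk;
              if (i + n < len) {
                  // back off past continuation bytes (10xxxxxx) so the chunk
                  // ends on a code-point boundary
                  while (n > 0 && ((unsigned char)src[i + n] & 0xC0) == 0x80)
                      n--;
              }
              // n == 0 means thousands of continuation bytes in a row: invalid
              if (n == 0 || !validate_utf8_fast(src + i, n))
                  return false;
              i += n;
          }
          return true;
      }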

  6. Antoine says:

    SIMD does not seem to be required for the ASCII validation function. The following non-SIMD version is almost as fast here (~0.08 cycles per operation on a Ryzen 7 with gcc 7.3.0):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static bool validate_ascii_fast(const char *src, size_t len) {
        const char *end = src + len;
        uint64_t mask1 = 0, mask2 = 0, mask3 = 0, mask4 = 0;

        // OR together 32 bytes per iteration, in four independent accumulators
        for (; src + 32 <= end; src += 32) {
            const uint64_t *p = (const uint64_t *)src;
            mask1 |= p[0];
            mask2 |= p[1];
            mask3 |= p[2];
            mask4 |= p[3];
        }
        for (; src + 8 <= end; src += 8) {
            mask1 |= *(const uint64_t *)src;
        }
        uint8_t tail_mask = 0;
        for (; src < end; src++) {
            tail_mask |= *(const uint8_t *)src;
        }
        uint64_t final_mask = mask1 | mask2 | mask3 | mask4 | tail_mask;
        return !(final_mask & 0x8080808080808080ULL); // any high bit => non-ASCII
    }

    1. To prevent the compiler from using SIMD instructions, you need to insert a function attribute or disable it with a compiler flag. If you do that, I think you will find that SIMD instructions are indeed essential to get the best performance.

      https://github.com/lemire/fastvalidate-utf-8/issues/15#issuecomment-430637860
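
      For instance, a minimal sketch of the function-attribute approach, assuming GCC (-fno-tree-vectorize is the command-line equivalent):

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      // GCC-specific: forbid auto-vectorization of this one function so that
      // the scalar code is genuinely compared against the hand-written SIMD.
      __attribute__((optimize("no-tree-vectorize")))
      static bool validate_ascii_scalar(const char *src, size_t len) {
          uint64_t mask = 0;
          for (size_t i = 0; i < len; i++)
              mask |= (unsigned char)src[i]; // accumulate every byte
          return !(mask & 0x80); // a high bit set means non-ASCII was seen
      }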

  7. Sirmabus says:

    Thank you very much!

    Brilliant work. The internet didn’t fail me today when I went looking for fast UTF-8 string validation.

    And poor guy, some of the comments on your other UTF-8 blog posts are funny. You express (and prove) your answers logically and mathematically, yet people keep countering with poorly-thought-out arguments 😛

    Very much appreciated.