Daniel Lemire's blog


Computing the UTF-8 size of a Latin 1 string quickly (AVX edition)

7 thoughts on “Computing the UTF-8 size of a Latin 1 string quickly (AVX edition)”

  1. “it suffices to count two output bytes for each input byte having the most significant bit set, and one output byte for other bytes.”

    This works until your HTML5-compliant WWW browser (and derived applications) treats ISO-8859-1 as Windows-1252.
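    For reference, the quoted rule amounts to a scalar loop along the following lines (a sketch with an illustrative name, not necessarily the post’s exact scalar code), and it is only valid when the input really is ISO-8859-1:

    #include <cstddef>
    #include <cstdint>

    // UTF-8 size of a Latin-1 (ISO-8859-1) buffer: bytes below 0x80 stay one
    // byte; bytes with the most significant bit set expand to two bytes.
    size_t utf8_length_from_latin1(const uint8_t *s, size_t len) {
        size_t answer = len;         // at least one output byte per input byte
        for (size_t i = 0; i < len; i++) {
            answer += s[i] >> 7;     // one extra byte when the high bit is set
        }
        return answer;
    }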

  2. Rici Lake says:

    Windows-1252 contains a number of characters whose codes exceed the 11-bit limit for two-octet UTF-8. Common example: €, the Euro sign. (There are others.)

    ISO-8859-1 only contains codepoints with one- and two-octet UTF-8 sequences, but you’ll find pages labelled as Latin1 which are actually Windows-1252 or even ISO-8859-15 (which also contains €).
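    As a concrete check (a standalone snippet, not from the post): U+20AC needs 14 significant bits, beyond the 11-bit reach of two-octet UTF-8, so it encodes as the three bytes E2 82 AC, one more than the Latin-1 assumption accounts for.

    #include <cstdio>

    int main() {
        // u8"\u20ac" is the UTF-8 encoding of the Euro sign: E2 82 AC.
        printf("%zu\n", sizeof(u8"\u20ac") - 1);  // prints 3
    }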

    1. I stand corrected.

      Even so, if you are incorrectly identifying Windows-1252 as Latin 1, then you simply cannot convert it to Unicode properly, and there is no sure way to detect and correct a mislabel of the sort.

      1. Stefan Kanthak says:

        You can’t and MUST NOT trust a “charset=ISO-8859-1” designation; if your code is used to determine the size of buffer allocations, this can result in buffer overruns. Who will assign a CVE then?

        About 28 years ago, when adding MIME support to Windows, some imbeciles at Microsoft assigned the Codepage “CP_1252” to “charset=ISO-8859-1”:

        [HKEY_LOCAL_MACHINE\SOFTWARE\Classes\MIME\DataBase\CharSet\iso-8859-1]
        "CodePage"=dword:000004e4

        (The dword value 0x4e4 is 1252 in decimal, i.e. Windows-1252.) From then on, every piece of (not just) their crap, like FrontPage or Word, that composes mail or HTML pages AND relies on the Windows MIME database labels text encoded with CP_1252 as ISO-8859-1 (the other “charset=ISO-8859-” variants also have wrong codepages assigned).
        Years later, Microsoft registered several “charset=Windows-” names with IANA, but never fixed their wrong database, a REAL shame.
        When the W3C standardized HTML5, to avoid misrendering HTML pages generated with Microsoft crap, it specified that ISO-8859-1 must be handled as Windows-1252.

  3. Yaffle says:

    Do you know what performance those functions will have with 64 MB strings (when the CPU cache is much smaller than the string length)?

    1. If you are dealing with large strings, you probably want to fragment the problem: process substrings. See this blog post where I explain that working with large inputs all at once is a performance anti-pattern: https://lemire.me/blog/2021/08/21/the-big-load-anti-pattern/
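      For what it’s worth, here is a minimal sketch of that chunking idea; the function name utf8_length_latin1 and the 64 KiB chunk size are illustrative assumptions, and the approach works because the count is additive over substrings:

      #include <cstddef>
      #include <cstdint>

      // Stand-in for any of the routines discussed (scalar, AVX2, SWAR).
      size_t utf8_length_latin1(const uint8_t *data, size_t len);

      // Process a huge input in cache-sized pieces; the partial counts add up.
      size_t utf8_length_latin1_chunked(const uint8_t *data, size_t len) {
          const size_t chunk = 64 * 1024;  // arbitrary cache-friendly size
          size_t total = 0;
          for (size_t i = 0; i < len; i += chunk) {
              size_t n = (len - i < chunk) ? (len - i) : chunk;
              total += utf8_length_latin1(data + i, n);
          }
          return total;
      }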

  4. NRK says:

    Great read as usual!

    Out of curiosity, I went ahead and wrote a SWAR version using uint_fast32_t. Going by cycles/byte – it’s about 3.5x faster than the autovectorized scalar version on my zen2 machine.

    #include <cstdint>   // uint8_t, uint_fast32_t, uint64_t
    #include <cstring>   // memcpy
    #include <cstddef>   // size_t
    #include <bit>       // std::popcount (C++20)

    size_t swar_utf8_len(const uint8_t *s, size_t len) {
        typedef uint_fast32_t V;
        static_assert(sizeof(V) <= sizeof(uint64_t));
        // Repeat 0x80 in every byte of V. Shifting by 16 twice (rather than by
        // 32) avoids undefined behaviour when V is only 32 bits wide, and still
        // fills the upper half when V is 64 bits wide.
        V mask = V{0x80808080} | (V{0x80808080} << 16 << 16);
        size_t answer = 0;
        size_t swar_rounds = len / sizeof(V);
        size_t swar_bytes = swar_rounds * sizeof(V);
        for (size_t i = 0; i < swar_bytes; i += sizeof(V)) {
            V v;
            memcpy(&v, s + i, sizeof v);
            v &= mask;                   // keep only the high bit of each byte
            answer += std::popcount(v);  // one extra output byte per set bit
        }
        // scalar_utf8_length (from the blog post) finishes the leftover tail.
        return len + answer + scalar_utf8_length(s + swar_bytes, len - swar_bytes);
    }
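    A quick call-site sketch, assuming the function above (and the post’s scalar_utf8_length) are in scope; the bytes 0xE9 and 0xFF each have the high bit set, so each adds one byte to the UTF-8 length:

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Latin-1 bytes for "aébÿ": 4 input bytes, 2 with the high bit set,
        // so the UTF-8 size is 4 + 2 = 6.
        const uint8_t text[] = {'a', 0xE9, 'b', 0xFF};
        printf("%zu\n", swar_utf8_len(text, sizeof text));  // prints 6
    }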

    (Note that std::popcount requires C++20 and the <bit> header.)

    A 3.5x speedup from just a dozen lines of (mostly) portable code seems like a good deal!