How quickly can you check that a string is valid unicode (UTF-8)?
20 thoughts on “How quickly can you check that a string is valid unicode (UTF-8)?”
The multibyte sequences basically start (in the high bits) with a unary count of the bytes to follow, and the following bytes all start with 10. So if each byte knows the 3 bytes to its right, it can figure out what it should look like if it is the initial byte of a sequence. A simple bitmap is sufficient to build this branchlessly (the sequence must be contiguous, i.e., the first non-10 byte ends the frame, which is why a bitmap is needed).
A rough idea:
- Mask in all 10xxxxxx bytes, mapping them to 0x01 and everything else to 0x00.
- Shift left twice and concatenate the low bits, e.g. 0x0101 => 0x03 (not easy in SSE).
- Mask off the 3-bit maps and map them to the correct unary, e.g. 101 => 100.
- In byte positions that really are initial, compare the constructed unary with the original input.
There are a few other parts to look for ascii and check ranges, but the multibyte structure seems solvable with this method.
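For what it's worth, here is a rough scalar sketch of that structural check (my own illustration, not the SSE version, and the names are made up); the range and overlong checks are left out, as noted:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Expected leading-byte bits and masks for 0..3 trailing continuation bytes:
// 0 -> 0xxxxxxx, 1 -> 110xxxxx, 2 -> 1110xxxx, 3 -> 11110xxx.
static const uint8_t expected_lead[4] = {0x00, 0xC0, 0xE0, 0xF0};
static const uint8_t lead_mask[4]     = {0x80, 0xE0, 0xF0, 0xF8};

// Structural check only: are ASCII, continuation and initial bytes in the
// right places, with the right run lengths?
static bool check_structure(const uint8_t *s, size_t n) {
    size_t i = 0;
    while (i < n) {
        if ((s[i] & 0xC0) == 0x80)
            return false;                 // stray continuation byte
        size_t k = 0;                     // contiguous 10xxxxxx bytes to the right
        while (k < 3 && i + 1 + k < n && (s[i + 1 + k] & 0xC0) == 0x80)
            k++;
        if ((s[i] & lead_mask[k]) != expected_lead[k])
            return false;                 // constructed unary does not match
        i += 1 + k;
    }
    return true;
}
```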
I think something like what you describe can be worked out.
Another approach is sort of the reverse of that: translate the upper 4 bits to a following-byte count and roll it to the right with a saturating subtract. The output is nonzero where 10xxxxxx bytes should be.
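A scalar sketch of that direction, as I read it (the helper names are made up; the real thing would do the shift and saturating subtract in SIMD):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Following-byte count derived from the upper 4 bits of a byte.
static int following_count(uint8_t b) {
    switch (b >> 4) {
        case 0xC: case 0xD: return 1;  // 110xxxxx -> one continuation follows
        case 0xE:           return 2;  // 1110xxxx -> two continuations follow
        case 0xF:           return 3;  // 11110xxx -> three continuations follow
        default:            return 0;  // ASCII or continuation byte
    }
}

// "Roll right with saturating subtract": carry the count to the right,
// decrementing by one per position; the carry is nonzero exactly where a
// 10xxxxxx continuation byte is required.
static bool check_continuations(const uint8_t *s, size_t n) {
    int carry = 0;
    for (size_t i = 0; i < n; i++) {
        bool is_cont = (s[i] & 0xC0) == 0x80;
        if (carry > 0) {
            if (!is_cont) return false;   // expected a continuation byte here
            carry--;
        } else {
            if (is_cont) return false;    // continuation byte with no initial
            carry = following_count(s[i]);
        }
    }
    return carry == 0;                    // no sequence left hanging at the end
}
```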
That’s a hopeful direction!
I emailed you some code, but for others here, I implemented this method and got about 16 SSE instructions to validate that the right types of bytes (ascii, continuation, 2-3-4 byte initials) are in the right places, including the lengths of continuations.
The remaining parts are overlong sequences (basically numbers that should be encoded in fewer bytes, e.g. a 7-bit number encoded in a 2-byte sequence) and certain excluded ranges.
I haven’t benchmarked yet, but the work so far looks like about 1 cycle/byte. The remaining checks may push that up quite a bit however.
Yes. Thanks for the code. We’ll hopefully discuss it some more in the near future.
valid unicode byte ranges in decimal
converted with [hex2dec.py](https://stackoverflow.com/a/43545026/10440128)
First Byte | Second Byte | Third Byte | Fourth Byte
--- | --- | --- | ---
[0,127] | | |
[194,223] | [128,191] | |
224 | [160,191] | [128,191] |
[225,236] | [128,191] | [128,191] |
237 | [128,159] | [128,191] |
[238,239] | [128,191] | [128,191] |
240 | [144,191] | [128,191] | [128,191]
[241,243] | [128,191] | [128,191] | [128,191]
244 | [128,143] | [128,191] | [128,191]
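To make the table concrete, here is one way it could be turned into a simple (and slow) reference validator; the struct layout and names are mine, but the ranges are exactly the rows above, so overlongs, surrogates and code points beyond U+10FFFF are rejected by the ranges alone:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// One row of the table above: inclusive [lo,hi] bounds per byte position.
typedef struct { uint8_t lo[4], hi[4]; int len; } utf8_row;

static const utf8_row rows[] = {
    {{0},                  {127},                1},
    {{194, 128},           {223, 191},           2},
    {{224, 160, 128},      {224, 191, 191},      3},
    {{225, 128, 128},      {236, 191, 191},      3},
    {{237, 128, 128},      {237, 159, 191},      3},
    {{238, 128, 128},      {239, 191, 191},      3},
    {{240, 144, 128, 128}, {240, 191, 191, 191}, 4},
    {{241, 128, 128, 128}, {243, 191, 191, 191}, 4},
    {{244, 128, 128, 128}, {244, 143, 191, 191}, 4},
};

static bool validate_utf8_by_table(const uint8_t *s, size_t n) {
    size_t i = 0;
    while (i < n) {
        bool matched = false;
        for (size_t r = 0; r < sizeof(rows) / sizeof(rows[0]); r++) {
            if (s[i] < rows[r].lo[0] || s[i] > rows[r].hi[0])
                continue;                      // first byte picks the row
            int len = rows[r].len;
            if (i + (size_t)len > n)
                return false;                  // truncated sequence
            for (int j = 1; j < len; j++)
                if (s[i + j] < rows[r].lo[j] || s[i + j] > rows[r].hi[j])
                    return false;              // byte out of range for this row
            i += (size_t)len;
            matched = true;
            break;
        }
        if (!matched)
            return false;                      // first byte matches no row
    }
    return true;
}
```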
Damnit, my thoughts exactly. I guess that means you must have the right idea.
If char is unsigned, as permitted, does the function is_ascii still work?
My code uses a different convention (signed char) than my prose and it can be confusing.
If you rewrite the function so that it takes unsigned char inputs, as is assumed in the text, then you need to change the check to something like “is the value greater than or equal to 128”.
Otherwise it is all equivalent from a practical point of view.
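For what it's worth, a version that does not depend on the signedness of plain char could look like this (the name and signature are my guess at the function under discussion, not necessarily the exact one from the post):

```c
#include <stdbool.h>
#include <stddef.h>

// Cast each byte to unsigned char so the comparison behaves the same
// whether plain char is signed or unsigned on the target.
static bool is_ascii(const char *s, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)s[i] >= 128)
            return false;
    }
    return true;
}
```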
This isn’t about rewriting the function to take unsigned char inputs, this is about the fact that the plain “char” inputs in your code snippet may be unsigned already (whether “char” is the same as “signed char” or “unsigned char” is not guaranteed by the standard; it is left up to the implementation in question).
So your function only works some of the time (i.e. on systems+compilers where “char” is “signed char”) and fails on others (i.e. on systems+compilers where “char” is “unsigned char”).
If you want to be pedantic then there is no reason to believe that the system is using two’s complement, which I assume… and, of course, my code assumes support for SSE 4.2 instructions which are in no way provided by the standard. I also use inline assembly specific to recent x86 processors. I also assume that chars are octets, which is not something specified in the standard.
Don’t be so defensive, there’s no need. Someone asked a question (“will your code work where characters are unsigned”) and you misinterpreted the question and answered some other question. I clarified the question for you; no one’s trying to pick your code apart and be pedantic…
And there are plenty of systems in use where chars are unsigned. PowerPC is one of them. Also gcc has an “-funsigned-char” option that makes chars unsigned everywhere.
But that’s not the point…
That’s a reasonable point of view.
To do an initial check to see if the entire string is ascii, couldn’t you do a bitwise operation to check if all bytes match the pattern 0xxxxxxx? Like run one operation across all of them at once, so they are checking themselves (so to speak) as well as the provided mask? Or am I talking magic talk? (Which isn’t uncommon for when my limited understanding of logic meets my even more limited understanding of lower level programming.)
You are correct. What you are describing is likely faster if you expect the string to be ASCII most of the time.
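Something along these lines, I believe (the names are made up): OR all the bytes together and test the accumulated high bit once.

```c
#include <stdbool.h>
#include <stddef.h>

// OR every byte into an accumulator; the string is pure ASCII exactly
// when the accumulated high bit is still zero.
static bool is_ascii_or(const unsigned char *s, size_t len) {
    unsigned char acc = 0;
    for (size_t i = 0; i < len; i++)
        acc |= s[i];
    return acc < 128;
}
```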
Goffart’s code uses pmovmskb to move a mask of the high bits of each char into an int for this reason. Since it’s already keyed off of the high bit, it’s one instruction.
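For 16 bytes at a time that is roughly the following (a sketch with my own tail handling; the actual code may differ):

```c
#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>

// _mm_movemask_epi8 (pmovmskb) gathers the high bit of each of the 16
// bytes into an int, so a block is pure ASCII exactly when that mask is 0.
static bool is_ascii_sse(const unsigned char *s, size_t len) {
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        if (_mm_movemask_epi8(chunk) != 0)
            return false;
    }
    for (; i < len; i++)           // scalar tail
        if (s[i] >= 128)
            return false;
    return true;
}
```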
I haven’t tried running your code, but checking for valid UTF-8 isn’t simply a matter of checking the byte ranges like you said above, because there are more rules, such as requiring that the bytes for a code point use the shortest possible encoding.
For example, although C0 9F is a correct 2-byte sequence according to that description, it’s not actually valid UTF-8 because the value it encodes (U+001F) should be represented as the single byte 1F, which is the shortest form. Encoding code points in the surrogate range is also prohibited. For example, ED A0 81 is also invalid because it decodes to D801, which is a surrogate.
Thanks for your comment.
0xC0 < 0xC2.
I didn’t check too closely, but the overlong sequences are all defined by having 0’s in their high bits down to the next smaller bit width; e.g., bits 7 through 10 of the 11-bit 2-byte encoding should have at least one 1 bit, and that translates to a minimum of 0xC2 in the 2-byte initial byte. So minimum thresholds (sometimes carried across 2 bytes) are sufficient to prevent overlongs.
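In code form, those thresholds are just per-first-byte bounds on the second byte (hex values, consistent with the decimal table earlier; the helper name is made up):

```c
#include <stdbool.h>
#include <stdint.h>

// Bounds on the second byte, keyed on the first byte: these alone rule out
// overlong 3- and 4-byte sequences, surrogates, and code points above
// U+10FFFF (2-byte overlongs are handled by requiring the first byte to be
// at least 0xC2).
static bool second_byte_ok(uint8_t first, uint8_t second) {
    switch (first) {
        case 0xE0: return second >= 0xA0 && second <= 0xBF;  // no overlong 3-byte
        case 0xED: return second >= 0x80 && second <= 0x9F;  // exclude surrogates
        case 0xF0: return second >= 0x90 && second <= 0xBF;  // no overlong 4-byte
        case 0xF4: return second >= 0x80 && second <= 0x8F;  // cap at U+10FFFF
        default:   return second >= 0x80 && second <= 0xBF;  // ordinary continuation
    }
}
```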
The UTF-8 wiki page has a clearer table of bit patterns and forbidden sequences. https://en.wikipedia.org/wiki/UTF-8. One thing they note is that the encoded sequences sort lexicographically to the same order as the integers they encode, so ranges, maxima, etc. translate across as well.