Validating UTF-8 bytes (Java edition)

14 thoughts on “Validating UTF-8 bytes (Java edition)”

  1. Travis Downs says:

    In principle, the (single-stream) DFA approach is limited by the L1 latency: the lookups it does in the transition table are all serially dependent from one byte to the next. This applies only to the transition lookup, though (the outermost one): the other two loads (reading a byte from the string, and mapping the byte value to its class) can be done in parallel and don’t contribute directly to the dependency chain.

    In this context the L1 latency on Intel is 5 cycles, plus one cycle for the addition operation (the s + ... part of calculating the transition lookup), so the “speed limit” here is 6 cycles. Since there is only a small amount of additional work to do within those 6 cycles (two more loads, basically), one might expect this “speed limit” and the measured speed to be very close.

    However, you measure 7.5 cycles, as do I on my local system. Where are the extra 1.5 cycles coming from?

    It turns out that the form of the indexing in the loop is very important. Java is able to eliminate bounds checks on static final arrays where it can prove that the index values are always in bounds. So restricting the index to 0–255 with & 0xFF removes the bounds check, as long as the array has at least 256 elements. This is news to me (I learned about it from a comment in the source), and pretty cool. This was only working for the first lookup (byte value to class), but not for the second, so the JIT puts in a lot of bounds checking (in theory it should be only a couple of instructions, but in practice it ended up being a lot more due to some sub-optimal code generation). You can see the effect with -prof perfasm in JMH.

    If you move the & 0xFF to the outside of the indexing calculation, so that it is the outermost operation (see the sketch below), this gets fixed, and I measure about 5.96 cycles now – right at our speed limit. In fact, slightly less than our speed limit. How is that possible? Well, a single call is itself a long dependency chain, but some overlap is possible between calls to isUTF8: when you get near the end of one string, you might actually start on the next string, so it’s possible to break the speed limit slightly due to that effect. I’m testing 944-byte strings (N=191), so the effect is still noticeable, but it goes to zero as your strings get longer. Just another pitfall to watch out for in benchmarking.
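
    The original pre-fix expression isn’t quoted here, so this is a sketch of the fixed form only, assuming the utf8d_toclass and utf8d_transition tables from the post (the same ones used in the two-stream code below):

    public static boolean isUTF8(byte[] b) {
        // The outer & 0xFF bounds the index to 0..255, so the JIT can drop
        // the bounds check on utf8d_transition (at least 256 entries).
        // Only the transition lookup sits on the byte-to-byte dependency
        // chain; the byte load and the class lookup run in parallel with it.
        int s = 0;
        for (int i = 0; i < b.length; i++) {
            s = utf8d_transition[(s + utf8d_toclass[b[i] & 0xFF]) & 0xFF];
        }
        return s == 0;
    }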

    The DFA approach has lots of room left to execute instructions but is limited by the dependency chain. One “fix” is to run more than one DFA at once (it is in this context that I said “single-stream” above): you could start one DFA at the beginning of the string and another halfway through, and run both in the same loop. If the JIT doesn’t go crazy it should be almost twice as fast.

    You can do more than two streams, but it will run out of steam pretty quickly: when you run out of registers, or when you bottleneck on some other execution resource.

    1. Travis Downs says:

      Here’s an example of the two-stream DFA approach:

      public static boolean isUTF8_double(byte[] b) {
          int length = b.length, half = length / 2;
          // Back half up to a character boundary: bytes 0x80–0xBF are
          // continuation bytes, which as signed values are <= (byte) 0xBF.
          // (Testing half > 0 first also keeps an empty array safe.)
          while (half > 0 && b[half] <= (byte) 0xBF) {
              half--;
          }
          int s1 = 0, s2 = 0;
          // Two independent DFAs in one loop: s1 over the first half, s2 over
          // the second, so their dependency chains overlap.
          for (int i = 0, j = half; i < half; i++, j++) {
              s1 = utf8d_transition[(s1 + (utf8d_toclass[b[i] & 0xFF])) & 0xFF];
              s2 = utf8d_transition[(s2 + (utf8d_toclass[b[j] & 0xFF])) & 0xFF];
          }
          // Finish the tail with the second stream, which left off at 2*half.
          for (int i = half * 2; i < b.length; i++) {
              s2 = utf8d_transition[(s2 + (utf8d_toclass[b[i] & 0xFF])) & 0xFF];
          }
          return s1 == 0 && s2 == 0;
      }

      After warmup, I get a speedup of 1.98x, so it’s almost exactly twice as fast. (The first-iteration behavior is slightly weird: the two-stream DFA starts out slower, which is normal, but the single-stream one starts out faster, so the JIT is maybe making a bad decision – in any case the speedup appears to be close to 2x regardless.)

      This is not tested, but it should be “approximately right” and I don’t expect fixes to hurt the performance.

      1. Post and code updated; I credit you in the updated blog post.

        Given how fast the Guava code is, it is hard to make a case for the finite-state-machine approach. The main drawback of the Guava approach is the branch mispredictions, but I was not able to make it suffer. Of course, with real data it could well happen. Or maybe someone could generate more clever synthetic data (it wouldn’t be very hard, as I did a hasty job).

        Still, if you expect your strings to be often “almost ASCII” then the Guava code is probably pretty good.

        1. Travis Downs says:

          I think the ns column is out of date.

          Don’t give up on the DFA entirely: the two-stream version in the pull request is about twice as fast, and you can add more streams, which should be faster yet. Perhaps it can beat the Guava approach, at least on mixed text.

          For stuff that is almost entirely ASCII it’s a different game: Guava has a special loop to handle ASCII (even after it finds the first non-ASCII character), so I expect it to be fairly fast. However, other algorithms like the DFA can also adopt the same trick, although sometimes it slows down the mixed case.

          The fastest way to do the ASCII check in Java is to get SWAR going, which is possible but tricky – maybe I’ll post a bit on that later.

          1. I think the ns column is out of date.

            It is. I deleted it for now.

            Guava has a special loop to handle ASCII (even after it finds the first non-ASCII character)

            It does, but I am not sure it is necessarily a fruitful optimization. If you think about it, the naive code is just a loop over the characters. You first check whether it is an ASCII character; if so, you just continue the loop. If it is not, you look at more characters. That’s the general pseudocode. Maybe it is a failure of imagination on my part, but I think that there is nothing particularly optimized in the Guava code. (Note: this is not criticism. It is good code.)
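
            A minimal sketch of that naive structure (validateMultibyte is a hypothetical stand-in for the multi-byte handling):

            static boolean isWellFormedNaive(byte[] b) {
                for (int i = 0; i < b.length; ) {
                    if (b[i] >= 0) { // ASCII character: just continue the loop
                        i++;
                        continue;
                    }
                    // Non-ASCII: look at more characters. The hypothetical
                    // helper returns the sequence length, or -1 if invalid.
                    int len = validateMultibyte(b, i);
                    if (len < 0) {
                        return false;
                    }
                    i += len;
                }
                return true;
            }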

            The real story is whether a finite state machine can beat a Guava-like routine when most of the data is ASCII.

            Of course, you can bypass the state machine with a fast ASCII check… but you have to patch the state.

            Probably what you have in mind is to use SWAR to check for ASCII, and if so, you just jump ahead.

            SWAR is conceptually easy, I would think… it is just a bitwise OR and a comparison. But you need to convince Java to produce the right code.
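
            As a concrete example, here is a minimal SWAR sketch of that idea (my illustration; convincing the JIT to produce the ideal code is the hard part):

            import java.nio.ByteBuffer;

            static boolean isAscii(byte[] b) {
                ByteBuffer buf = ByteBuffer.wrap(b);
                long acc = 0;
                int i = 0;
                // OR eight bytes at a time; byte order does not matter because
                // we only test the high bit of every lane at the end.
                for (; i + 8 <= b.length; i += 8) {
                    acc |= buf.getLong(i);
                }
                for (; i < b.length; i++) {
                    acc |= (b[i] & 0xFF); // tail bytes
                }
                return (acc & 0x8080808080808080L) == 0;
            }

            Note there is no early exit here; a real implementation might test the accumulator periodically to bail out early on non-ASCII input.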

            1. Travis Downs says:

              It does, but I am not sure it is necessarily a fruitful optimization.
              If you think about it, the naive code is just a loop over the
              characters. You first check whether it is an ASCII character; if so,
              you just continue the loop.

              It might not be in all cases, but it often is, and I am fairly confident that’s how this “redundant” code ended up there, as I have a fair amount of faith in the quality of Guava.

              Basically splitting out these simpler loops makes a better optimization target, or communicates more information to the compiler.

              The compiler usually tries to optimize loops: that is, it does more work outside the loop and reorganizes the loop so that the body is as fast as possible. If you have a separate ASCII loop, the compiler will optimize that loop, which gives you (in principle) the fastest version of that loop the compiler knows how to produce.

              Once you have a bigger loop, with several checks and exit conditions, the compiler will try to optimize that whole loop the best it knows how. It doesn’t know which branches are likely to be taken. It might try to reduce branches by using conditional moves, it might reorder your branches, or it might use another trick like a jump table. It might hoist instructions needed in most branches but not in the ASCII one. Basically, it is much less likely to produce an optimal result for the thing you consider the fast path.

              By splitting it out into its own loop, you take advantage of the compiler’s natural behavior of optimizing loops kind of as a “unit”.

              A practical example that doesn’t require discussing specific x86 loops in detail is vectorization: you can see how a small loop like the isASCII loop could be vectorized efficiently by the compiler. (As it happens, none of them vectorize this specific loop, since it has an “early exit” which makes it trickier, but in principle it can be done.) However, vectorizing the whole loop is much harder: it can also be done, but you end up with a much slower loop, since it has to handle all the cases via blends and so on. By splitting out the loop, you tell the compiler: hey, it’s OK to vectorize this part separately.
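
              Schematically, the split-loop pattern looks like this (validateSlowPath is a hypothetical stand-in for the general validation code):

              static boolean isWellFormed(byte[] bytes) {
                  // Tight ASCII-only fast path: a small, simple loop that the
                  // JIT can optimize (or, in principle, vectorize) in isolation.
                  int i = 0;
                  while (i < bytes.length && bytes[i] >= 0) {
                      i++;
                  }
                  // Hand off to the general multi-byte validation code.
                  return validateSlowPath(bytes, i);
              }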

              We can run a quick check. Here are three results:

              ASCII only

              Benchmark                      (N)  Mode  Cnt        Score       Error  Units
              Utf8Validate.testGuava_utf8     191  avgt    3      243.306 ±     0.864  ns/op
              Utf8Validate.testGuava_utf8    1910  avgt    3     2447.032 ±    30.404  ns/op
              Utf8Validate.testGuava_utf8   19100  avgt    3    24043.625 ±   148.429  ns/op
              Utf8Validate.testGuava_utf8  191000  avgt    3   239788.187 ±  2045.625  ns/op

              ASCII mostly Google style

              Benchmark                      (N)  Mode  Cnt        Score       Error  Units
              Utf8Validate.testGuava_utf8     191  avgt    3      417.054 ±     2.246  ns/op
              Utf8Validate.testGuava_utf8    1910  avgt    3     4148.617 ±    97.589  ns/op
              Utf8Validate.testGuava_utf8   19100  avgt    3    40847.803 ±  2004.263  ns/op
              Utf8Validate.testGuava_utf8  191000  avgt    3   407459.188 ±  3612.164  ns/op

              ASCII mostly Daniel style

              Benchmark                      (N)  Mode  Cnt        Score       Error  Units
              Utf8Validate.testGuava_utf8     191  avgt    3     2935.855 ±    53.121  ns/op
              Utf8Validate.testGuava_utf8    1910  avgt    3    28493.618 ±   346.513  ns/op
              Utf8Validate.testGuava_utf8   19100  avgt    3   281275.029 ±  3620.771  ns/op
              Utf8Validate.testGuava_utf8  191000  avgt    3   866452.130 ± 38382.579  ns/op

              This is the same benchmark code as yours, but with a modified string. “ASCII only” zeros the top bit, so all chars are single-byte ASCII chars. Here we never drop into isWellFormedSlowPath at all, so we are testing the initial loop that you describe in your blog post above. We get about 0.66 cycles per byte – very fast!

              The “ASCII mostly” strings have a single Chinese (I think) character followed by all ASCII, so we skip past this first loop. “Google style” is the existing Guava code, and “Daniel style” removes the loop part of isWellFormedSlowPath and adds a check if (byte > 0) continue; before the other conditions.

              We see that the existing code does benefit from the initial loop: breaking it into fast and slow paths gives a 1.7x speedup, probably because the compiler can generate an even simpler loop when it doesn’t have to track the byte value, and can likely unroll it.

              Removing the special-case ASCII loop in the slow path has a much bigger impact: the speedup from that trick was 7x! So here it paid off big time for mostly-ASCII strings. Removing it changes the result from around 1.1 cycles per byte to 8 cycles per byte! This blog post might look very different if that implementation had been used.

              Such a big difference is frankly pretty weird. I could have messed up, or perhaps you can blame the JIT for doing something stupid. However, I have seen this pattern repeated in high-performance code and have seen the generated code myself, so the general effect is real enough.

              but I think that there is nothing particularly optimized in the Guava code.

              It seems optimized to me in the sense that I don’t see what more you could do once you’ve decided on “explicit checks and branches”. They definitely seem to have gotten the non-obvious ASCII part right, with two apparently redundant loops that got more than a 10x speedup. They use some clever tricks for detecting the various byte classes that take advantage of signed bytes, detecting byte types in a single comparison that would naively take two comparisons, or some other operation plus one comparison.
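
              For example, here is the flavor of the signed-byte trick (my illustration, not Guava’s exact code), testing whether a byte is a UTF-8 continuation byte (0x80–0xBF):

              // Naive: mask to an unsigned value and compare twice.
              static boolean isContinuationNaive(byte b) {
                  int v = b & 0xFF;
                  return 0x80 <= v && v <= 0xBF;
              }

              // Signed-byte trick: 0x80..0xBF are exactly the signed byte
              // values -128..-65, so a single comparison against (byte) 0xBF
              // (which is -65) suffices.
              static boolean isContinuation(byte b) {
                  return b <= (byte) 0xBF;
              }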

              The real story is whether a finite state machine can beat a Guava-like
              routine when most of the data is ASCII.

              Well, my claim is more or less: if your input is mostly ASCII, then you want special ASCII handling, which will look much the same regardless of your slow path, so “Guava-like” versus “DFA-like” is mostly irrelevant at that point, at least as you approach the limit of “all ASCII”.

              Of course, you can bypass the state machine with a fast ASCII check… but you have to patch the state.

              Well the state is always zero in between valid characters, so you can just skip as many ASCII bytes as you want and the state stays zero (said another way, state 0 is mapped to state 0 every time an ASCII character occurs).

              So probably you’d do some kind of fast ASCII check and then fall back to X when you detect non-ASCII chars. Depending on how frequent non-ASCII chars actually are, what X is may vary, and also the strategy of bouncing back and forth would vary.

              1. Well the state is always zero in between valid characters, so you can just skip as many ASCII bytes as you want and the state stays zero (said another way, state 0 is mapped to state 0 every time an ASCII character occurs).

                Suppose that X1 X2 X3 is a valid 3-byte UTF-8 code point. Suppose that I insert an ASCII character A in there: X1 A X2 X3. The result is not valid. If I just skip the ASCII characters, I will not properly validate the string. So I need to do “something” to “patch the state”. This cost can be tiny, but you can’t ignore the state.

                1. Travis Downs says:

                  Well, we are just using different definitions of “skip”. I don’t mean that you skip ASCII characters in such a way that previously non-adjacent characters like X1 and X2 now appear next to each other (i.e., somehow “filtering out” ASCII characters and then processing the merged string).

                  I just mean if you have consecutive ASCII characters you can skip them in the sense that you only process the characters before and after, but “before” and “after” are separate strings and the state has to be checked at the end of the “before” string. Maybe this can be called “patching” the state, but I think it probably naturally falls out of the code since you have to do more things at the transition point.

                  The game is more about balancing the code in the middle region where there are “some” non-ASCII characters, as this is common in many languages, like French, which is largely ASCII except for accented characters. Large skips probably aren’t possible here, but still, perhaps 90% of characters are ASCII.
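
                  Here is a minimal sketch of that skip idea, reusing the tables from the code above. The skip is taken only while the DFA sits in state 0 (between complete characters), which is also what handles the X1 A X2 X3 case correctly:

                  public static boolean isUTF8_asciiSkip(byte[] b) {
                      int s = 0, i = 0;
                      while (i < b.length) {
                          if (s == 0) {
                              // Between characters: skip a run of ASCII bytes.
                              // State 0 maps to state 0 on ASCII, so s stays 0.
                              while (i < b.length && b[i] >= 0) {
                                  i++;
                              }
                              if (i == b.length) {
                                  break;
                              }
                          }
                          // Mid-character, or at a non-ASCII byte: step the DFA,
                          // which rejects an ASCII byte inside a sequence.
                          s = utf8d_transition[(s + utf8d_toclass[b[i] & 0xFF]) & 0xFF];
                          i++;
                      }
                      return s == 0;
                  }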

              2. I had already coded a simpler version. There is a difference, but it is tiny. It seems just as fast as the Guava version in my tests, plus or minus some small margin. We seem to agree that, algorithmically, it is equivalent. I don’t doubt that it can result in different code… compilers are sensitive to how the code is written, certainly.

                Here are my exact results…

                Utf8Validate.testGuavaSimpler_utf8  191   ASCII  avgt    3   2481476.591 ±   562462.623  ns/op
                Utf8Validate.testGuavaSimpler_utf8  191    UTF8  avgt    3  19083439.790 ±   130528.023  ns/op
                Utf8Validate.testGuava_utf8         191   ASCII  avgt    3   2483272.792 ±   240088.451  ns/op
                Utf8Validate.testGuava_utf8         191    UTF8  avgt    3  19061683.123 ±   465399.508  ns/op
                

                I wrote this code before you wrote your comment, as you can check through my commits.

                I’ll repeat once more, just so that there is no misunderstanding, that I am not critical of Guava or of the quality of its code. I recommend using Guava in this case.

                1. Travis Downs says:

                  As far as I understand your code, the ASCII mode uses only ASCII characters, so the “slow path” code we’ve been discussing never gets executed at all; so yes, it performs identically, since it only ever executes identical code.

                  You need at least one non-ASCII char (logically, near the beginning of the string) to test the code path we’re discussing here.

                  1. Ah. Now I understand. Yes, I agree with you entirely then.

        2. Travis Downs says:

          I also found it surprising that Guava is “so fast”. As you mention, you’d expect branch misses to impact this algorithm.

          As it turns out they do, but your string needs to be long enough. In the N=191 (944 bytes) case, the branch predictor is able to memorize the entire pattern – there is less than one mispredict (on average 0.2) for the entire function. Even at N=1910 (about 9,440 bytes), the predictor can memorize substantially the entire function: there are on average only 10 mispredicts, or about 1 mispredict every 1,000 bytes.

          Yes, TAGE is quite amazing, and in fact this type of pattern is where it excels: the random nature of the branches forms an effective high-entropy “key” into the predictor tables, allowing many unique histories to be predicted. This is not the first time I’ve seen sequences of thousands of random branches well-predicted. On the other hand, TAGE does poorly at predicting the exit of simple fixed-count loops: if you loop 30 or 40 times and always exit at the same count, TAGE will reliably fail. In fact, it is able to predict the loop exit in this case (for N=191 there are only 0.2 mispredicts, so it must get the loop exit right most of the time) precisely because the random branches inside the loop give it enough history to key off of. I recommend any of the early TAGE papers: the mechanism is simple and powerful.

          Anyway, at N=19100 the branch prediction finally fails, and the performance drops substantially. The mispredicts jump to 18,000 (across about 94,400 bytes) and the time jumps to 6.2 cycles per byte. The double-, triple- and quad-stream DFA approaches are much faster, since they stay steady at about 2.4 to 2.6 cycles per byte for all N values.

          So I think you can say that the DFA approach is better, as long as you use at least two streams, in the case where the input to validate has a heavy mix of character lengths and is not predictable. I think in the usual case (outside of benchmarks) the “is not predictable” condition holds, but the character composition obviously depends on the application and especially the language.

          You can see the branch prediction information by appending -prof perfnorm to the command line when you run your benchmark.

          1. I have updated the code and the post to reflect this important insight.

            I could not figure out why the branch mispredictions did not kill the performance. In retrospect, it seems somewhat obvious.

            Thanks!

            1. Travis Downs says:

              I think the non-obvious part is that the branch predictor can memorize a series of 20,000 branches with almost 100% accuracy. That’s pretty interesting.