Daniel Lemire's blog


Emojis, Java and Strings

26 thoughts on “Emojis, Java and Strings”

  1. Krythic says:

    “I imagine this could be more computationally expensive” … clearly you have no idea what you’re talking about, and it astounds me that you felt the need to even write this out, being how erroneous it is. What’s to stop you from simply doing length / 2? Are you autistic or something? This is not ok, and quite frankly, you should be extremely embarrassed right now. If you actually knew programming you would never feel the need to write this.

    1. What’s to stop you from simply doing length / 2?

      Given an arbitrary UTF-16 string, and its length in bytes, I cannot know how many unicode characters there are without examining the content of the bytes. So no, dividing by two is not good enough. It will work in this case, but not in general.
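To make the point concrete, here is a minimal Python sketch (the helper name `count_code_points_utf16` is made up for this illustration, and the input is assumed to be well-formed little-endian UTF-16). Counting code points means looking at every code unit and skipping the low surrogates; no arithmetic on the length alone can do it.

```python
import struct

def count_code_points_utf16(data: bytes) -> int:
    """Count code points in little-endian UTF-16 data (assumed well-formed)."""
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    # Each code point contributes exactly one unit that is not a low surrogate.
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)

emoji = "\U0001F602\U0001F60D\U0001F389\U0001F44D".encode("utf-16-le")  # 😂😍🎉👍
mixed = "ab\U0001F602".encode("utf-16-le")                              # ab😂

emoji_units = len(emoji) // 2   # 8 code units
mixed_units = len(mixed) // 2   # 4 code units

print(emoji_units // 2, count_code_points_utf16(emoji))  # 4 4: halving happens to work
print(mixed_units // 2, count_code_points_utf16(mixed))  # 2 3: halving is wrong
```

Halving is only right when every character is a surrogate pair, which is why it works for the all-emoji string and nowhere else.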

      1. Vitaly Kravchenko says:

        Daniel, I don’t know why you bothered posting his comment, let alone replying to it. Asking if someone is autistic, telling one how one should feel, saying one doesn’t know how to program. Wow! Even if there was factual merit to what he said, I wouldn’t expect this to pass moderation 🙂

        1. Yes, the comment was abusive. But I figured that the reasoning mistake being made was interesting.

    2. Aankhen says:

      Are you autistic or something?

      I don’t think that word means what you think it means. Good attempt at unprompted flaming, though. I tip my hat to Daniel for a classy response to a blatant troll.

  2. Matt Casters says:

    Be careful, just because it is called UTF-16 or 32 does not mean 2 or 4 bytes are used per codepoint. In fact even UTF-8 can go up to 6 bytes.
    The compatibility mess was not created by Java though, it just tries to be as compatible as possible in a changing Unicode world where charAt() worked fine until the world changed.

    1. Can you elaborate? Which code points require 6 bytes in utf-8?

      1. KWillets says:

        It’s apparently a reference to FSS-UTF, pre-RFC 3629: https://en.wikipedia.org/wiki/UTF-8#History
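For reference, RFC 3629 caps UTF-8 at U+10FFFF, so a code point never takes more than 4 bytes today; the 5- and 6-byte forms belonged to the older design. A small Python check of the largest code point in each length class:

```python
# Largest code point that fits in 1, 2, 3 and 4 bytes of UTF-8 (RFC 3629).
boundaries = [0x7F, 0x7FF, 0xFFFF, 0x10FFFF]
lengths = [len(chr(cp).encode("utf-8")) for cp in boundaries]
print(lengths)  # [1, 2, 3, 4]
```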

        1. Matt Casters says:

          Below is a detailed description I read ages ago when I was trying to figure out why Java was so slow reading Strings compared to simple ASCII reading. That was when lazy conversion and parallel CSV reading were implemented in Kettle, because you can burn a tremendous amount of CPU cycles properly reading files from all over the world, let alone doing accurate date-time conversions, floating point number reading and so on. It wrong-footed me, since all my IT life I was told that reading files is IO bound. In a world of ultra fast parallel disk subsystems and huge caches, I can assure you this is no longer the case. Please note the link is 15 years old, from before the emoji era, but perhaps in another 15 years Unicode will have faced other challenges.

          https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

  3. Dmitry Akimov says:

    I totally second that the current state of programming is disastrous. Too bad not too many programmers seem to realize that or express an intent to do something about that.

    I think about all these string problems like this: strings are not random access, period. The fact that strings have been represented as arrays with characters as elements is yet another artifact of the programmer nerds’ ignorance, one of the series of “misconceptions programmers have about X.” Given that, humanity should have started by inventing efficient abstractions to deal with non-random-access strings instead of the ugliness we see in Java and elsewhere.

    UTF-32 on its own may be considered a hack, in my opinion, as it is an incredibly wasteful representation: it consumes 4x the memory normally needed for an English string, which is kind of ridiculous. I would say, even UTF-16 is already not good with its 2x redundancy. Given that UTF-16 is both inefficient, and not random-access, it seems like a redundant solution in the presence of UTF-8.
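The size ratios are easy to check with a short Python snippet (the sample text is arbitrary; any pure-ASCII string gives the same 1x / 2x / 4x pattern):

```python
text = "Hello, world"  # 12 ASCII characters

# Use the explicit little-endian codecs so that no byte-order mark is added.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # {'utf-8': 12, 'utf-16-le': 24, 'utf-32-le': 48}
```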

  4. Erin Keenan says:

    But are there any good use-cases for random-access to code-points? It seems like it’ll actually just encourage bugs, since it’ll kind-of sort-of work on some things, but then break when you throw a string with combining characters at it.

    It seems reasonable, perhaps even good, for a language to not provide random access to code points.

    (Tangentially, a great thing about emojis is it flushed out a lot of apps that had shitty unicode support and forced them to fix it.)

    1. Computing substrings is a common problem… it is part of most standard APIs… No?

  5. Erin Keenan says:

    But don’t the substring algos work fine operating byte-by-byte on utf8?

    As an example, Go strings are (by convention) utf8, and provide no random access to code-points. It’s AFAIK not something people complain about, and in fact, Go’s support for unicode is generally considered pretty good. (But maybe it’s just because people are too busy complaining about other things, like missing generics!) 🙂
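A rough Python illustration of the byte-by-byte idea, using `bytes` as a stand-in for a Go-style UTF-8 string. Because UTF-8 is self-synchronizing, a well-formed needle can only match at a code point boundary, so a plain byte search is enough to find and cut substrings:

```python
haystack = "\U0001F602\U0001F60D\U0001F389\U0001F44D".encode("utf-8")  # 😂😍🎉👍
needle = "\U0001F389".encode("utf-8")                                  # 🎉

offset = haystack.find(needle)              # a byte offset, not a character index
suffix = haystack[offset:].decode("utf-8")  # safe: the match starts on a boundary
print(offset, suffix)  # 8 🎉👍
```

Note that the offset is measured in bytes: this finds a known substring, it does not answer "drop the first two characters".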

    1. Erin Keenan says:

      oops, this was supposed to be in reply to daniel’s post beginning “Computing substrings…”

    2. I’m not sure I understand what you are saying.

      Let us compare…

      In Python, if I want to prune the first two characters, I do…

      >>> x= "😂😍🎉👍"
      >>> x[2:]
      '🎉👍'
      

      In Swift, I do…

        var x = "😂😍🎉👍"
        var suf = String(x.suffix(2))
      

      In Go, you do…

      var x = "😂😍🎉👍"
      var suf = string([]rune(x)[2:])
      

      So I can see why people don’t complain too much about Go.

      1. Erin Keenan says:

        Well, the Go code is doing something a bit different, it’s converting the string into a []rune (aka []int32) and then slicing that. If you’re willing to convert from string into some sort of vector type, then you’re always going to have direct indexability, of course.

        But my bigger point is that, AFAIK, it is never a good idea to index strings by code-point anyway. Your example happens to work on the input you’ve given, but breaks on other input.

        E.g., the string “m̀h😂😍” will not print what you expect.

        https://play.golang.org/p/iWjxjpBa-_g

        So I think it’s probably better not to have code-point indexing built into strings, as a gentle nudge towards using more sophisticated algorithms when needing to do actual “character” (i.e. grapheme) level manipulations.
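The failure is easy to reproduce in Python, whose strings are indexed by code point. In this small sketch the string is written with escapes so the combining accent is visible:

```python
import unicodedata

s = "m\u0300h\U0001F602\U0001F60D"  # m̀h😂😍: 5 code points, 4 perceived characters
print(len(s))  # 5

head, tail = s[:1], s[1:]
# head is a bare "m"; tail now begins with an orphaned combining grave accent.
print(unicodedata.name(tail[0]))  # COMBINING GRAVE ACCENT
```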

        1. So I think it’s probably better not to have code-point indexing built into strings, as a gentle nudge towards using more sophisticated algorithms when needing to do actual “character” (i.e. grapheme) level manipulations.

          Should the language include or omit these “more sophisticated algorithms”?

          I mean… do you expect Joe programmer to figure this out on his own… Or do you think that the language should tell Joe about how to do it properly? Or should Joe never have to do string manipulations?

          I would argue that Java provides no help here. It explicitly allows you to query for the character at index j and gives you a “character” which can very well be garbage. How useful is that?
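To show what that "garbage" looks like without leaving Python, the sketch below lists the UTF-16 code units of a string, which is what Java's `charAt(j)` indexes into (the helper name `utf16_units` is made up for this example):

```python
import struct

def utf16_units(s: str) -> list:
    """Return the UTF-16 code units of s, the sequence Java's charAt indexes."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

units = utf16_units("\U0001F602")  # 😂 is a single code point
print([hex(u) for u in units])     # ['0xd83d', '0xde02']
```

Index 0 holds a lone high surrogate and index 1 a lone low surrogate; neither is a character on its own.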

          Code points would be better. Still, I agree that code point indexing is probably not great (even though it is better than whatever Java offers) but… if you want better, why not go with user-perceived characters?

          Swift gives you this…

            1> var x = "m̀h😂😍"
          x: String = "m̀h😂😍"
            2> x.count
          $R0: Int = 4
           3> var suf = String(x.suffix(3))
          suf: String = "h😂😍"
           4> var suf = String(x.suffix(4))
          suf: String = "m̀h😂😍"
          

          What, if anything, do you not like about Swift?

          I think Swift is way ahead of the curve on this one.

          1. Erin Keenan says:

            Hey, that’s cool! I’m not a swift user, but looking up the docs, Swift is doing the correct thing, giving you “extended grapheme clusters”. Great!

            It’s just the middle-ground of giving you code-points which I’m not a fan of — it leads you toward bugs that are hard to notice.

            (I also still like the Go approach of, “a string is a sequence of utf8 bytes; use a unicode library if you want fancy manipulations”. Maybe the Swift approach will turn out to be even nicer, though hard to say w/o experience using it.)

            1. Rust is a lot of fun… This does not do what I expected…

              let v = String::from("m̀h😂😍");
              let s = v.get(0..3).expect("");
              println!("{}",s);
              
              1. Erin Keenan says:

                I feel like such a philistine, since I don’t know Rust either, but that is not a surprising result to me!

                Go will give you the same.

                https://play.golang.org/p/jimB5h8WwWn

                The reason is the first two code-points are

                006D LATIN SMALL LETTER M
                0300 COMBINING GRAVE ACCENT

                (Those two code-points combine together to give you the single grapheme “m̀”.)

                Encoded into utf8, they become 3 bytes (109, 204, 128). So if you are treating the string as a sequence of utf8 bytes, slicing the first 3 elements would give you that.
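The same byte values can be checked in Python, treating `bytes` as the "sequence of utf8 bytes" view. This is only a sketch of the idea, not of how Rust or Go implement slicing:

```python
data = "m\u0300h".encode("utf-8")  # m, combining grave accent, h
print(list(data))                  # [109, 204, 128, 104]

first_three = data[:3].decode("utf-8")  # exactly "m" plus its accent

# Cutting after 2 bytes lands in the middle of the accent's 2-byte encoding.
try:
    data[:2].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)  # False
```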

                So it looks like Rust, like Go, takes this approach. And if you care about fancier manipulations, you need to use a library (e.g., https://crates.io/crates/unicode-segmentation).

                As a fun aside, that string breaks a couple playgrounds:

                https://play.rust-lang.org/?gist=9958c46c59eff8d655c818e55580d202&version=undefined&mode=undefined

                https://trinket.io/python/8a0742b45e

                Try editing text after the “m̀”; the cursor doesn’t line up correctly. You also can’t select the string in the Rust playground.

                The Go playground works correctly, but probably just because it uses a simple text-entry box w/o syntax highlighting or other niceties. (But would you rather have simple-but-correct or fancy-but-buggy software?)

                Finally, I managed to hang emacs by asking it to describe-char “m̀”.

                Unicode support is still janky in a lot of places!

                1. I have no problem understanding the result, but it is not what I expected it to do.

                  1. Erin Keenan says:

                    Sorry, I was not trying to imply you didn’t understand the result, just provide some explanation/context/motivation for the result.

                    I think what you’re saying is, “I expect a string to look like a sequence of graphemes”.

                    Whereas Go and Rust say, “a string is a sequence of utf8 bytes”. So in that sense, it’s not what you expect.

                    I think the Go and Rust approach is still reasonable, since they’re likely to lead to correct software. (Vs, say, Python, which is “almost right” in the default case, making it easier to make subtly-broken software.)

                    (Come to think, perhaps a better test-case to give you would’ve been “👷‍♀️👩‍⚕️🎉👍”.)
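A quick Python count shows why that is a harsher test. The sketch assumes the two worker emojis are the usual zero-width-joiner sequences (person, U+200D, gender or profession sign, U+FE0F), written out here with escapes:

```python
s = (
    "\U0001F477\u200D\u2640\uFE0F"   # 👷‍♀️ woman construction worker: 4 code points
    "\U0001F469\u200D\u2695\uFE0F"   # 👩‍⚕️ woman health worker: 4 code points
    "\U0001F389\U0001F44D"           # 🎉👍: 1 code point each
)
print(len(s))                  # 10 code points for 4 perceived characters
print(len(s.encode("utf-8")))  # 34 bytes
```

Slicing by code point here can strand a zero-width joiner or a variation selector, not just an accent.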

                    The Swift approach seems reasonable too, and maybe even better since it does the right thing by default, though at the cost that you’ve got a lot of unicode complexity in your core string class, and it’s non-obvious (at least to me) what your internal string representation is going to be, or what the perf cost of various operations is going to be. (E.g., is something like “.count” on a swift string constant time, or does it have to run through the whole string calculating the graphemes?)

                    1. I think the Go and Rust approach is still reasonable, since they’re likely to lead to correct software.

                      In what sense?

                      You are still left to do things like normalization on your own. This makes it quite hard to do correct string searches in Go, say.

                      Try this:

                      package main

                      import (
                        "fmt"
                        "strings"
                      )

                      func main() {
                        // Precomposed é (U+00E9) versus e followed by a combining acute accent (U+0301).
                        var x = "Pok\u00E9mon"
                        var y = "Poke\u0301mon"
                        fmt.Println("are ", x, " and ", y, " equal/equivalent?")
                        fmt.Println(x == y)
                        fmt.Println(strings.Compare(x, y))
                      }

                      Sure, you can remember to use a unicode library as you say and never rely on the standard API to do string processing, but Go does not help you. If you don’t know about normalization, and try to write a search function in Go, you will get it flat wrong, I bet.
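For comparison, here is the same pair of strings in Python, where the standard library's `unicodedata.normalize` is the extra step a correct search needs. This is a sketch of the normalization idea, not a claim about any particular Go library:

```python
import unicodedata

x = "Pok\u00E9mon"   # precomposed é
y = "Poke\u0301mon"  # e followed by a combining acute accent

raw_equal = x == y
nfc_equal = unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)
print(raw_equal, nfc_equal)  # False True
```

A substring search has the same problem: both the needle and the haystack have to be brought to the same normalization form first.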

  • Erin Keenan says:

    (I think I hit the nesting depth limit for replies; this is a reply to Daniel’s sibling comment at 4:04.)

    That is a fair point, but I would not view the situation as dimly as you do.

    I would say software like that has sharp-edges, rather than being incorrect. If I, as a user, normalize my input before handing it off to the software, it will function correctly. This is how emacs works, for example. It is an annoyance occasionally, but not in my mind a “bug” per se.

    Compare this situation to the two code playgrounds I posted above.

    Once you include a multi-code-point grapheme in your input, they stop working correctly, full stop. The character insert offset is shown incorrectly, and text selection using the mouse is glitchy. There is nothing you can do as a user to avoid this.

    So that’s the style of bug that’s encouraged by the “almost correct” perspective of a string as a sequence of code-points.

    I take your point, though, that Swift’s perspective of a string as a sequence of graphemes may be the superior approach, avoiding both types of undesirable behavior.

    (Though I guess at some perf & complexity price.)

    So going back to your original post, in my view, Python 3’s behavior is bad, Go and Rust are ok, and Swift is (maybe) the best.

  • Bart Wiegmans says:

    Interesting blog post.

    I wanted to point out that MoarVM (the Perl6 VM) uses a string representation called ‘normalized form grapheme’ that allows efficient random access on unicode grapheme strings. Link to documentation.

    The essence of that trick is to combine a base character and all its combining characters into a grapheme and map that to a synthetic codepoint (I believe a negative number), which is then ‘unmapped’ when encoded to an external format (e.g., UTF-8).
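A toy Python sketch of that idea follows. It is not MoarVM's actual implementation: the names `to_nfg` and `from_nfg` are invented here, and the segmentation is a deliberate simplification (a base character plus any following combining marks) rather than real extended grapheme clusters.

```python
import unicodedata

def to_nfg(text):
    """Return (codes, table): one int per simplified grapheme.

    Single code points keep their own value; multi-code-point clusters get a
    negative synthetic id, with table[-id - 1] holding the original cluster.
    """
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch       # attach the mark to the preceding base
        else:
            clusters.append(ch)

    table, ids, codes = [], {}, []
    for c in clusters:
        if len(c) == 1:
            codes.append(ord(c))
        else:
            if c not in ids:
                table.append(c)
                ids[c] = -len(table)  # synthetic ids are -1, -2, ...
            codes.append(ids[c])
    return codes, table

def from_nfg(codes, table):
    """Undo to_nfg, e.g. before encoding to UTF-8 for output."""
    return "".join(chr(c) if c >= 0 else table[-c - 1] for c in codes)

text = "m\u0300h\U0001F602\U0001F60D"  # m̀h😂😍
codes, table = to_nfg(text)
print(len(codes), codes[0])  # 4 -1
```

With this layout `codes[i]` is constant-time access to the i-th (simplified) grapheme, and the extra work is paid at the boundaries, when converting in and out.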

    This is obviously not perfect as it incurs extra cost at IO, although that is true of any system that uses anything other than UTF-8 internally. So I think it is a nice solution (until unicode runs out of 31 bit space, that is).

  • soft says:

    Hi,

    I want to show color emoji (😄) using JavaFX. I am able to show black and white emoji, but not color emoji.
    Is it possible to show color emoji in JavaFX? I am using the Segoe UI Emoji font.
    Thanks