15th June 2018, 24 min read

Emojis, Java and Strings

26 thoughts on “Emojis, Java and Strings”

Krythic says:

June 15, 2018 at 11:37 pm

“I imagine this could be more computationally expensive” … clearly you have no idea what you’re talking about, and it astounds me that you felt the need to even write this out, being how erroneous it is. What’s to stop you from simply doing length / 2? Are you autistic or something? This is not ok, and quite frankly, you should be extremely embarrassed right now. If you actually knew programming you would never feel the need to write this.
1. Daniel Lemire says:
  
  June 16, 2018 at 12:33 am
  
  What’s to stop you from simply doing length / 2?
  
  Given an arbitrary UTF-16 string, and its length in bytes, I cannot know how many unicode characters there are without examining the content of the bytes. So no, dividing by two is not good enough. It will work in this case, but not in general.
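  A minimal Python sketch of why halving fails in general. Length is measured here in 16-bit code units, which is what Java's `String.length()` reports; the helper names and sample strings are assumptions chosen for illustration.

```python
# Illustrative sketch; the helper names and sample strings are assumptions.

def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units, which is what Java's String.length() reports."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Python strings are sequences of code points, so len() counts them."""
    return len(text)


samples = {
    "all emoji": "\U0001F602\U0001F60D\U0001F389\U0001F44D",
    "all ASCII": "abcd",
    "mixed": "a\U0001F602",
}

for label, text in samples.items():
    units = utf16_code_units(text)
    print(label, "| units:", units, "| units // 2:", units // 2,
          "| code points:", code_points(text))
```

  Halving matches the code point count only for the all-emoji string, where every character happens to need a surrogate pair.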
  1. Vitaly Kravchenko says:
    
    June 17, 2018 at 8:55 pm
    
    Daniel, I don’t know why you bothered posting his comment, let alone replying to it. Asking if someone is autistic, telling one how one should feel, saying one doesn’t know how to program. Wow! Even if there was factual merit to what he said, I wouldn’t expect this to pass moderation 🙂
    1. Daniel Lemire says:
      
      June 18, 2018 at 5:26 am
      
      Yes, the comment was abusive. But I figured that the reasoning mistake being made was interesting.
2. Aankhen says:
  
  June 16, 2018 at 12:17 pm
  
  Are you autistic or something?
  
  I don’t think that word means what you think it means. Good attempt at unprompted flaming, though. I tip my hat to Daniel for a classy response to a blatant troll.
Matt Casters says:

June 16, 2018 at 3:40 pm

Be careful, just because it is called UTF-16 or 32 does not mean 2 or 4 bytes are used per codepoint. In fact even UTF-8 can go up to 6 bytes.
The compatibility mess was not created by Java though, it just tries to be as compatible as possible in a changing Unicode world where charAt() worked fine until the world changed.
1. Daniel Lemire says:
  
  June 16, 2018 at 3:50 pm
  
  Can you elaborate? Which code points require 6 bytes in utf-8?
  1. KWillets says:
    
    June 16, 2018 at 5:37 pm
    
    It’s apparently a reference to FSS-UTF, pre-RFC 3629: https://en.wikipedia.org/wiki/UTF-8#History
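    As a quick check, assuming Python's built-in UTF-8 codec (which follows the RFC 3629 limits), the largest valid code point, U+10FFFF, encodes to four bytes. The sample code points are picked at the boundaries between encoded lengths.

```python
# Largest code point allowed by Unicode and RFC 3629.
MAX_CODE_POINT = 0x10FFFF


def utf8_length(code_point: int) -> int:
    """Number of bytes Python's UTF-8 codec uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


# Last code point of each encoded length: 1, 2, 3 and 4 bytes.
boundaries = [0x7F, 0x7FF, 0xFFFF, MAX_CODE_POINT]
lengths = [utf8_length(cp) for cp in boundaries]
print(lengths)
```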
    1. Matt Casters says:
      
      June 16, 2018 at 7:05 pm
      
      Below is a detailed description I read ages ago when I was trying to figure out why Java was so slow reading Strings compared to simple ASCII reading. It was when lazy conversion and parallel CSV reading were implemented in Kettle, because you can burn a tremendous amount of CPU cycles properly reading files from all over the world, let alone doing accurate date-time conversions, floating point number reading and so on. It wrong-footed me, since all my IT life I was told that reading files is IO bound. In the world of ultra fast parallel disk subsystems and huge caches, I can assure you this is no longer the case. Please note the link is 15 years old, from before the emoji era, but perhaps in another 15 years Unicode will have faced other challenges.
      
      https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Dmitry Akimov says:

June 16, 2018 at 7:37 pm

I totally second that the current state of programming is disastrous. Too bad not too many programmers seem to realize that or express an intent to do something about that.

I think about all these string problems like this: strings are not random access, period. The fact that strings have been represented as arrays with characters as elements is yet another artifact of the programmer nerds’ ignorance, one of the series of “misconceptions programmers have about X.” Given that, humanity should have started by inventing efficient abstractions to deal with non-random-access strings instead of the ugliness we see in Java and elsewhere.

UTF-32 on its own may be considered a hack, in my opinion, as it is an incredibly wasteful representation: it consumes 4x the memory normally needed for an English string, which is kind of ridiculous. I would say, even UTF-16 is already not good with its 2x redundancy. Given that UTF-16 is both inefficient, and not random-access, it seems like a redundant solution in the presence of UTF-8.
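A small Python sketch of the 1x/2x/4x figures above. The sample sentence and the helper name are assumptions; the little-endian codec names are used so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the same text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


english = "Hello, world"  # 12 ASCII characters
sizes = encoded_sizes(english)
print(sizes)
```

For pure ASCII text the three sizes are 12, 24 and 48 bytes.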
Erin Keenan says:

June 18, 2018 at 4:29 pm

But are there any good use-cases for random-access to code-points? It seems like it’ll actually just encourage bugs, since it’ll kind-of sort-of work on some things, but then break when you throw a string with combining characters at it.

It seems reasonable, perhaps even good, for a language to not provide random access to code points.

(Tangentially, a great thing about emojis is it flushed out a lot of apps that had shitty unicode support and forced them to fix it.)
1. Daniel Lemire says:
  
  June 18, 2018 at 4:48 pm
  
  Computing substrings is a common problem… it is part of most standard APIs… No?
Erin Keenan says:

June 18, 2018 at 5:13 pm

But don’t the substring algos work fine operating byte-by-byte on utf8?

As an example, Go strings are (by convention) utf8, and provide no random access to code-points. It’s AFAIK not something people complain about, and in fact, Go’s support for unicode is generally considered pretty good. (But maybe it’s just because people are too busy complaining about other things, like missing generics!) 🙂
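To illustrate the byte-by-byte claim, here is a sketch in Python rather than Go. It relies on UTF-8 being self-synchronizing: lead bytes and continuation bytes never overlap, so a match for a complete encoded needle cannot start in the middle of a character. The function name and sample text are assumptions.

```python
def find_utf8(haystack: str, needle: str) -> int:
    """Byte offset of needle inside the UTF-8 encoding of haystack, or -1."""
    return haystack.encode("utf-8").find(needle.encode("utf-8"))


text = "caf\u00E9 \U0001F602 party \U0001F389"
needle = "\U0001F389"

byte_offset = find_utf8(text, needle)
# Slicing the bytes at the match is safe: it always lands on a character boundary.
suffix = text.encode("utf-8")[byte_offset:].decode("utf-8")
print(byte_offset, suffix == needle)
```

The search never decodes a code point, yet the resulting offset can be used directly to cut the byte string.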
1. Erin Keenan says:
  
  June 18, 2018 at 6:26 pm
  
  oops, this was supposed to be in reply to daniel’s post beginning “Computing substrings…”
2. Daniel Lemire says:
  
  June 18, 2018 at 7:09 pm
  I’m not sure I understand what you are saying.
  
  Let us compare…
  
  In Python, if I want to prune the first two characters, I do…
```
>>> x = "😂😍🎉👍"
>>> x[2:]
'🎉👍'
```
  In Swift, I do…
```
  var x = "😂😍🎉👍"
  var suf = String(x.suffix(2))
```
  In Go, you do…
```
var x = "😂😍🎉👍"
var suf = string([]rune(x)[2:])
```
  So I can see why people don’t complain too much about Go.
  1. Erin Keenan says:
    
    June 18, 2018 at 9:12 pm
    
    Well, the Go code is doing something a bit different, it’s converting the string into a []rune (aka []int32) and then slicing that. If you’re willing to convert from string into some sort of vector type, then you’re always going to have direct indexability, of course.
    
    But my bigger point is that, AFAIK, it is never a good idea to index strings by code-point anyway. Your example happens to work on the input you’ve given, but breaks on other input.
    
    E.g., the string “m̀h😂😍” will not print what you expect.
    
    https://play.golang.org/p/iWjxjpBa-_g
    
    So I think it’s probably better not to have code-point indexing built into strings, as a gentle nudge towards using more sophisticated algorithms when needing to do actual “character” (i.e. grapheme) level manipulations.
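    The same breakage can be sketched in Python, whose strings are indexed by code point. The string is written with escapes so that the combining accent (U+0300) is explicit; the variable names are assumptions.

```python
# "m" + COMBINING GRAVE ACCENT, then "h", then two emoji: 5 code points,
# but only 4 characters as a reader perceives them.
text = "m\u0300h\U0001F602\U0001F60D"

code_point_count = len(text)

# Dropping one code point splits the accented "m": the result starts
# with a dangling combining accent.
drop_one = text[1:]

# Dropping two code points removes only the accented "m", not "m" and "h".
drop_two = text[2:]

print(code_point_count, drop_one.startswith("\u0300"), drop_two)
```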
    1. Daniel Lemire says:
      
      June 18, 2018 at 9:37 pm
      So I think it’s probably better not to have code-point indexing built into strings, as a gentle nudge towards using more sophisticated algorithms when needing to do actual “character” (i.e. grapheme) level manipulations.
      
      Should the language include or omit these “more sophisticated algorithms”?
      
      I mean… do you expect Joe programmer to figure this out on his own… Or do you think that the language should tell Joe about how to do it properly? Or should Joe never have to do string manipulations?
      
      I would argue that Java provides no help here. It explicitly allows you to query for the character at index j and gives you a “character” which can very well be garbage. How useful is that?
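      To make the “garbage” concrete, here is a sketch in Python (not Java) that lists the UTF-16 code units of a single emoji. Java's `charAt` returns one such 16-bit unit; the helper names here are assumptions.

```python
def utf16_units(text: str) -> list:
    """Split text into 16-bit UTF-16 code units, as Java stores them."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_surrogate(unit: int) -> bool:
    """True for half of a surrogate pair, which is not a character by itself."""
    return 0xD800 <= unit <= 0xDFFF


units = utf16_units("\U0001F602")  # face with tears of joy
print([hex(u) for u in units], [is_surrogate(u) for u in units])
```

      Both units are surrogates, so asking for the unit at index 0 or 1 yields half of a character.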
      
      Code points would be better. Still, I agree that code point indexing is probably not great (even though it is better than whatever Java offers) but… if you want better, why not go with user-perceived characters?
      
      Swift gives you this…
      
      1> var x = "m̀h😂😍"
      x: String = "m̀h😂😍"
      2> x.count
      $R0: Int = 4
      3> var suf = String(x.suffix(3))
      suf: String = "h😂😍"
      4> var suf = String(x.suffix(4))
      suf: String = "m̀h😂😍"
      
      What, if anything, do you not like about Swift?
      
      I think Swift is way ahead of the curve on this one.
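      For comparison, Python's standard library has no grapheme segmentation, but a rough approximation can be sketched with `unicodedata.combining`. This is a deliberate simplification and not the full Unicode segmentation algorithm: it only attaches combining marks to the preceding character and does not handle joiner sequences or flags. The function name is an assumption.

```python
import unicodedata


def approximate_clusters(text: str) -> list:
    """Group each combining mark with the character before it.

    Simplified on purpose: handles combining marks only, not ZWJ
    sequences, regional indicators, or other segmentation rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to its base character
        else:
            clusters.append(ch)
    return clusters


text = "m\u0300h\U0001F602\U0001F60D"
clusters = approximate_clusters(text)
suffix = "".join(clusters[-3:])
print(len(clusters), suffix)
```

      On this input the sketch agrees with the Swift session: four clusters, and the last three form “h😂😍”.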
      1. Erin Keenan says:
        
        June 18, 2018 at 9:53 pm
        
        Hey, that’s cool! I’m not a swift user, but looking up the docs, Swift is doing the correct thing, giving you “extended grapheme clusters”. Great!
        
        It’s just the middle-ground of giving you code-points which I’m not a fan of — it leads you toward bugs that are hard to notice.
        
        (I also still like Go approach of, “a string is a sequence of utf8 bytes; use a unicode library if you want fancy manipulations”. Maybe the Swift approach will turn out to be even nicer, though hard to say w/o experience using it.)
        
        Daniel Lemire says:
        
        June 18, 2018 at 10:19 pm
        
        Rust is a lot of fun… This does not do what I expected…
        
        let v = String::from("m̀h😂😍");
        let s = v.get(0..3).expect("");
        println!("{}", s);
        
        Erin Keenan says:
        
        June 18, 2018 at 10:50 pm
        
        I feel like such a philistine, since I don’t know Rust either, but that is not a surprising result to me!
        
        Go will give you the same.
        
        https://play.golang.org/p/jimB5h8WwWn
        
        The reason is the first two code-points are
        
        006D LATIN SMALL LETTER M
        0300 COMBINING GRAVE ACCENT
        
        (Those two code-points combine together to give you the single grapheme “m̀”.)
        
        Encoded into utf8, they become 3 bytes (109, 204, 128). So if you are treating the string as a sequence of utf8 bytes, slicing the first 3 elements would give you that.
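        The byte values can be checked with a short Python sketch; the variable names are assumptions. It also shows that a cut after two bytes would fall inside the accent's encoding.

```python
prefix = "m\u0300"  # LATIN SMALL LETTER M + COMBINING GRAVE ACCENT
prefix_bytes = list(prefix.encode("utf-8"))
print(prefix_bytes)

full = "m\u0300h\U0001F602\U0001F60D".encode("utf-8")

# The first three bytes decode cleanly to the accented "m".
first_three = full[:3].decode("utf-8")

# The first two bytes end in the middle of the accent, so decoding fails.
try:
    full[:2].decode("utf-8")
    two_bytes_ok = True
except UnicodeDecodeError:
    two_bytes_ok = False

print(first_three == prefix, two_bytes_ok)
```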
        
        So it looks like Rust, like Go, takes this approach. And if you care about fancier manipulations, you need to use a library (e.g., https://crates.io/crates/unicode-segmentation).
        
        As a fun aside, that string breaks a couple playgrounds:
        
        https://play.rust-lang.org/?gist=9958c46c59eff8d655c818e55580d202&version=undefined&mode=undefined
        
        https://trinket.io/python/8a0742b45e
        
        Try editing text after the “m̀”; the cursor doesn’t line up correctly. You also can’t select the string in the Rust playground.
        
        The Go playground works correctly, but probably just because it uses a simple text-entry box w/o syntax highlighting or other niceties. (But would you rather have simple-but-correct or fancy-but-buggy software?)
        
        Finally, I managed to hang emacs by asking it to describe-char “m̀”.
        
        Unicode support is still janky in a lot of places!
        
        Daniel Lemire says:
        
        June 18, 2018 at 11:24 pm
        
        I have no problem understanding the result, but it is not what I expected it to do.
        
        Erin Keenan says:
        
        June 19, 2018 at 12:01 am
        
        Sorry, I was not trying to imply you didn’t understand the result, just provide some explanation/context/motivation for the result.
        
        I think what you’re saying is, “I expect a string to look like a sequence of graphemes”.
        
        Whereas Go and Rust say, “a string is a sequence of utf8 bytes”. So in that sense, it’s not what you expect.
        
        I think the Go and Rust approach is still reasonable, since they’re likely to lead to correct software. (Vs, say, Python, which is “almost right” in the default case, making it easier to make subtly-broken software.)
        
        (Come to think, perhaps a better test-case to give you would’ve been “👷‍♀️👩‍⚕️🎉👍”.)
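        A sketch of why that test case is harder, assuming it consists of a woman construction worker, a woman health worker, a party popper and a thumbs up. The first two are each built from an emoji, a zero-width joiner (U+200D), a symbol and a variation selector (U+FE0F). The escapes below encode that assumption.

```python
worker = "\U0001F477\u200D\u2640\uFE0F"   # construction worker + ZWJ + female sign + VS16
health = "\U0001F469\u200D\u2695\uFE0F"   # woman + ZWJ + staff of Aesculapius + VS16
text = worker + health + "\U0001F389\U0001F44D"

perceived = 4                                    # what a reader sees
code_points = len(text)                          # what Python indexes
utf16_units = len(text.encode("utf-16-le")) // 2  # what Java indexes
utf8_bytes = len(text.encode("utf-8"))           # what Go and Rust index

print(perceived, code_points, utf16_units, utf8_bytes)
```

        Four perceived characters become 10 code points, 14 UTF-16 code units and 34 UTF-8 bytes, so no fixed-width indexing scheme lines up with what the user sees.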
        
        The Swift approach seems reasonable too, and maybe even better since it does the right thing by default, though at the cost that you’ve got a lot of unicode complexity in your core string class, and it’s non-obvious (at least to me) what your internal string representation is going to be, or what the perf cost of various operations is going to be. (E.g., is something like “.count” on a swift string constant time, or does it have to run through the whole string calculating the graphemes?)
        
        Daniel Lemire says:
        
        June 19, 2018 at 4:04 am
        
        I think the Go and Rust approach is still reasonable, since they’re likely to lead to correct software.
        
        In what sense?
        
        You are still left to do things like normalization on your own. This makes it quite hard to do correct string searches in Go, say.
        
        Try this:
        
        package main
        
        import (
            "fmt"
            "strings"
        )
        
        func main() {
            var x = "Pok\u00E9mon"
            var y = "Poke\u0301mon"
            fmt.Println("are ", x, " and ", y, " equal/equivalent?")
            fmt.Println(x == y)
            fmt.Println(strings.Compare(x, y))
        }
        
        Sure, you can remember to use a unicode library as you say and never rely on the standard API to do string processing, but Go does not help you. If you don’t know about normalization, and try to write a search function in Go, you will get it flat wrong, I bet.
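        The same pitfall and one possible remedy, sketched in Python with the standard `unicodedata` module. Normalizing both strings to NFC before comparing is one choice among several normalization forms; the helper name is an assumption.

```python
import unicodedata

x = "Pok\u00E9mon"    # precomposed e-acute (one code point)
y = "Poke\u0301mon"   # "e" followed by COMBINING ACUTE ACCENT


def equivalent(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_equal = x == y
normalized_equal = equivalent(x, y)
print(raw_equal, normalized_equal)
```

        The raw comparison reports the strings as different; after normalization they compare equal.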