Daniel Lemire's blog


Sorting strings properly is stupidly hard

42 thoughts on “Sorting strings properly is stupidly hard”

  1. Usman Sharif says:

    C# has a concept of Invariant Culture for string comparisons. It is very useful for the exact issue you are running into. See https://docs.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo.invariantculture?view=netframework-4.7.2

  2. Jonathan says:

    Python has the very strong natsort module for this. It has a lot of special cases for things like thousands separators, character accents and character case.

    https://natsort.readthedocs.io/en/master/howitworks.html
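
    For illustration, a minimal sketch of what natsort adds over the default sort (this assumes the third-party natsort package is installed; it is not in the standard library):

    from natsort import natsorted

    items = ["1000 apples", "200 apples", "30 apples"]
    print(sorted(items))     # ['1000 apples', '200 apples', '30 apples'] (plain lexicographic)
    print(natsorted(items))  # ['30 apples', '200 apples', '1000 apples'] (numbers compared as numbers)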

    1. As far as I can tell, this is not part of the standard library.

    2. Thomas Mueller Graf says:

      IMHO “natural sort” solves a slightly different problem: the problem of sorting text that contains numbers. For example, “1000 apples” should be sorted after “200 apples”. I see the library you refer to supports both numbers and locale-specific sorting. Another related problem is sorting date values such as “Mar/29/2018” correctly. For log files, the easiest solution is to use the format “2018-03-29” (year, month, day), which avoids the problem. Trying to “detect” what is meant is quite hard.

  3. Here it is in R (in my default en_NZ.UTF-8 locale)

    > v<-c("e","a","é","f")
    > sort(v)
    [1] "a" "e" "é" "f"
    > v<-c("a","b","A","B")
    > sort(v)
    [1] "a" "A" "b" "B"

    I note that some languages do sort accented vowels at the end, eg Ø and Å in Danish. And Estonian not only sorts accented vowels to nearly the end of the alphabet, but also puts U after Z.

    As you say, it’s hard.

  4. Tomasz Jamroszczak says:

    The problem is that sorting letters and words depends on the language. Until 2006, the Swedish letter V was treated as equivalent to W for sorting purposes. Å, Ä, and Ö are at the end of the alphabet.

    Sorting Chinese is hard – some sort the characters by the number of brush strokes needed to write them.

  5. Adynatos says:

    oh yeah? how do you sort emojis?
    following unicode seems to at least be sane.

    1. If standard libraries made it easy to sort according to the Unicode Collation Algorithm (a standard) then I’d be fine with it. That’s not what they do.

      1. Travis Downs says:

        While you may be aware of this, it is worth pointing out that the Unicode Collation Algorithm is not an algorithm that takes two strings s1 and s2 and sorts them. It takes s1, s2 and a locale, and sorts s1 and s2 according to that locale.

        So the UCA is in no way a drop-in replacement for a plain s1 < s2 type comparison, because it implies the complicated question “what locale do you want to do this sort in”?

        I 100% agree that languages should provide locale sensitive sorts, and I would be surprised if most mainstream, modern ones don’t – but it’s hard to argue it should be the default.

        There is a “default” collation order available for UCA, but using this is unlikely to make everyone happy (the concept of a “default” locale is likely to raise some hackles) and you’ll get a slowdown for no good reason – still, using the default UCA locale as the language-level default sort is a better option than using the implicit environment locale, although worse than “binary”, IMO.
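
        To make the “it takes a locale” point concrete, here is a minimal sketch using the third-party PyICU wrapper around ICU (assuming it is installed); the same two strings come out in a different order depending on the locale handed to the collator:

        import icu  # PyICU, a wrapper around ICU

        words = ["zebra", "ähnlich"]

        german = icu.Collator.createInstance(icu.Locale("de_DE"))
        swedish = icu.Collator.createInstance(icu.Locale("sv_SE"))

        print(sorted(words, key=german.getSortKey))   # ['ähnlich', 'zebra'] (ä sorts with a)
        print(sorted(words, key=swedish.getSortKey))  # ['zebra', 'ähnlich'] (ä sorts after z)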

        1. I 100% agree that languages should provide locale sensitive sorts, and I would be surprised if most mainstream, modern ones don’t

          Have a look at the answers I got so far:

          1. “Go has x/text which may be included in the standard library (…)” Not exactly encouraging, is it?
          2. Regarding Python, a commenter pointed me to a non-standard library (natsort), and there are questions as to whether it solves the right problem. The locale package seems like a more concrete answer, but it fails on my machine (as the commenter anticipates) and requires non-obvious (to me) invocations like locale.setlocale(locale.LC_ALL, 'fr_FR'). Granted, programming is hard, but some things are more obvious than others.
          3. For Swift, one commenter refers to Objective-C. But the documentation regarding strings in the current Swift does not say anything about collation. The string class has no helpful attribute that I can spot.

          I submit to you that a tricky technical interview question would be “how do you sort unicode strings in Python”. Then “how do you sort unicode strings in Swift”. Even with access to the Internet, I suspect that even good engineers would fail to produce a working solution.

          This should be clearly documented and accessible. It is not.

          I don’t expect C to provide this sort of thing. One commenter points out that C++ does. I think Java and C# do. For Swift 4, the answer is unknown to me at this point. For Go, the answer seems to be “no, the language does not support it”. But even when languages do support it, why can’t it just be something straightforward and documented, like thisarray.sort(ThisLocale.comparator)?

          1. Travis Downs says:

            Huh, the situation is worse than I expected (I don’t use those languages regularly, or at least I don’t care about collation order when I use them).

            That said, it seems at least Go and Swift do support this, even if the documentation is poor. There is an example for Swift in a Swedish locale, and as far as x/text in Go goes, it seems to me that the x/ libraries are indeed “part of Go” but not part of “Go core development”. I.e., the x/ classification means a different development process and different compatibility requirements, but they are still “included”. Kind of weird for something as basic as collation.

            One reason might be the ubiquity of ICU. As far as I know, this is the widely available “go to” library for full-on collation and the one I’ve ended up using in any serious setting. So maybe people prefer to just use the consistent ICU library rather than language facilities for this?

            On Python, the answer seems to be the “locale” package, as you point out, but it seems it is not cross-platform (i.e., the locales are named differently on Windows and Linux). I guess it would at least work for the “implicit locale” case, though. Maybe there is yet another way to get standard locale names? Seems messy.
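
            A hedged sketch of the workaround, in Python (the Windows locale name below is an assumption about that platform’s naming scheme; adjust as needed):

            import locale

            # glibc systems use names like "fr_FR.UTF-8"; Windows uses names
            # like "French_France.1252" (assumed here for illustration).
            for name in ("fr_FR.UTF-8", "fr_FR", "French_France.1252"):
                try:
                    locale.setlocale(locale.LC_COLLATE, name)
                    break
                except locale.Error:
                    continue

            print(sorted(["e", "a", "é", "f"], key=locale.strxfrm))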

            1. The Swift example you offer is for old Swift versions. I am sure it is possible to do with the current Swift version, but the approach described there does not work.

              That is, it is not that Swift does not support it, it is that it is stupidly hard to find how to do it.

              1. Travis Downs says:

                Based on my search, I agree with you. It is both poorly documented, and the Swift community is apparently not big enough, or doesn’t care enough (or isn’t well-indexed enough, with enough SEO), to provide an easy-to-find answer for newer versions.

          2. Travis Downs says:

            I replied here but it didn’t show up, maybe lost in moderation?

            1. Hmmm? I checked and all the comments I have from you are approved. BTW did you know that you have 138 comments on my blog. True fact. (Obviously, I am appreciative.)

              1. Travis Downs says:

                It showed up, it was the comment that mentioned ICU above this one. Feel free to delete this branch.

  6. Malte says:

    With Apple’s Objective-C (in the Foundation framework I guess) you’d use -localizedCompare: with NSArray’s -sortedArrayUsingSelector: (or -sortUsingSelector: for in-place sort of mutable arrays).

    The comparator uses the user’s current locale. There’s a bunch of string comparison methods that allow you to use specific locales and other options.

    I believe there are corresponding APIs in Swift.

    1. I believe there are corresponding APIs in Swift.

      Do you happen to have a reference to localizedCompare in Swift 4? I looked briefly but did not find it.

      1. Malte says:

        Sorry for the late reply! Holidays and stuff.

        Hmm, googling gives me this in Apple’s online documentation. Using the comparator boils down to, e.g.,

        let array = ["a", "B", "b", "è", "A", "e"]
        let sorted = array.sorted {
            $0.localizedCompare($1) == .orderedAscending
        }

        Does that work for you?

        1. Your code sample assumes that the String struct in Swift has a method called localizedCompare. It does not:

          https://developer.apple.com/documentation/swift/string

          However, I have finally figured out how to do it:

          import Foundation

          ["a","e","é","f"].sorted {($0 as NSString).localizedCompare(($1)) == .orderedAscending}

          The trick is to “cast” the standard Swift strings to NSString. The standard Swift strings do not support localizedCompare.

          To be clear, this means that localized comparison is not part of the Swift standard library, you need “Foundation” and you need to cast your strings over to Foundation’s NSString. Importantly, this means that if you search the standard library, you will not find out how to do localized comparison.

          Thus localized comparisons are clearly viewed as outside of the core functionality of the language.

          I think that’s wrong.

          1. Xiaodi Wu says:

            You should not have to bridge String to NSString in order to use localized comparison. As a special rule, methods on NSString magically appear as available for String when you import Foundation. This is because Foundation is a core library and its features for strings are essential to the language.

  7. Brian Kessler says:

    Go has x/text which may be included in the standard library in the future.
    That includes collation for many languages. I believe that the language.Und (default language) uses the default collator.

    package uca

    import (
        "testing"

        "golang.org/x/text/collate"
        "golang.org/x/text/language"
    )

    func TestUCASort(t *testing.T) {
        v := []string{"e", "a", "é", "f"}
        c := collate.New(language.Und)
        c.SortStrings(v) // [a e é f]
        t.Log(v)
        v = []string{"a", "b", "A", "B"}
        c.SortStrings(v) // [a A b B]
        t.Log(v)
    }

    Additionally, the Python standard library uses the locale module for localization and string collation.

    import locale

    locale.setlocale(locale.LC_ALL, '')
    x=["e","a","é","f"]
    x.sort(key=locale.strxfrm)

    However, I tried to test out the sorting with the fr_ca locale and got the incorrect answer, which I found out was due to incorrect locale settings on Mac OS X/BSD. On my machine, the fr_FR.UTF-8 collation is linked to la_LN.US-ASCII:

    ls -l /usr/share/locale/fr_FR.UTF-8/
    lrwxr-xr-x 1 root wheel 28 Oct 31 10:37 LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
    lrwxr-xr-x 1 root wheel 17 Oct 31 10:37 LC_CTYPE -> ../UTF-8/LC_CTYPE
    drwxr-xr-x 3 root wheel 96 Oct 31 10:37 LC_MESSAGES
    lrwxr-xr-x 1 root wheel 30 Oct 31 10:37 LC_MONETARY -> ../fr_FR.ISO8859-1/LC_MONETARY
    lrwxr-xr-x 1 root wheel 29 Oct 31 10:37 LC_NUMERIC -> ../fr_FR.ISO8859-1/LC_NUMERIC
    -r--r--r-- 1 root wheel 364 Aug 17 15:58 LC_TIME

    So the issue with string sorting goes beyond programming languages to properly handling collation in the OS as well. Sorting strings is hard…

    1. Travis Downs says:

      Isn’t x/text already part of the standard library? At least, aren’t the x/ packages included with Go but simply “developed apart from Go core” and having some different compatibility requirements?

      Said another way, if I “install Go” does it come with x/text?

      (in case it’s not clear, I am not very familiar with Go)

      1. The list of packages in the Go standard library is available there:

        https://golang.org/pkg/#stdlib

        It does not include the x packages.

        You can find the x packages there:

        https://golang.org/pkg/#other

        Here is the comment that describes them:

        These packages are part of the Go Project but outside the main Go tree. They are developed under looser compatibility requirements than the Go core. Install them with “go get”.

        1. The Go collation code is available on GitHub. It is short but filled with scary “TODOs”: https://github.com/golang/text/blob/master/collate/collate.go#L205

        2. Travis Downs says:

          Well, I guess the best that can be said for Go then is that there is support for locale-aware collation “in the Go project” but not as part of the standard library. Maybe one day? At least they have the excuse of being a relatively young language (but I don’t really buy it because it also puts them in the “modern” category which means they should get this stuff right from day 1). Perhaps the excuse is that Go is more minimalist compared to other languages.

          There is also an argument to be made for how much belongs in the stdlib and how much is left to outside libraries – but I would personally think that basic locale support, including collation should be in the stdlib.

          1. There is also an argument to be made for how much belongs in the stdlib and how much is left to outside libraries – but I would personally think that basic locale support, including collation should be in the stdlib.

            Sorting, comparing and hashing strings in a natural-language way definitely belongs in the standard library. I can excuse C, for obvious reasons… but other languages should support this very well.

            I could live with a language that exports this to a well made external library, but it should be clearly documented and easy to find.

            It is the kind of things that are almost impossible to build on your own but that everyone needs to get right from time to time. And when you do need to do it, you should not need to spend days on it.

            Note that the default sort does not even work for ascii-only English. And English does use accents!
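
            A tiny Python illustration of the ASCII-only point: the default comparison puts every uppercase word ahead of every lowercase one, which no dictionary does.

            words = ["banana", "Apple", "cherry", "Banana"]
            print(sorted(words))                    # ['Apple', 'Banana', 'banana', 'cherry']
            print(sorted(words, key=str.casefold))  # ['Apple', 'banana', 'Banana', 'cherry'] (closer to a dictionary)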

            1. Travis Downs says:

              Note that the default sort does not even work for ascii-only English.
              And English does use accents!

              Well the crux of my original argument was that it does “work” – because using binary collation is totally reasonable for the “default sort” (i.e., also for default comparison and equality and hashing).

              So there are two separate points of contention here:

              1) Is the default comparison (equality, hashing, etc) behavior of programming languages reasonable? (eg that AaBb is sorted as ABab)

              2) When locale-aware sorting is wanted, is it available and documented?

              I’m willing to concede the answer certainly appears to be “no” for (2) at least for many mainstream languages, and I was wrong here since I had assumed based on my experience from other languages that everyone is getting this right today.

              I don’t agree with you on (1) at all though! You appear to be arguing that in the absence of any global agreement on how to collate, the languages should at least try to get some bits right for English and I guess French speakers and also perhaps indirectly for languages that share some of those conventions, and perhaps also for other languages whose character sets don’t overlap?

              I don’t agree: there is no good default order, so at this point you might as well just use whatever lexicographic sort order is fastest which is what most languages do. Maybe there is an argument for making it “obviously wrong” for human consumption, by doing some xor on the letters or something, but that seems a bit extreme!

              I find this reasoning confusing:

              Human beings understand that the characters e, é, E, É, è, ê, and so
              forth, should be considered as the same letter (e) with accents. There
              are exceptions to this rule, but the default which consists in sorting
              accentuated characters after the letter ‘z’ is just not reasonable.

              Actually, human beings do not agree on that! You mention “there are exceptions” – but the exceptions are not some weird letters that everyone agrees should be sorted differently; rather, people do not even agree on how the same letters should be sorted. Just take a look at this list.

              You mention accents, as an obvious case – but a quick scan of that list shows it may not be so obvious at all. There seem to be at least four different strategies that I can pick out: treating a character with a diacritic the same as the unadorned letter (except for tie breaking?), treating it as coming after the unadorned letter, putting it somewhere else in the alphabet entirely (usually after z for Latin-using scripts), or the “French way”, which is apparently to treat accented characters the same as unadorned ones, but to tie-break on the accented characters starting from the right end of the string and working backwards (this is news to me, and I’ve spent my fair share of time looking up words in a French dictionary).

              You do not want this mess in your default string ordering (where it can also infect equality, hashing, etc)!

              You also do not want to be the one choosing between German or French ways of collating, or Russian or Ukrainian, or any other list of hard choices for your “best effort global collation to rule them all”. You do not want to be the one who got it working “alright” for French but left out every Asian script, etc.

              Finally, and something I forgot before, proper collating breaks all sorts of relationships you’d expect of a “normal” lexicographic sort. For example, if for two strings a, b you have a < b (and a is not a prefix of b), you would expect that to remain true if you append some additional (perhaps identical) characters to a and b, since after all that’s how lexicographic compare works; but proper collation doesn’t follow this rule, and many similar expected relationships fail too (at least it is transitive, when properly implemented, or all hell would really break loose).
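
              A concrete sketch of that last point, using the third-party PyICU wrapper (assuming it is installed): “cote” sorts before “côte” on a secondary (accent) difference, but appending one letter to each string flips the order, because the new primary difference outweighs the earlier accent difference.

              import icu  # PyICU wrapper around ICU

              c = icu.Collator.createInstance(icu.Locale("en_US"))
              print(c.compare("cote", "côte"))    # negative: "cote" sorts before "côte"
              print(c.compare("cotez", "côtea"))  # positive: the appended letters reverse the order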

  8. Handled properly in the C++ standard library: https://en.cppreference.com/w/cpp/locale/collate.

  9. Travis Downs says:

    I disagree.

    While it’s true that “default” sorts don’t make sense in a lot of ways, there is no single sort that makes sense for everyone, because different languages have different ways of sorting.

    So while you say:

    You might prefer A to come before a, or vice versa, but no human being would ever sort the letters as A,B,a,b or a,b,A,B.

    This claim may not be true, although a few minutes of Googling didn’t turn up a counterexample, and most other obvious orderings are broken by some language or other. Even if it is true for A and B, it is definitely not true if you replace A and B by some other letters that you think sort in an obvious way, because someone from another culture has the opposite opinion.

    So for things displayed to people, you need a locale-aware collation. This isn’t a single collation algorithm, but a family, one per locale; it is a well-studied problem, solved among other ways by the “Unicode Collation Algorithm”.

    So then the question is, should the default sort in a language try to be locale-aware, implicitly pulling in a locale from “somewhere” and using that? I’d strongly argue “no”. For one thing, there is a huge advantage to having library functions that behave identically on every platform and system, and pulling the widely-varying locale implicitly into core comparison functions breaks this.

    Second, it’s slow – and the slowness may vary by locale.

    Third, probably an overwhelming amount of the time, the purpose of a sort is simply to put some strings in an ordered collection or something like that: not for end-user display. So you would be optimizing for an uncommon case.

    Fourth, if you make the sort locale-aware, do you also make equality locale-aware? In general, you’d like a <= b && b <= a to imply a == b, which means your equality should be locale-aware too, but that raises all the same issues as above for equality, and in that case they are even more in favor of “binary compare”, I think.
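
    To see why locale-aware equality is troublesome, here is a small sketch with the third-party PyICU wrapper (assuming it is installed): a collator at primary strength considers two visibly different strings “equal”, which binary equality and hashing never would.

    import icu  # PyICU wrapper around ICU

    c = icu.Collator.createInstance(icu.Locale("en_US"))
    c.setStrength(icu.Collator.PRIMARY)      # ignore case and accents
    print(c.compare("résumé", "RESUME"))     # 0: "equal" under this collator
    print("résumé" == "RESUME")              # False: binary equality disagrees
    print(hash("résumé") == hash("RESUME"))  # False: so would any hash table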

    Fifth, even when locale-awareness is the goal, just pulling whatever locale the process is running under is often the wrong choice. If it’s a web server, you don’t care about the locale of the running machine, but of the remote user, and so on. So locale-awareness isn’t something that can be swept under the covers anyways in many cases. It’s something you have to deal with explicitly, so more or less equally hard whether sort implicitly takes locale into consideration anyways.

    So in my opinion, the ideal scenario is that the language provides “binary” equality and comparison operators by default, and robust locale-sensitive methods as well with configurable locale. That’s exactly what most languages do.

    1. Travis Downs says:

      Perhaps I should amend:

      That’s exactly what most languages do.

      to

      That’s exactly what some languages do, and the rest should step up their game.

    2. Maynard Handley says:

      “Third, a probably overwhelming amount of the time the purpose of a sort is simply to put some strings in an ordered collection or something like that: not for end-user display. So you would be optimizing for an uncommon case.”

      I agree with most of your points, but this one is tricky.
      You have two pools of clients. One is the guys writing linkers and parsers and suchlike, with the set of concerns you’re listing (“I don’t care about the precise details, but I want fast and stable sorts”); the other is developers writing user-facing code who just want the right thing to happen.

      Probably the fault is language minimalism that tries to force two very different functions into the same keyword/library function.
      Perhaps languages should just bite the bullet and accept the necessity for both
      ByteSort(string) and TextSort(string, locale [= by default from the OS])?
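
      A rough Python sketch of that split, just for illustration (the names mirror the suggestion above; temporarily swapping LC_COLLATE is not thread-safe):

      import locale

      def byte_sort(strings):
          # Code-point order, no locale involved.
          return sorted(strings)

      def text_sort(strings, loc=""):
          # Locale-aware sort; the empty string means "whatever the OS provides".
          old = locale.setlocale(locale.LC_COLLATE)     # remember the current setting
          try:
              locale.setlocale(locale.LC_COLLATE, loc)
              return sorted(strings, key=locale.strxfrm)
          finally:
              locale.setlocale(locale.LC_COLLATE, old)  # restore it

      print(byte_sort(["a", "b", "A", "B"]))  # ['A', 'B', 'a', 'b']
      print(text_sort(["a", "b", "A", "B"]))  # typically ['a', 'A', 'b', 'B'] under a real locale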

      1. Travis Downs says:

        Note that I probably should have said “comparisons” not “sorts” – since the string comparison behavior is at issue here and sorts are just one (obvious) way the comparison order is surfaced.

        I considered user-facing code when I wrote that. Look at any application that is user facing and see how many string comparisons are in there, and then how many of those end up displayed to the user as part of a sort.

        There are many where nearly all the comparisons are internal, especially if the languages uses ordering as a core component of its containers, as say C++ does (where the default containers like std::map and friends use ordering as their key relation).

        I’ll grant you that there might be some applications, especially small ones, where comparisons are primarily for sorted display to the user, but I think most will be “internal” uses.

        There is an even stronger case for equality – I think most would agree that the overwhelming majority of equality checks in user-facing programs are internal mechanics that won’t be displayed to the user, and even when they do affect user-visible stuff, “binary equality” (an exact code-point by code-point comparison) works fine. So if you agree that binary equality is desirable, but you want locale-sensitive comparison, you are left with the awkward choice of having comparison operators which are inconsistent with equals. Yuck… stay away from that idea!

        1. Maynard Handley says:

          Aren’t the sorts of examples you’re giving a sort of “premature optimization”? You’re asserting that user facing apps “need” to utilize high-performance maps and hashes and equates and suchlike, and I’m not sure that’s at all true.

          Sure, I can believe that there are some backends like Twitter and Google that need to do whatever they can to run strings fast in their data centers. But they’re also capable of using code correctly. For the default programmer, and the default app, there seems no reason whatsoever to optimize for speed over “correctness”, however defined.

          Maybe, alternatively, the flaw is in ever allowing a sort() routine that doesn’t FORCE the use of a locale (and thus thinking about the issue)?
          So you can have
          sort(strings, kDefaultLocale [= what the OS provides])
          sort(strings, kBytesLocale)
          sort(strings, explicit locale)
          but just sort(strings) does not exist as an option?

          1. Travis Downs says:

            No, I don’t think it’s premature to have reasonably fast collections by default in a language – it’s what most people want.

            However, fast or slow collections are a red herring. It’s actually about fast or slow default string comparisons. I think people want fast string operations in general, yes. The concept of “premature optimization” doesn’t apply in the same way to library or base-language features as it does to a finished application. In a finished application you probably understand the input domain and the user-facing use cases, so you can tell which operations need to be sped up to improve the user experience, and the rest can be left alone.

            For a standard library or built-in language feature, you generally have no idea, or, more precisely – you have to support a huge spectrum of uses from people who don’t care about performance to cases where the cost of the operation will dominate runtime. So the tradeoffs are a bit different: you should make it as fast as possible within the constraints of a reasonable API, and if you have to make your API worse to make it faster, you might consider it depending on the goals of your language, your typical user, availability of alternate APIs etc.

            All that said, this isn’t mostly about performance anyway, but largely about correctness and adhering to the principle of least surprise. I would expect all the basic types to compare identically everywhere. Certainly 0 < 1 everywhere, etc. If strings silently start pulling in the locale for comparisons, you are often making an invalid assumption (that whatever default locale is being used is at all appropriate, even if the user wanted locale-based collation in the first place), and you just unleashed a bunch of bugs in many locales that aren’t “en_US”. I would never want my default string comparisons to change behavior based on what user the process was running under, what locale was set by some code I wasn’t aware of, what environment variables were set, etc.

            So comparing by Unicode code-point value seems an entirely reasonable default to me, and this is exactly what almost every modern language (and many ancient ones) does.

            Now there is a time and place for locale-aware display at the UI layer, and languages should better support this use case – but I definitely disagree you want to involve that with the comparator for the core string type.

  10. Max Lybbert says:

    I’m surprised how little this and other globalization questions come up. I find it fascinating. If I didn’t have experience in the industry, I would assume it isn’t mentioned in blog posts for the same reason unexpected integer division results aren’t mentioned: you get bit once and learn your lesson.

    Instead, I’m sure that while everybody knows that the C locale (sorting ASCII-betically, as they used to say in Perl) isn’t good enough, very few people know what is. And, if the ICU library’s API is anything to go by, the people who do know how to solve the problem apparently have no taste in API design.

    1. Travis Downs says:

      I have some experience with this, and in my experience it definitely “comes up”, but apparently not in a way that results in languages having first-class support for it.

      In any enterprise product localization is a big deal and everyone mostly seems to use third party libraries (read this as “ICU”) to support locale aware everything.

  11. Maynard Handley says:

    Mathematica (as usual) seems to have put some thought into this:
    Defaults work the way you want:
    Sort[{"e", "a", "é", "f"}]
    {"a", "e", "é", "f"}

    Sort[{"a", "b", "A", "B"}]
    {"a", "A", "b", "B"}

    But there has also been thought (and there exist rules) for ordering of “semantically significant” format/font variants, so

    Sort[{"e", "a", "é", "f", "\[ExponentialE]"}]
    {"a", "e", "é", "f", "\[ExponentialE]"}

    That latter element (\[ExponentialE]) would display in Mathematica as a typographic variant of e, but it’s the “e” of mathematics, and it is sorted as a different class that comes after “text”.

    1. I am not surprised regarding Mathematica, given that tools like Excel get it right.

  12. Anton Bobrov says:

    I agree that the documentation regarding Unicode collation is obscure. But the topic is complex in itself. Proper sorting is UI-specific for a particular viewer and involves many external factors, which is hard to handle with a default behavior. It would be almost as wrong as the current ASCII (binary) default, but also much slower.

    Python can use system locale settings:

    import locale
    locale.setlocale(locale.LC_ALL, '')
    print(sorted(["e","a","é","f"], key=locale.strxfrm))
    # ['a', 'e', 'é', 'f']

    It also works with my en_US locale, because it defines proper order for diacritic “chars”.

  13. Ned Harding says:

    I love this! Having dealt with just this issue writing Alteryx (a data processing application), it was a huge struggle. For the reasons below I promise you that no data system has ever gotten this “right” and I am not sure it is even possible.

    Unfortunately the locale thing gets even more complicated when dealing with databases. The locale of the computer doesn’t matter, only the locale of the data. So I, as an American looking at French data, need to use a French locale. There is no reasonable default in code that can anticipate this. So I punted and made the user decide the locale per sort tool (one Alteryx workflow can have multiple data connections).

    But it gets worse! Imagine a database of names (think clients) that cross international borders. In order to properly sort a subset of this database, the locale rules would need to be applied according to which country each user is in. That means that depending on how I filter and sort my database, I need to pick a different locale each time – and I have not even changed computers or data sources! And how am I meant to sort when I am including names from various countries?

  14. Andrew says:

    For future reference:

    For R:
    – stringi: stri_sort provides locale-sensitive sorting
    – stringr: str_sort provides locale-sensitive sorting

    For Python:
    – locale: provides locale-sensitive sorting on Windows and on OSes using glibc. Fails on systems using BSD libc, musl libc, and libc implementations on embedded systems.
    – pyicu: a wrapper for icu4c; create a collator from an existing locale, or build a collator from custom rules (see the sketch at the end of this comment)

    Draft notes available on github
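
    As an example of the last option, here is a minimal pyicu sketch (assuming PyICU is installed) that builds a collator from a custom tailoring rule, in this case the traditional Spanish treatment of “ch” as a letter sorting after “h”:

    import icu  # PyICU wrapper around ICU

    collator = icu.RuleBasedCollator("&h < ch")  # tailor the root order: "ch" sorts after "h"
    print(sorted(["cholera", "city", "hotel"], key=collator.getSortKey))
    # ['city', 'hotel', 'cholera']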