Daniel Lemire's blog

Sorting already sorted arrays is much faster?

10 thoughts on “Sorting already sorted arrays is much faster?”

  1. KWillets says:

    I’ve tried to find a difference between Dutch National Flag algorithms based on the number of swaps they do (the classic partition is sketched below), but nothing seemed to show up in the timings.

    String sorting is also particularly interesting on current architectures, for both caching and branch prediction reasons.

    One typo to note: you put an n in front of “log n is 20”.
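
    For reference, a minimal sketch of the classic three-way (Dutch National Flag) partition; the variants being compared differ mainly in how they arrange these swaps (the function name here is illustrative):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Dutch National Flag: rearrange v into [ < pivot | == pivot | > pivot ].
    void three_way_partition(std::vector<int>& v, int pivot) {
        std::size_t low = 0, mid = 0, high = v.size();
        while (mid < high) {
            if (v[mid] < pivot)
                std::swap(v[low++], v[mid++]);  // extend the < region
            else if (v[mid] > pivot)
                std::swap(v[mid], v[--high]);   // extend the > region
            else
                ++mid;                          // value already in place
        }
    }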

    1. Variable-length or very long strings can make it hard for the processor to look ahead. That’s why it is a good idea to get rid of strings when you can.

      1. KWillets says:

        Well, I could go on about that, but for long strings (total distinguishing prefixes D >> n log n) the most common character comparison is equality, so branch prediction is pretty easy. It’s the cache complexity that sucks — many classic algorithms are O(D) in cache misses.

        1. “so branch prediction is pretty easy”

          I did not use the expression “branch prediction” anywhere because I think that counting branch mispredictions is too simplistic.

          Variable-length strings can seriously hurt superscalar execution even if there are few branch mispredictions… because the processor can’t tell when the string ends… and so it does not start working on the next string.

          You are right that memory accesses are going to be expensive, but if they can be predicted ahead of time, they can be free… because the data has been prefetched.

          The problem with variable-length strings is that they can blind the processor to what is coming next. It just sees the string and cannot look ahead to the work that needs to be done once the loop over the string ends.

          Of course, if the strings do not fit in CPU cache, then you are going to start having performance trouble… at some point, you cannot sort faster than you can read and write to RAM… but there are cache-friendly algorithms that can help…
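
          As an illustration of getting rid of variable-length work (a sketch of the general idea, not code from the post): cache a fixed-width prefix of each string next to its pointer, so that most comparisons read a contiguous array of fixed-size records and only ties have to chase the pointer into variable-length data. The names below are hypothetical:

          #include <algorithm>
          #include <cstddef>
          #include <cstring>
          #include <string>
          #include <vector>

          // Fixed-size record: a zero-padded, memcmp-comparable prefix plus a
          // pointer to the full string, used only to break ties.
          struct keyed_ref {
              char prefix[8];
              const std::string* full;
          };

          // Returns pointers to the strings in sorted order.
          std::vector<const std::string*>
          sort_with_cached_prefixes(const std::vector<std::string>& strings) {
              std::vector<keyed_ref> refs(strings.size());
              for (std::size_t i = 0; i < strings.size(); ++i) {
                  std::memset(refs[i].prefix, 0, sizeof refs[i].prefix);
                  std::memcpy(refs[i].prefix, strings[i].data(),
                              std::min(strings[i].size(), sizeof refs[i].prefix));
                  refs[i].full = &strings[i];
              }
              std::sort(refs.begin(), refs.end(),
                        [](const keyed_ref& a, const keyed_ref& b) {
                            int c = std::memcmp(a.prefix, b.prefix, sizeof a.prefix);
                            if (c != 0) return c < 0;  // fixed-size work: easy to pipeline
                            return *a.full < *b.full;  // rare fallback: chase the pointer
                        });
              std::vector<const std::string*> order;
              order.reserve(refs.size());
              for (const keyed_ref& r : refs) order.push_back(r.full);
              return order;
          }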

          1. KWillets says:

            I see, yes, I mistook branch prediction for cache prediction. The latter indeed suffers with anything using indirect references, which the memory prefetcher cannot predict, and the pointers are by definition permuted relative to the original data (or at least they become that way after a few rounds). That’s what makes it so exciting.

            The good news is that it’s possible to sort with only O(n log n) real cache misses, with the other O(D) character accesses being contiguous and prefetchable.

  2. Michel Lemay says:

    Hi Daniel, so what is the result of that timsort on your machine?

    1. @Michel

      Timsort is a pretty good idea. It is about 50% slower on shuffled arrays, but drastically faster (10x to 20x) on sorted and reverse-sorted ones.

      std::sort(v.begin(), v.end()) [           std::sort(v.begin(), v.end())]:  40.95 cycles per element
      gfx::timsort(v.begin(), v.end()) [           std::sort(v.begin(), v.end())]:  2.01 cycles per element
      std::sort(v.begin(), v.end()) [ std::random_shuffle(v.begin(), v.end())]:  234.56 cycles per element
      gfx::timsort(v.begin(), v.end()) [ std::random_shuffle(v.begin(), v.end())]:  341.48 cycles per element
      std::sort(v.begin(), v.end()) [         std::sort(v.rbegin(), v.rend())]:  32.29 cycles per element
      gfx::timsort(v.begin(), v.end()) [         std::sort(v.rbegin(), v.rend())]:  2.65 cycles per element
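
      For readers who want to reproduce numbers of this kind, a rough harness along these lines will do (a sketch, not the post’s actual benchmark code: it reports nanoseconds rather than cycles per element, uses std::shuffle in place of the deprecated std::random_shuffle, and assumes gfx::timsort comes from the cpp-TimSort header):

      #include <algorithm>
      #include <chrono>
      #include <cstddef>
      #include <cstdio>
      #include <random>
      #include <vector>
      #include "gfx/timsort.hpp"  // from https://github.com/timsort/cpp-TimSort

      // Time one sort over a fresh copy of the input; report ns per element.
      template <typename SortFn>
      double bench(SortFn sort_fn, std::vector<int> v) {
          auto start = std::chrono::steady_clock::now();
          sort_fn(v);
          auto stop = std::chrono::steady_clock::now();
          return std::chrono::duration<double, std::nano>(stop - start).count()
                     / static_cast<double>(v.size());
      }

      int main() {
          const std::size_t n = 1 << 20;
          std::vector<int> sorted(n);
          for (std::size_t i = 0; i < n; ++i) sorted[i] = static_cast<int>(i);
          std::vector<int> shuffled = sorted;
          std::mt19937 gen(42);
          std::shuffle(shuffled.begin(), shuffled.end(), gen);

          auto std_sort = [](std::vector<int>& v) { std::sort(v.begin(), v.end()); };
          auto tim_sort = [](std::vector<int>& v) { gfx::timsort(v.begin(), v.end()); };

          std::printf("std::sort,    sorted:   %6.2f ns/elem\n", bench(std_sort, sorted));
          std::printf("gfx::timsort, sorted:   %6.2f ns/elem\n", bench(tim_sort, sorted));
          std::printf("std::sort,    shuffled: %6.2f ns/elem\n", bench(std_sort, shuffled));
          std::printf("gfx::timsort, shuffled: %6.2f ns/elem\n", bench(tim_sort, shuffled));
          return 0;
      }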
      
  3. KrzaQ says:

    You should compare different implementations of the standard library. If I recall correctly, clang’s `libc++` is optimized for the common cases, whereas gcc’s `libstdc++` is not. (A quick way to test this is sketched at the end of this thread.)

    1. Do you expect that my analysis will depend on the standard library I use?

      1. KrzaQ says:

        I sure hope so, otherwise their optimizations would be pointless 🙂
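
    To test KrzaQ’s suggestion, the same micro-benchmark can be built against both standard libraries and compared; the flags below are the usual ones, and results will vary with compiler and library versions:

    // Build twice and compare the two binaries:
    //   g++ -O2 bench.cpp -o bench_libstdcxx                   (GNU libstdc++)
    //   clang++ -O2 -stdlib=libc++ bench.cpp -o bench_libcxx   (LLVM libc++)
    // Any gap on sorted input then reflects the std::sort
    // implementation rather than the harness.
    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> v(1 << 20);
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = static_cast<int>(i);
        auto start = std::chrono::steady_clock::now();
        std::sort(v.begin(), v.end());  // already-sorted input: the contested case
        auto stop = std::chrono::steady_clock::now();
        std::printf("%.2f ns/element\n",
                    std::chrono::duration<double, std::nano>(stop - start).count()
                        / static_cast<double>(v.size()));
        return 0;
    }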