I’ve tried to find a difference between Dutch National Flag algorithms based on the number of swaps they do, but nothing seemed to show up in the timings.
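For concreteness, here is a minimal sketch of the textbook three-way partition (Dijkstra's variant), not any particular implementation under test; the variants mostly differ in how they arrange these swaps:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Textbook Dutch National Flag (three-way) partition, Dijkstra's variant.
// Illustrative sketch only.
void dnf_partition(std::vector<int> &v, int pivot) {
    std::size_t lo = 0, mid = 0, hi = v.size();
    while (mid < hi) {
        if (v[mid] < pivot)
            std::swap(v[lo++], v[mid++]);   // extend the "< pivot" prefix
        else if (v[mid] > pivot)
            std::swap(v[mid], v[--hi]);     // extend the "> pivot" suffix; recheck v[mid]
        else
            mid++;                          // "== pivot" stays in the middle
    }
}
```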
String sorting is also particularly interesting on current architectures, for both caching and branch prediction reasons.
One typo to note: you put an n in front of “log n is 20”.
Variable-length or very long strings can make it hard for the processor to look ahead. That’s why it is a good idea to get rid of strings when you can.
KWillets says:
Well, I could go on about that, but for long strings (total distinguishing prefixes D >> n log n) the most common character comparison is equality, so branch prediction is pretty easy. It’s the cache complexity that sucks — many classic algorithms are O(D) in cache misses.
I did not use the expression “branch prediction” anywhere because I think that counting branch mispredictions is too simplistic.
Variable-length strings can seriously hurt superscalar execution even if there are few branch mispredictions… because the processor can’t tell when the string ends… and so it does not start working on the next string.
You are right that memory accesses are going to be expensive, but if they can be predicted ahead of time, they can be free… because the data has been prefetched.
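As a minimal sketch, this is what an explicit software prefetch looks like with the GCC/Clang `__builtin_prefetch` intrinsic (the access pattern and prefetch distance are illustrative, not taken from any benchmark):

```cpp
#include <cstddef>

// Sketch: software prefetching for a predictable indirect access pattern.
// By the time we dereference ptrs[i], the line was requested 8 iterations ago.
long sum_indirect(const int *const *ptrs, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(ptrs[i + 8]);  // request the line we will need soon
        total += *ptrs[i];
    }
    return total;
}
```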
The problem with variable-length strings is that they can blind the processor to what is coming next. It just sees the string and does not look at the work that needs to be done after looping over the string.
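Here is a minimal illustration of that data dependence (hypothetical functions, for illustration only):

```cpp
#include <cstddef>

// With a null-terminated string, the loop's trip count is a data dependency:
// the processor cannot commit to the work that follows until the terminating
// byte has actually been loaded and tested.
std::size_t hash_cstr(const char *s) {
    std::size_t h = 0;
    while (*s != '\0') {  // exit condition depends on loaded data
        h = h * 31 + static_cast<unsigned char>(*s++);
    }
    return h;
}

// With an explicit length, the trip count is known before the loop starts,
// so the hardware can overlap iterations and begin the next string's work.
std::size_t hash_span(const char *s, std::size_t n) {
    std::size_t h = 0;
    for (std::size_t i = 0; i < n; i++) {
        h = h * 31 + static_cast<unsigned char>(s[i]);
    }
    return h;
}
```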
Of course, if the strings do not fit in CPU cache, then you are going to start having performance trouble… at some point, you cannot sort faster than you can read and write to RAM… but there are cache-friendly algorithms that can help…
KWillets says:
I see; yes, I mistook branch prediction for cache prediction. The latter indeed suffers with anything using indirect references, which the memory predictor can’t predict, and the pointers are by definition permuted vs. the original data (or at least they become that way after a few rounds). That’s what makes it so exciting.
The good news is that it’s possible to sort with only O(n log n) real cache misses, with the other O(D) character accesses being contiguous and prefetchable.
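One sketch in that spirit, assuming a simple prefix-key trick rather than any particular published algorithm: keep a fixed-size key prefix next to each pointer so that most comparisons only touch a contiguous array, and the out-of-cache string data is read only to break ties.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct Key {
    uint64_t prefix;       // first 8 bytes, packed so integer order == byte order
    const std::string *s;  // fall back to the full string on ties
};

// Pack the first 8 bytes big-endian; shorter strings are zero-padded.
uint64_t first8(const std::string &s) {
    unsigned char buf[8] = {0};
    std::memcpy(buf, s.data(), std::min<std::size_t>(8, s.size()));
    uint64_t x = 0;
    for (int i = 0; i < 8; i++) x = (x << 8) | buf[i];
    return x;
}

void sort_strings(std::vector<std::string> &v) {
    std::vector<Key> keys(v.size());
    for (std::size_t i = 0; i < v.size(); i++) keys[i] = {first8(v[i]), &v[i]};
    std::sort(keys.begin(), keys.end(), [](const Key &a, const Key &b) {
        if (a.prefix != b.prefix) return a.prefix < b.prefix;  // common, contiguous case
        return *a.s < *b.s;  // rare slow path: compare the full strings
    });
    // keys now gives the sorted order; permute v as needed.
}
```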
Michel Lemay says:
Hi Daniel, so what is the result of that timsort on your machine?
@Michel

Timsort is a pretty good idea. It is 50% slower on shuffled arrays, but drastically (10x) faster on sorted ones.
```
std::sort(v.begin(), v.end()) [ std::sort(v.begin(), v.end())]: 40.95 cycles per element
gfx::timsort(v.begin(), v.end()) [ std::sort(v.begin(), v.end())]: 2.01 cycles per element
std::sort(v.begin(), v.end()) [ std::random_shuffle(v.begin(), v.end())]: 234.56 cycles per element
gfx::timsort(v.begin(), v.end()) [ std::random_shuffle(v.begin(), v.end())]: 341.48 cycles per element
std::sort(v.begin(), v.end()) [ std::sort(v.rbegin(), v.rend())]: 32.29 cycles per element
gfx::timsort(v.begin(), v.end()) [ std::sort(v.rbegin(), v.rend())]: 2.65 cycles per element
```
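For anyone who wants to try this at home, here is a minimal sketch of such a harness, assuming the header-only cpp-TimSort library; it reports nanoseconds per element via `std::chrono` rather than cycles, so the numbers are not directly comparable to those above.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

#include "timsort.hpp"  // header-only cpp-TimSort: https://github.com/gfx/cpp-TimSort

// Time a sorting function on a copy of the input, in nanoseconds per element.
template <typename F>
double ns_per_element(std::vector<int> v, F sorter) {
    auto start = std::chrono::steady_clock::now();
    sorter(v);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / v.size();
}

int main() {
    std::vector<int> v(1 << 20);
    std::mt19937 gen(42);
    for (auto &x : v) x = static_cast<int>(gen());

    std::sort(v.begin(), v.end());  // already-sorted input: timsort's best case
    std::printf("std::sort    (sorted):   %6.2f ns/element\n",
                ns_per_element(v, [](std::vector<int> &w) { std::sort(w.begin(), w.end()); }));
    std::printf("gfx::timsort (sorted):   %6.2f ns/element\n",
                ns_per_element(v, [](std::vector<int> &w) { gfx::timsort(w.begin(), w.end()); }));

    std::shuffle(v.begin(), v.end(), gen);  // shuffled input
    std::printf("std::sort    (shuffled): %6.2f ns/element\n",
                ns_per_element(v, [](std::vector<int> &w) { std::sort(w.begin(), w.end()); }));
    std::printf("gfx::timsort (shuffled): %6.2f ns/element\n",
                ns_per_element(v, [](std::vector<int> &w) { gfx::timsort(w.begin(), w.end()); }));
    return 0;
}
```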
You should compare different implementations of the standard library. If I recall correctly, clang’s `libc++` is optimized for the common cases, whereas gcc’s `libstdc++` is not.
Do you expect that my analysis will depend on the standard library I use?
I sure hope so, otherwise their optimizations would be pointless 🙂