I’ve tried to find a difference between Dutch National Flag algorithms based on the number of swaps they do, but nothing seemed to show up in the timings.
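For concreteness, here is a minimal sketch of the textbook three-way partition (Dijkstra's variant), not any particular implementation under test; the variants mostly differ in how they arrange these swaps:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Textbook Dutch National Flag (three-way) partition, Dijkstra's variant.
// Illustrative sketch only.
void dnf_partition(std::vector<int> &v, int pivot) {
    std::size_t lo = 0, mid = 0, hi = v.size();
    while (mid < hi) {
        if (v[mid] < pivot)
            std::swap(v[lo++], v[mid++]);   // extend the "< pivot" prefix
        else if (v[mid] > pivot)
            std::swap(v[mid], v[--hi]);     // extend the "> pivot" suffix; recheck v[mid]
        else
            mid++;                          // "== pivot" stays in the middle
    }
}
```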
String sorting is also particularly interesting on current architectures, for both caching and branch prediction reasons.
One typo to note: you put an n in front of “log n is 20”.
Variable-length or very long strings can make it hard for the processor to look ahead. That’s why it is a good idea to get rid of strings when you can.
KWillets says:
Well, I could go on about that, but for long strings (total distinguishing prefixes D >> n log n) the most common character comparison is equality, so branch prediction is pretty easy. It’s the cache complexity that sucks — many classic algorithms are O(D) in cache misses.
I did not use the expression “branch prediction” anywhere because I think that counting branch mispredictions is too simplistic.
Variable-length strings can seriously hurt superscalar execution even if there are few branch mispredictions… because the processor can’t tell when the string ends… and so it does not start working on the next string.
You are right that memory accesses are going to be expensive, but if they can be predicted ahead of time, they can be free… because the data has been prefetched.
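As a minimal sketch, this is what an explicit software prefetch looks like with the GCC/Clang `__builtin_prefetch` intrinsic (the access pattern and prefetch distance are illustrative, not taken from any benchmark):

```cpp
#include <cstddef>

// Sketch: software prefetching for a predictable indirect access pattern.
// By the time we dereference ptrs[i], the line was requested 8 iterations ago.
long sum_indirect(const int *const *ptrs, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(ptrs[i + 8]);  // request the line we will need soon
        total += *ptrs[i];
    }
    return total;
}
```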
The problem with variable-length strings is that they can blind the processor to what is coming next. It just sees the string and does not look at the work that needs to be done after looping over the string.
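Here is a minimal illustration of that data dependence (hypothetical functions, for illustration only):

```cpp
#include <cstddef>

// With a null-terminated string, the loop's trip count is a data dependency:
// the processor cannot commit to the work that follows until the terminating
// byte has actually been loaded and tested.
std::size_t hash_cstr(const char *s) {
    std::size_t h = 0;
    while (*s != '\0') {  // exit condition depends on loaded data
        h = h * 31 + static_cast<unsigned char>(*s++);
    }
    return h;
}

// With an explicit length, the trip count is known before the loop starts,
// so the hardware can overlap iterations and begin the next string's work.
std::size_t hash_span(const char *s, std::size_t n) {
    std::size_t h = 0;
    for (std::size_t i = 0; i < n; i++) {
        h = h * 31 + static_cast<unsigned char>(s[i]);
    }
    return h;
}
```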
Of course, if the strings do not fit in CPU cache, then you are going to start having performance trouble… at some point, you cannot sort faster than you can read and write to RAM… but there are cache-friendly algorithms that can help…
KWillets says:
I see; yes, I mistook branch prediction for cache prediction. The latter indeed suffers with anything using indirect references, which the memory predictor can’t predict, and the pointers are by definition permuted vs. the original data (or at least they become that way after a few rounds). That’s what makes it so exciting.
The good news is that it’s possible to sort with only O(n log n) real cache misses, with the other O(D) character accesses being contiguous and prefetchable.
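One sketch in that spirit, assuming a simple prefix-key trick rather than any particular published algorithm: keep a fixed-size key prefix next to each pointer so that most comparisons only touch a contiguous array, and the out-of-cache string data is read only to break ties.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct Key {
    uint64_t prefix;       // first 8 bytes, packed so integer order == byte order
    const std::string *s;  // fall back to the full string on ties
};

// Pack the first 8 bytes big-endian; shorter strings are zero-padded.
uint64_t first8(const std::string &s) {
    unsigned char buf[8] = {0};
    std::memcpy(buf, s.data(), std::min<std::size_t>(8, s.size()));
    uint64_t x = 0;
    for (int i = 0; i < 8; i++) x = (x << 8) | buf[i];
    return x;
}

void sort_strings(std::vector<std::string> &v) {
    std::vector<Key> keys(v.size());
    for (std::size_t i = 0; i < v.size(); i++) keys[i] = {first8(v[i]), &v[i]};
    std::sort(keys.begin(), keys.end(), [](const Key &a, const Key &b) {
        if (a.prefix != b.prefix) return a.prefix < b.prefix;  // common, contiguous case
        return *a.s < *b.s;  // rare slow path: compare the full strings
    });
    // keys now gives the sorted order; permute v as needed.
}
```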
Michel Lemay says:
Hi Daniel, so what is the result of that timsort on your machine?
@Michel

Timsort is a pretty good idea. It is 50% slower on shuffled arrays, but drastically (10x) faster on sorted ones.
```
std::sort(v.begin(), v.end()) [ std::sort(v.begin(), v.end())]: 40.95 cycles per element
gfx::timsort(v.begin(), v.end()) [ std::sort(v.begin(), v.end())]: 2.01 cycles per element
std::sort(v.begin(), v.end()) [ std::random_shuffle(v.begin(), v.end())]: 234.56 cycles per element
gfx::timsort(v.begin(), v.end()) [ std::random_shuffle(v.begin(), v.end())]: 341.48 cycles per element
std::sort(v.begin(), v.end()) [ std::sort(v.rbegin(), v.rend())]: 32.29 cycles per element
gfx::timsort(v.begin(), v.end()) [ std::sort(v.rbegin(), v.rend())]: 2.65 cycles per element
```
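For anyone who wants to try this at home, here is a minimal sketch of such a harness, assuming the header-only cpp-TimSort library; it reports nanoseconds per element via `std::chrono` rather than cycles, so the numbers are not directly comparable to those above.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

#include "timsort.hpp"  // header-only cpp-TimSort: https://github.com/gfx/cpp-TimSort

// Time a sorting function on a copy of the input, in nanoseconds per element.
template <typename F>
double ns_per_element(std::vector<int> v, F sorter) {
    auto start = std::chrono::steady_clock::now();
    sorter(v);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / v.size();
}

int main() {
    std::vector<int> v(1 << 20);
    std::mt19937 gen(42);
    for (auto &x : v) x = static_cast<int>(gen());

    std::sort(v.begin(), v.end());  // already-sorted input: timsort's best case
    std::printf("std::sort    (sorted):   %6.2f ns/element\n",
                ns_per_element(v, [](std::vector<int> &w) { std::sort(w.begin(), w.end()); }));
    std::printf("gfx::timsort (sorted):   %6.2f ns/element\n",
                ns_per_element(v, [](std::vector<int> &w) { gfx::timsort(w.begin(), w.end()); }));

    std::shuffle(v.begin(), v.end(), gen);  // shuffled input
    std::printf("std::sort    (shuffled): %6.2f ns/element\n",
                ns_per_element(v, [](std::vector<int> &w) { std::sort(w.begin(), w.end()); }));
    std::printf("gfx::timsort (shuffled): %6.2f ns/element\n",
                ns_per_element(v, [](std::vector<int> &w) { gfx::timsort(w.begin(), w.end()); }));
    return 0;
}
```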
You should compare different implementations of the standard library. If I recall correctly, clang’s `libc++` is optimized for the common cases, whereas gcc’s `libstdc++` is not.
Do you expect that my analysis will depend on the standard library I use?
I sure hope so, otherwise their optimizations would be pointless 🙂