A case study in the performance cost of abstraction (C++’s std::shuffle)
15 thoughts on “A case study in the performance cost of abstraction (C++’s std::shuffle)”
Oddly, the difference persists when using 64-bit integer types throughout, and when using a different (64-bit) random number generator. But only on clang/libc++. Using g++ 6.0 and libstdc++, the difference vanishes completely.
This suggests a rather serious bug in the libc++ implementation of std::shuffle: std::shuffle was specified precisely to allow implementations that incur no runtime penalty from abstraction, and in that respect it replaces the now-deprecated std::random_shuffle, whose interface forced inefficient implementations.
For what it’s worth, I don’t think an implementation can actually do better than your “naïve” textbook implementation, and implementors should use it (libstdc++ does, libc++ doesn’t).
And there’s a reason he doesn’t actually take an input and produce a total running time as output, and instead cheats by showing only “clock cycles per value”: for any small shuffle it makes no difference, and for it to matter at all you would have to be shuffling billions of values.
Hence, std::shuffle.
Unless you’re shuffling many collections constantly, this optimization isn’t worthy of any note. BTW, just for edification, I wrote a shuffler that runs as a micro-service on the network. All it does is return a shuffled deck of cards for the next hand of poker for thousands and thousands of tables of poker running concurrently (for a real live poker app that is in production). I would never have bothered with this “optimization”.
EDIT: If I’m doing the math right (and I often don’t), his example code runs in about 25ns using std::shuffle. And roughly half that using his “textbook” version. So, how many items do we need to be shuffling for any of this to make any real difference? Roughly billions (see above). Talk about useless/premature optimization.
Unless you’re shuffling many collections constantly, this optimization isn’t worthy of any note.
It is not an optimization to write 3 lines of code straight out of a textbook.
clearly it is, since it improves the speed…
I think that the term “optimization” refers to an intent. That is, you mean the code to be faster. Otherwise the term is meaningless.
I would call it a de-pessimization.
“Talk about useless/premature optimization.”
This is neither useless nor premature. I surmise you are from the school that misinterprets Hoare and Knuth, and uses it as an excuse to not bother thinking about what you’re coding.
I don’t think I can get on board with the conclusion/title.
First off, std::string vs const char * seems to me to be entirely orthogonal to the implementation of std::shuffle. shuffle will shuffle any iterable container, so you can have an array of string or an array of const char *. Maybe you are adding string vs const char * to the “cost of abstraction” argument (the title makes it seem like it’s only shuffle)? But in any case, string is not slower than const char * in this particular case because of abstraction. It’s slower here because it has different performance characteristics (as part of a deliberate choice); knowing its own size, having extra space, and avoiding heap allocations in many cases, all allow it to be faster than const char * in many other cases, just not in this one.
Second, the thing that actually makes shuffle slow is not abstraction, it’s an implementation detail, which is motivated by a need for correctness (arrays with at least 4 billion elements, which with const char * would only require 32 gigs of RAM, btw). The only abstraction in shuffle’s interface is the use of iterators instead of raw pointers, but this hasn’t been shown to have any cost.
Ironically, shuffle’s implementation could be trivially altered to have optimal performance/correctness characteristics by adding more abstraction. Namely, you could add another template parameter controlling the integer type used. It would default to size_t for correctness, but users could switch it to whatever was fastest on their platform if they had foreknowledge of their array size.
Of course, this comes at a cost in interface complexity. Anyhow, interesting information, but I don’t agree with the conclusions.
Anyhow, interesting information, but I don’t agree with the conclusions
That’s fair. Note however that I always post my code and that my experimental results should be easily reproducible. So you can disagree with my interpretation, but, hopefully, you should not disagree with the experimental observations.
Maybe you are adding string vs const char * to the “cost of abstraction” argument (the title makes it seem like it’s only shuffle)? But in any case, string is not slower than const char * in this particular case because of abstraction. It’s slower here because it has different performance characteristics (as part of a deliberate choice); knowing its own size, having extra space, and avoiding heap allocations in many cases, all allow it to be faster than const char * in many other cases, just not in this one.
std::string has a higher abstraction level than char *. For performance-sensitive code, I avoid std::string. For the type of work I do, it has historically always been slower than a C string. I hear that in C++11, some of the performance issues with std::string have been addressed, but prior to this, there is ample documentation on the problems caused by std::string. I should add that I am not a fan of the char * paradigm on the principle that it encourages branching… but experimental evidence shows that it is very good for performance in many applications.
Second, the thing that actually makes shuffle slow is not abstraction, it’s an implementation detail, (…)
We are comparing calling a library that handles all the implementation details for us, with coding our own so that the implementation is transparent. The library call has a higher level of abstraction. A higher level of abstraction can be beneficial, and it can have a cost… usually, you have a bit of both… benefits and costs…
The only abstraction in shuffle’s interface is the use of iterators instead of raw pointers, but this hasn’t been shown to have any cost.
I haven’t checked but I do not expect the iterators to be an issue as far as performance goes (based on past experience).
Iterators in C++ are great performance-wise. (In Java, they can be a problem.)
Ironically, shuffle’s implementation could be trivially altered to have optimal performance/correctness characteristics by adding more abstraction. Namely, you could add another template parameter controlling the integer type used. It would default to size_t for correctness, but users could switch it to whatever was fastest on their platform if they had foreknowledge of their array size.
Or you could branch on the number of elements that there are in the container. I think that the library could be engineered so that the difference I indicate goes away.
Unfortunately you do not conclusively answer the question ‘why?’ Please investigate further what the reason is, for example by analyzing the generated code.
Because I provide my full source code, you should feel free to proceed with any analysis you have in mind.
Why should I? It is your blog. Work out the details and you can make some interesting observations.
I did work it out and it is explained in my blog post. Divisions are the bottleneck (I explain that by removing them I can multiply the speed by 10x). The standard library uses 64-bit arithmetic. (Hint: a 64-bit division is much slower than a 32-bit division…) So that is all in the blog post.
It is entirely reasonable to want a more thorough analysis, but I keep my blog posts short by design, preferring instead to refer the readers to the code and encouraging experiments.
In any case, whatever deep analysis I come up with can be reasonably questioned. Why did you use this compiler and not this other compiler? Why did you use this standard library? What about the processor, what happens if you try another… it is a rabbit hole… one has to define the scope somehow.
Here’s one more argument supporting a custom (as opposed to std::) shuffling algorithm:
How often is it that you need a whole array shuffled anyway? Most use cases for shuffling only require a handful of items (say, two hands of five cards from a deck of 52+2) where items are only required to be distinct from each other and from the other hands. Hence a more efficient algorithm would amortize the shuffling cost over each draw of the next item, so the entire array never needs to be shuffled.
Pseudo-random number generation can behave in surprising ways: in many situations a random pick does not contain enough variation to approximate the real behavior of cards as well as an actual shuffle does.
There may be scenarios where this possible weakness does not matter, but when modelling cards, an actual shuffle is typically a much closer model than a random pick.