Daniel Lemire's blog

, 35 min read

Do not waste time with STL vectors

46 thoughts on “Do not waste time with STL vectors”

  1. Hi Daniel,
    Thanks for this post, this is good timing for a project I’m working on. If you intend to investigate this kind of topic in more depth, I’d love to see some analysis of using vector<vector> for 2-d arrays of raw data. In particular I’m trying to understand how this representation affects memory locality, and how I can optimally allocate and traverse column-major blocks of data retrieved from a 3rd-party. The goal is to produce typed data values from the raw bytes of each column. Since each column may have a very different data type, the memory access stride will change when finishing one column and moving to the next, so this already impacts the hardware prefetcher. Are things made worse by the fact that each column may be allocated in very different locations of memory? Last, I have a separate 2-d array of per-cell status indicators, so this compounds the locality challenge.

    Thanks for maintaining this blog, I’ve found it very useful.
    -Robert

  2. mars says:

    In your code, yhy is the following not safe?

    int runtestnoalloc(size_t N, int * bigarray) {
    for(unsigned int k = 0; k<N; ++k)
    bigarray[k] = k;// unsafe

    int sum = 0;
    for(unsigned int k = 0; k<N; ++k)
    sum += bigarray[k];

    return sum;
    }

    The bigarray is never out of scope so shouldn't it be fine? Is it because bigarray[k] can overwrite something in the memory?

  3. @mars

    It is perfectly safe in this instance, but playing with pointers can lead to segmentation faults. How do you know that whoever called this function has allocated enough memory?

    Using STL containers is safer because you can check.

  4. Konrad says:

    If you really need the raw speed during construction, there’s another variant, which I’ve just sent you via a pull request on GitHub.

    The salient part is construction of the vector via a pair of iterators. But we don’t actually construct the range of values in memory (that would just shift the initialisation problem elsewhere), we use a special iterator type “iota” which generates the values on the fly. Creating this iterator incurs a bit of boilerplate code but it’s a general-purpose class that can easily be recycled.

    This is “good C++” because it uses high-level algorithmic building blocks rather than manual memory management or UB hacks, and furthermore increases the abstraction of the initialisation code (making the loop obsolete).

    Interestingly, this variant is *even faster* than the fastest other code (which is illegal C++ anyway and can’t really be counted) when compiled with optimisation level 3, and it’s on par with the other methods when compiled with O2.

  5. srean says:

    Was this for g++ ? I would have expected the call to push_back() to be inlined away.

    May be the compiler needs to be strongly persuaded.

    1. Travis Downs says:

      push_back will certainly be “inlined” with most optimization settings but not “inlined away” – the substance of the call, including the size check and fallback path will remain, which accounts for the slowness.

      Vector is definitely not a zero cost abstraction, especially if you use it size-dynamically eg using push_back. It’s just too hard for the compiler to optimize away the size checks, and this also inhibits vectorization and other optimizations (you can get even worse results than those reported here in real world cases).

  6. @srean

    I post the code so that you can run your own checks:

    https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2012/06/20/testvector.cpp

    There is a significant cost to using push_back.

  7. srean says:

    Ah ! I missed the comment on the first line about g++ and opt flags, hence my question

    I hear icc is very good at inlining away such overhead. However I dont have acess to it.

  8. Konrad says:

    @srean Even without looking at the assembly I’m all but *certain* that the `push_back` call is inlined – this is a trivial optimisation for g++ and the standard library was *designed* with this optimisation in mind. The cost you see here is not the cost of the call, it’s the cost of the bounds check (`push_back` needs to check whether the internal buffer needs to be resized, even if we reserved memory beforehand). Otherwise `push_back` and the `[]` method would be on par – after all, the latter also implies a call. But it doesn’t have the overhead of the bounds check so we can infer that the time difference between these two methods is entirely explained by said bounds check.

    One word about the iterator method I posted above: I made an interesting mistake: the iterator should have been defined as a random access iterator. That way, the vector constructor of the libstdc++ implementation will pre-allocate an internal buffer of exactly the right size. My iterator implementation fails to provide this information. The mistake is interesting because g++ obviously pre-allocates the memory *anyway* – otherwise the runtime we observe would be impossible. This showcases how incredibly powerful the optimiser is – it infers the missing information about the required buffer size itself.

    In fact, correcting the mistake and implementing the iterator as a random access iterator doesn’t change the performance – conclusive proof that the compiler does the optimisation discussed above.

  9. srean says:

    @Konrad Oh! I had totally forgotten about bounds checking. Yes now it makes sense.

  10. @srean

    It is entirely possible that another compiler could optimize away the overhead of the push_back. Even so, if you really need speed, you can’t just hope for the best. I recommend you use push_backs sparingly in performance sensitive code.

    (I should stress that this is for vectors of integers. You might get vastly different results for vectors of objects.)

  11. srean says:

    Sure.

    In case I gave another impression, my point was never to argue against your recommendation, but to understand what exactly is the reason behind the slowdown and how can compilers mitigate it and if not why not. Konrad’s comments was very illuminating in that respect.

    I think even with vectors of objects, push_back will be slower than directly mutating the contents of a valid location in the vector.

  12. Ivan says:

    meh, tbh this kind of data is almost useless for me because I rarely spend my time filling vectors. I spend time reading them, so comparison of 2dvector and hackish 2d vector that is implemented as 1d vector would be more interesting.

    Also you should have mentioned std::move and how it makes STL faster in C++11.

  13. @Ivan

    You think that std::move would make things faster in this particular case?

    1. It looks like using std::move could be significantly faster

  14. Ivan says:

    Nope, I was talking about:
    “Another problem with the vector template is that whenever you copy a vector, you copy all the data it contains. Again, to avoid this problem, you must use the swap method.”

  15. Ivan says:

    ofc, it only work for temporaries if recall correctly. RVR are mystic to me tbh . 🙂

  16. Anonymous says:

    Perhaps a typo: the phrase “Suppose I want to create an array in C++ containing the numbers from 0 to N in increasing order” should perhaps have “N-1”.

  17. Valrandir says:

    Consider using iterators to access the vector elements rather than using operator[], which needs to calculate the memory offset based on the position and the size of each element. It probably does it as such: return ptr[position], and the compiler then needs to calculate *((void*)ptr + sizeof(T) * position), which means a multiplication for every access. c++ Pointer arithmetic will do the sizeof multiplication implicitly.

    Consider this alternative:

    vector bigarray(N);
    auto it = bigarray.begin();
    auto end = bigarray.end();
    for(; it < end; ++it)
    *it = k;
    int sum = total(bigarray,N);
    return sum;

    This alternative should simply add sizeof(T) (in this case sizeof(int)) to the pointer. It means we only need one addition, instead of both an addition and a multiplication.

    This should close the gap with using c++ new.

  18. @Valrandir

    which means a multiplication for every access

    I agree that a compiler could implement this way, but modern compilers should know better.

    For one thing, a multiplication by sizeof(T) is a multiplication by a power of 2, which the compiler can implement as a mere shift.

    But even so, in this case the compiler almost certainly avoids even this overhead.

    Feel free however to run benchmarks to verify your conjecture (follow the link to get the code).

  19. Valrandir says:

    I ran benchmarks using iterators and found no meaningful difference compared to using operator[]

  20. Jeff says:

    Since you are using C++11x, you should look at emplace_back(). It should be significantly faster than push_back.

  21. @Jeff

    I do not think that emplace_back is likely to be faster in this case. See http://stackoverflow.com/questions/13883299/c-vector-emplace-back-is-faster

  22. Jeff says:

    I don’t think the person who answered that SO question knows what he’s talking about. The other SO article referenced there (http://stackoverflow.com/questions/4303513/push-back-vs-emplace-back) does support my claim. And from http://en.cppreference.com/w/cpp/container/vector/emplace_back: “Appends a new element to the end of the container. The element is constructed in-place, i.e. no copy or move operations are performed. The constructor of the element is called with exactly the same arguments that are supplied to the function.”

    If you’re constructing in-place and you avoid a temporary variable, that should be faster as you avoid both a second memory allocation and a copy.

  23. @Jeff

    Well, it is not faster and even slightly slower according to my tests:

    https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2012/06/20/testvectorcpp11.cpp

  24. Jeff says:

    I’ve played with your benchmark code, and I can say: that’s a result of the benchmark you’re using.

    emplace_back is meant to avoid having to run a copy constructor on an object…but you’re using it in a benchmark that is adding a bunch of ints, which is a primitive, not an object. As a result, it’s entirely possible that you’re hitting code paths unnecessarily where emplace_back is requiring extra steps for a primitive as it normally requires a new object to be created to be created whereas the simple copy of the primitive is optimized away by the compiler.

    I can make some small changes to your benchmark and have emplace_back come out ahead — like using uint_fast64_t instead of int and removing the reservation (which is not always feasible, as I’m sure you’re aware). Yes, I’m aware that this is not the case you originally presented in your blog.

    Moreover, even with a very simple actual object I can make emplace_back become neck and neck with push_back and pull ahead if you remove reservation:

    class BenchmarkTester {
    public:
    BenchmarkTester()
    : m_myVec(100)
    , m_myBool(false)
    , m_myPair(24.5, std::string(“Happy days are here again”))
    {
    }

    private:
    std::vector m_myVec;
    bool m_myBool;
    std::pair m_myPair;
    };

    With complex objects, emplace_back can start becoming very time-saving. Like all things, it’s a matter of using it when appropriate.

  25. @Jeff

    Please see Why I like the new C++ and specifically the “move semantics” section where I show evidence that push_back is already much faster for large objects in C++11.

    I would really like to see a full benchmark that shows that emplace_back is even faster…

    (Look at my GitHub files, I did check it and I see no benefit to emplace_back… even for large objects.)

  26. Jeff says:

    I don’t know which GitHub files you’re indicating and don’t really have time to trawl through all of them finding other uses of emplace other than the one file you already pointed me to. But it wasn’t difficult in that one file to make changes (mainly using the simple object that I posted above) that caused emplace_back to be faster than push_back, and I expect that trend to be even more pronounced with increased object complexity and/or size.

    I didn’t actually even create a copy constructor in my class above, and still got slightly faster execution than push_back. If you have copy constructors that actually go through various members of a class and copy all of the data (perhaps running an equality check first to avoid copying if the objects are equal, which itself might take a lot of time), then I would expect emplace_back to not just equal push_back but surge way ahead.

    It’s just clearly not optimized for pushing primitives into a vector, as they aren’t proper objects to begin with.

  27. @Jeff

    Ok, I was able to get a 10% gain with emplace_back under GCC 4.7 using a variation on your class and a 20% with int on a fast destkop.

    Please see my new code there:

    https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2012/06/20/testvectorcpp11.cpp

    20% is not a lot, but it is on a primitive… however, only on my desktop, not on my laptop… However, if I switch to clang 3.2, the difference is even more significant.

    So my conclusion is that you are right that emplace_back is faster, though there might compiler/cpu issues to deal with.

  28. Jeff Mitchell says:

    Cool 🙂 Sounds more like what should happen. After all, if emplace_back was slower in the general case, there would hardly have been much impetus to put it in 🙂

  29. sonal says:

    Hello,

    I am facing a similar issue with C++ hash_map.
    I am reading one CSV input file and storing each column value (with column name) in a hash_map. These values are added dynamically and this hash_map is stored inside a “Record” object which is also allocated dynamically.
    So effectively, i have lot of memory being allocated dynamically and if the input file is huge in size, it ends up using a lot of memory.
    When the file is processed, i am calling “clear()” method on the hash_map and in addition deleting memory for these columns and the record object too. I have used appropriate destructors too.

    But even after calling clear() and delete(), the memory is not released to the O.S. It shows the same amount of memory being used by my process with the “top” command before calling clear() or delete().
    I tried using swap() trick to swap it with empty hash_map, but still it does not seem to release the memory.

    Am I missing something here ? Please give suggestions.

  30. Jason says:

    Hi, Daniel:
    I guess you could change a vector’s capacity using vector::shrink_to_fit(). http://en.cppreference.com/w/cpp/container/vector/shrink_to_fit

    1. Good point, yes, if you have C++11 support.

  31. I started reading this post because of its uncommon title. Nice wordplay!! 😀

  32. ashish says:

    Well!! nice article, but regarding the title. I think this is more of misguiding rather than word-play.

  33. sudo says:

    It is outside the specifications and could potentially be unsafe, but it saves you up to half a CPU cycle per integer.

    I thought it was the best way to do it. We both agree on the point that it’s faster. Could you explain how does constructing an empty vector and then reserving the max size it can go can turn out to be “unsafe”?

    1. I don’t think the concept of “unsafe” works in that direction. That is, you have to show that it is safe as per the specifications. If you cannot show that, then it fair to call it unsafe because you just don’t what the compiler and the standard library might do. It might work one day on your system and then fail horribly on some other system.

    2. jwakely says:

      It’s not unsafe to reserve the space, obviously. It’s undefined behaviour to write into the space that has been reserved but not yet made part of the valid range of the vector. If you do that the compiler is allowed to turn your program into a smouldering pile of ashes. Specifically, it might actually check with some compiler options (e.g. AddressSanitizer, or a “checked iterators” debugging mode) and abort at runtime.

  34. Gauss says:

    Your sum is:
    + + + … + +
    Reordering:
    + + + + + + …
    Since:
    + = + = + = … ,
    then your sum is:
    /2 * (+) =

    1 + 2 + 3 + … + (N-1) + N = N * (N+1) / 2

  35. Kurt Guntheroth says:

    look at std::vector::emplace_back(), which constructs the element in place using template magic. Haven’t timed it though.

    1. Please see the above comments. I don’t think it can be faster.

    2. jwakely says:

      Emplacing an int isn’t going to help anything at all.

  36. Travis Downs says:

    I am a bit surprised that even modern versions of clang and gcc cannot optimize away the zeroing of the array implied in the std::vector(size_t N) constructor, since the zeroed elements are fully overwritten immediately after. They can recognize and eliminate such assignments in simpler cases, e.g., assign_heap_fixed in this example. It seems like the problem is with array of dynamic size: it works for fixed-sized arrays (maybe only up to a limit). This may have to do with the compiler’s ability to “scalarize” small fixed-size arrays: to treat them as if all their elements were independent variables and hence track dead stores, etc, for each one.

  37. András Sz?cs says:

    What about the best of both worlds?
    Use reserve() and index based access:

    vector<int> bigarray;
    bigarray.reserve(N)
    for(unsigned int k = 0; k<N; ++k)
    bigarray[k] = k;
    int sum = total(bigarray,N);
    return sum;

    For me it reached the speed of the C-style array (C++ new in your post).

    1. But that is not legal code, it leaves you with a broken vector instance and would be a fireable offense at most places. 🙂

      1. András Sz?cs says:

        Interesting! I found that ‘solution’ after I forgot to change a line after some copy-pasting, after I was investigating things. That is when I came across this post. Prior to that mistake I never would have thought about accessing a vector in this fashion and now I wonder how broken this vector become.