Daniel Lemire's blog

, 2 min read

How expensive are the union and intersection of two unordered_set in C++?

If you are using a modern C++ (C++11 or better), you have access to set data structures (unordered_set) which have the characteristics of a hash set. The standard does not provide us with built-in functions to compute the union and the intersection of such sets, but we can make our own. For example, the union of two sets A and B can be computed as follow:

out.insert(A.begin(), A.end());
out.insert(B.begin(), B.end());

where out is an initially empty set. Because insertion in a set has expected constant-time performance, the computational complexity of this operation is O(size(A) + size(B)) which is optimal. If you are a computer scientist who does not care about real-world performance, your work is done and you are happy. But what if you want to build fast software for the real world? How fast are these C++ sets?

I decided to populate two sets with one million integers each, and compute how how many cycles it takes to compute the intersection and the union, and then I divide by 2,000,000 to get the time spent per input element.

intersection (unordered_set) 100 cycles/element
union (unordered_set) 250 cycles/element

How good or bad is this? Well, we can also take these integers and put them in sorted arrays. Then we can invoke the set_intersection and set_union methods that STL offers.

set_intersection (std::vector) 3 cycles/element
set_union (std::vector) 5 cycles/element

That’s an order of magnitude better!

So while convenient, C++’s unordered_set can also suffer from a significant performance overhead.

What about std::set which has the performance characteristics of a tree? Let us use code as follows where out is an initially empty set.

std::set_union(A.begin(), A.end(), B.begin(), B.end(),
                        std::inserter(out,out.begin()));
set_intersection (std::set) 150 cycles/element
set_union (std::set) 750 cycles/element

As we can see, results are considerably worse.

The lesson is that a simple data structure like that of std::vector can serve us well.

My source code is available.