, 2 min read
How expensive are the union and intersection of two unordered_set in C++?
If you are using a modern C++ (C++11 or better), you have access to set data structures (unordered_set) which have the characteristics of a hash set. The standard does not provide us with built-in functions to compute the union and the intersection of such sets, but we can make our own. For example, the union of two sets A and B can be computed as follow:
out.insert(A.begin(), A.end());
out.insert(B.begin(), B.end());
where out
is an initially empty set. Because insertion in a set has expected constant-time performance, the computational complexity of this operation is O(size(A) + size(B)) which is optimal.
If you are a computer scientist who does not care about real-world performance, your work is done and you are happy.
But what if you want to build fast software for the real world? How fast are these C++ sets?
I decided to populate two sets with one million integers each, and compute how how many cycles it takes to compute the intersection and the union, and then I divide by 2,000,000 to get the time spent per input element.
intersection (unordered_set) | 100 cycles/element |
---|---|
union (unordered_set) | 250 cycles/element |
How good or bad is this? Well, we can also take these integers and put them in sorted arrays. Then we can invoke the set_intersection
and set_union
methods that STL offers.
set_intersection (std::vector) | 3 cycles/element |
---|---|
set_union (std::vector) | 5 cycles/element |
That’s an order of magnitude better!
So while convenient, C++’s unordered_set
can also suffer from a significant performance overhead.
What about std::set which has the performance characteristics of a tree? Let us use code as follows where out
is an initially empty set.
std::set_union(A.begin(), A.end(), B.begin(), B.end(),
std::inserter(out,out.begin()));
set_intersection (std::set) | 150 cycles/element |
---|---|
set_union (std::set) | 750 cycles/element |
As we can see, results are considerably worse.
The lesson is that a simple data structure like that of std::vector can serve us well.