1st May 2008, 1 min read

Seeking an efficient algorithm to group identical values

In the past, I have had luck with my requests for help, so here is another one.

Suppose you have a large array made of a large number of distinct values ({A,B,A,B,A,C,C}) and you want to group the identical values like so {A,A,A,B,B,C,C} or like so {C,C, A,A,A,B,B}. That is, you do not care about the order, you just want identical values to be clustered. How do you do it?

You could sort the array assuming that there is an ordering between the objects. There are highly efficient external-memory sorting routines. They mostly rely on sequential IO and are amazingly fast.
You can build a hash table. Because it is a linear time operation, for very large arrays, it should be faster than sorting, in theory. However, the catch is that external-memory hash tables are not very efficient because they rely on random IO and are prone to cache misses. Remember kids that 100 n > n log n despite what your math. teacher taught you.
We can mix hashing and sorting. Scan the array, and randomly hash each value into one of L bins. You know that if the value x appears in bin i, then all values x are in the same bin. So, you can simply sort each bin and concatenate the bins in O(n log n/L) time, assuming your hashing is good enough.- One last possible trick might be to adapt fast duplicate detection algorithms such as the Teuhola-Wegner algorithm: J. Teuhola, L. Wegner, Minimal space, average linear time duplicate deletion, Communications of the ACM, 1991.

So, what do you think?