Quickly returning the top-k elements: computer science vs. the real world

14 thoughts on “Quickly returning the top-k elements: computer science vs. the real world”
Hi, thanks for the nice post. A couple of comments:
– Depending on the application, sorting may fail badly. In web search, for instance, each server (in the data center) is responsible for scoring millions of documents and returning the top-k, where k is very small (like 10 or 100). So using a heap may yield a huge performance gain (plus there is no need to materialize an array for millions of docs, a gain in space and an advantage for multi-core parallelism).
– Databases are another story, and I would agree with your observation (although there was once huge interest in making such top-k queries efficient; at one point, back in 2004, we also proposed an extended SQL algebra for handling tuple scores natively in databases, together with processing algorithms that would exploit such scores for faster processing). Nevertheless, I am not aware of industrial adoption of such ideas, but maybe in some research prototypes…
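For illustration, a minimal Java sketch (my own, not code from the post) of that bounded-heap pattern: the heap's head is the weakest of the current top-k, so almost every score is rejected after a single comparison, and the stream of scores is never materialized as an array:

import java.util.PriorityQueue;
import java.util.Random;
import java.util.stream.DoubleStream;

public class BoundedHeapTopK {
    // Keep the k largest scores seen so far in a min-heap of size k.
    static PriorityQueue<Double> topK(DoubleStream scores, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(k); // natural order: head = smallest
        scores.forEach(s -> {
            if (heap.size() < k) {
                heap.offer(s);               // still filling the heap
            } else if (s > heap.peek()) {    // beats the weakest survivor?
                heap.poll();                 // evict the smallest of the current top-k
                heap.offer(s);
            }
        });
        return heap; // the k largest scores, in heap order
    }

    public static void main(String[] args) {
        // Stand-in for millions of document scores arriving one at a time.
        System.out.println(topK(new Random(42).doubles(1_000_000), 10));
    }
}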
How about a simple-minded insertion sort into an array of size K? After a while, most entries would be immediately rejected.
Insertion sort could be a simpler solution, but it won’t beat the binary heap solution: inserting into a sorted buffer of size K costs up to O(K) per element, so an unlucky input degrades to O(nK) overall, versus O(n log K) for the heap.
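For concreteness, a hypothetical Java sketch of the bounded-insertion idea (mine, not from the post), keeping the k largest values in an ascending sorted buffer so that buf[0] is the weakest survivor:

import java.util.Arrays;

public class InsertionTopK {
    // Returns the k largest values of `values` in ascending order (assumes k >= 1).
    static int[] topK(int[] values, int k) {
        int[] buf = new int[k];
        int size = 0;
        for (int v : values) {
            if (size == k) {
                if (v <= buf[0]) continue;   // the common case: immediate rejection
                int i = 1;                   // displace buf[0], slide smaller values left
                while (i < k && buf[i] < v) { buf[i - 1] = buf[i]; i++; }
                buf[i - 1] = v;
            } else {                         // still filling: ordinary insertion-sort step
                int i = size - 1;
                while (i >= 0 && buf[i] > v) { buf[i + 1] = buf[i]; i--; }
                buf[i + 1] = v;
                size++;
            }
        }
        return Arrays.copyOf(buf, size);
    }
}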
Vertica does top-K optimization. A lot of queries wouldn’t finish without it.
Note that from the computer science perspective it is possible to do better than the heap solution:
1. Find the Kth smallest element of the list using the quickselect algorithm. This takes expected time O(N) (worst-case O(N) with the median-of-medians variant).
2. Use the Kth smallest element as the pivot to partition the array, and keep only the elements smaller than the pivot (similar to what is done in quicksort). This takes time O(N).
The resulting algorithm is linear and thus optimal. If the result needs to be sorted, this simply requires an additional O(K log K), for an overall running time of O(N + K log K).
However, this algorithm seems to have worse locality of reference than the heap-based one, so I wouldn’t expect it to perform better in practice. It could be interesting to try, though; a sketch follows below.
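Here is a rough Java sketch of this approach (my own illustration, using a randomized Lomuto partition): once select(a, 0, n - 1, k - 1) returns, indices 0..k-1 hold the k smallest values, unordered, so no second pass is needed beyond the optional sort.

import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

public class QuickSelectTopK {
    // Returns the k smallest values of `input`, sorted (assumes 1 <= k <= input.length).
    static int[] kSmallest(int[] input, int k) {
        int[] a = input.clone();
        select(a, 0, a.length - 1, k - 1);
        int[] result = Arrays.copyOf(a, k); // the k smallest, unordered
        Arrays.sort(result);                // optional: only if sorted output is needed
        return result;
    }

    // Place the element of rank `rank` at index `rank`, with smaller values before it.
    static void select(int[] a, int lo, int hi, int rank) {
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == rank) return;
            if (p < rank) lo = p + 1; else hi = p - 1;
        }
    }

    // Lomuto partition around a random pivot; returns the pivot's final index.
    static int partition(int[] a, int lo, int hi) {
        swap(a, ThreadLocalRandom.current().nextInt(lo, hi + 1), hi);
        int pivot = a[hi], store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) swap(a, i, store++);
        }
        swap(a, store, hi);
        return store;
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}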
Did you try quickselect followed by a linear scan? It should outperform the heap, at least if k is not too small.
(Guava uses this.)
For the sake of comparison, one could use Guava’s greatestOf/leastOf, which run in O(n + k log k). In my tests, they outperform every other off-the-shelf approach by a large margin, for any combination of K and N.
https://plus.google.com/+googleguava/posts/QMD74vZ5dxc
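For reference, a minimal usage sketch (assuming Guava is on the classpath), reusing the post’s parameters of 1408 values and k = 128:

import com.google.common.collect.Ordering;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class GuavaTopK {
    public static void main(String[] args) {
        List<Integer> values =
            new Random(42).ints(1408).boxed().collect(Collectors.toList());
        // greatestOf returns the 128 largest values, ordered from greatest to least.
        List<Integer> top128 = Ordering.<Integer>natural().greatestOf(values, 128);
        System.out.println(top128.subList(0, 5)); // the five largest
    }
}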
The database engine I work on (Impala) uses the binary heap solution for top-k (ORDER BY/LIMIT) queries. It’s a common pattern in practice (and in benchmarks). I suspect other engines do too. Maybe Postgres is an outlier.
As coincidence would have it, I just ran across the Postgres commit for top-N sorting: https://postgrespro.com/list/thread-id/1270227. Apparently it shows up in the plan as a regular sort, since it uses the same module and switches modes only when the tuple count is large: https://github.com/postgres/postgres/blob/2df537e43fdc432cccbe64de166ac97363cbca3c/src/backend/utils/sort/tuplesort.c#L1587
I got exactly the same question in an Amazon interview and, like an average programmer, I followed the sorting approach and failed the interview miserably. I then spent some time figuring out what went wrong, and that is how I learned about the binary-heap way of solving this problem. The RDBMS angle is a good way of approaching this problem in real-world scenarios. People have shared other solutions too; @Thibaut’s is a good one and I would definitely like to give it a try.
Hello Daniel,
There’s a bit of inconsistency in this post:
– you say you’re generating 1408 random integers but then say blocks is set to 10
– you say you’re looking for the 128 smallest integers but with defaultcomparator you’re polling the smallest one each time
you say you’re generating 1408 random integers but then say blocks is set to 10
There are (blocks + 1) * 128 values being generated. You can verify that 11 * 128 is indeed 1408.
you say you’re looking for the 128 smallest integers but with defaultcomparator you’re polling the smallest one each time
Yes, I have updated the blog post to reflect the fact that we seek the 128 largest values.
I also corrected a critical mistake in the code samples: the sorting code was flat out wrong (incredibly).