Daniel Lemire's blog


Quickly returning the top-k elements: computer science vs. the real world

14 thoughts on “Quickly returning the top-k elements: computer science vs. the real world”

  1. Hi, thanks for the nice post, and a couple of comments:
    – Depending on the application, sorting may fail badly. In web search, for instance, each server in the data center is responsible for scoring millions of documents and returning the top-k, where k is very small (say 10 or 100). Using a heap may therefore yield a huge performance gain (plus there is no need to materialize an array for millions of documents, which saves space and helps multi-core parallelism); see the sketch below.

    Databases are another story, and I would agree with your observation (although there was once huge interest in making such top-k queries efficient; at one point, back in 2004, we also proposed an extended SQL algebra for handling tuple scores natively in databases, together with processing algorithms that exploit such scores for faster processing). Nevertheless, I am not aware of industrial adoption of these ideas, except perhaps in some research prototypes…
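
    A minimal sketch of this bounded-heap approach, in Java (the method name and data layout are hypothetical; a real search server would stream scores rather than take an array):

    ```java
    import java.util.PriorityQueue;

    // Sketch: return the k largest values from n scores. A min-heap of size k
    // holds the k largest values seen so far, so most candidates are rejected
    // with a single comparison against the heap's minimum.
    static int[] topK(int[] scores, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(k); // min-heap
        for (int s : scores) {
            if (heap.size() < k) {
                heap.offer(s);
            } else if (s > heap.peek()) {
                heap.poll();   // evict the current minimum
                heap.offer(s);
            }
        }
        int[] result = new int[heap.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = heap.poll(); // drained in ascending order
        }
        return result;
    }
    ```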

  2. Martin Cohen says:

    How about a simple-minded insertion sort into an array of size K? After a while, most entries would be immediately rejected (see the sketch below).

    1. Haider Ali says:

      Insertion sort could be a simpler solution, but it won’t beat the binary-heap solution: its worst-case complexity is O(nK), versus O(n log K) for the heap.
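
    A rough sketch of the insertion approach above (hypothetical helper; assumes we want the k smallest values). The buffer stays sorted, and once it is full most values are rejected with one comparison against the largest kept element:

    ```java
    import java.util.Arrays;

    // Sketch: maintain a sorted buffer of the k smallest values seen so far.
    // Once the buffer is full, any value >= buf[k-1] is rejected immediately;
    // otherwise it is shifted into place, dropping the current maximum.
    static int[] kSmallestByInsertion(int[] values, int k) {
        int[] buf = new int[k];
        int size = 0;
        for (int v : values) {
            if (size == k && v >= buf[k - 1]) {
                continue; // rejected with a single comparison
            }
            int i = (size == k) ? k - 1 : size++;
            while (i > 0 && buf[i - 1] > v) {
                buf[i] = buf[i - 1]; // shift larger entries right
                i--;
            }
            buf[i] = v;
        }
        return Arrays.copyOf(buf, size);
    }
    ```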

  3. KWillets says:

    Vertica does top-K optimization. A lot of queries wouldn’t finish without it.

  4. Thibaut says:

    Note that from the computer science perspective it is possible to do better than the heap solution:

    1. Find the Kth smallest element in the list using the quickselect algorithm. This takes O(N) time.

    2. Use the Kth smallest element as the pivot to partition the array, keeping only the elements smaller than or equal to the pivot (similar to what is done in quicksort). This also takes O(N) time.

    The resulting algorithm is linear (in expectation with random pivots, or even in the worst case with a median-of-medians pivot) and thus optimal. If the result needs to be sorted, this requires only an extra O(K log K), for an overall running time of O(N + K log K).

    However, this algorithm seems to have worse locality of reference than the heap-based one, so I wouldn’t expect it to perform better in practice. It could be interesting to try, though; see the sketch below.
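
    A sketch of this quickselect-then-partition idea in Java (names are hypothetical; with random pivots the O(N) bound holds in expectation, and a worst-case guarantee would require a median-of-medians pivot). Note that it rearranges the input array in place:

    ```java
    import java.util.Arrays;
    import java.util.Random;

    // Sketch: rearrange a so that its k smallest values occupy positions
    // 0..k-1, then return them sorted: expected O(n + k log k) overall.
    static int[] kSmallestByQuickSelect(int[] a, int k) {
        Random rng = new Random();
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            // Lomuto partition around a randomly chosen pivot.
            swap(a, lo + rng.nextInt(hi - lo + 1), hi);
            int pivot = a[hi];
            int store = lo;
            for (int i = lo; i < hi; i++) {
                if (a[i] < pivot) {
                    swap(a, i, store++);
                }
            }
            swap(a, store, hi); // the pivot lands at its final position
            if (store == k - 1) break;       // 0..k-1 now hold the k smallest
            if (store < k - 1) lo = store + 1;
            else hi = store - 1;
        }
        int[] result = Arrays.copyOf(a, k);
        Arrays.sort(result); // the optional extra O(k log k) step
        return result;
    }

    static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
    ```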

  5. Rasmus Pagh says:

    Did you try QuickSelect followed by a linear scan? It should outperform the heap, at least if k is not too small.

    1. Michel Lemay says:

      (Guava uses this.)

  6. Michel Lemay says:

    For the sake of comparison, one could use Guava's greatestOf/leastOf, which runs in O(n + k log k). In my tests, it outperforms any other off-the-shelf approach by a large margin for any combination of K and N; a usage sketch follows below.

    https://plus.google.com/+googleguava/posts/QMD74vZ5dxc
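
    For reference, a quick usage sketch (the values are made up; leastOf returns its result ordered from least to greatest, greatestOf from greatest to least):

    ```java
    import com.google.common.collect.Ordering;
    import java.util.Arrays;
    import java.util.List;

    public class GuavaTopK {
        public static void main(String[] args) {
            List<Integer> values = Arrays.asList(5, 1, 9, 3, 7, 2, 8);
            System.out.println(Ordering.natural().leastOf(values, 3));    // [1, 2, 3]
            System.out.println(Ordering.natural().greatestOf(values, 3)); // [9, 8, 7]
        }
    }
    ```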

  7. Tim Armstrong says:

    The database engine I work on (Impala) uses the binary heap solution for top-k (ORDER BY/LIMIT) queries. It’s a common pattern in practice (and in benchmarks). I suspect other engines do too. Maybe Postgres is an outlier.

  8. KWillets says:

    As coincidence would have it, I just ran across the Postgres commit for top-N sorting: https://postgrespro.com/list/thread-id/1270227

    Apparently it shows up in the plan as a regular sort, since it uses the same module and switches modes only when the tuple count is large: https://github.com/postgres/postgres/blob/2df537e43fdc432cccbe64de166ac97363cbca3c/src/backend/utils/sort/tuplesort.c#L1587

  9. Haider Ali says:

    I got exactly the same question in an Amazon interview and, like an average programmer, I followed the sorting approach and failed the interview miserably. I then spent some time figuring out what went wrong and came to know about the binary-heap way of solving this problem. Still, the RDBMS angle is a good way of approaching this in real-world scenarios. People have shared other solutions too; @Thibaut's is a good one that I would definitely like to try.

  10. Gregory says:

    Hello Daniel,

    There’s a bit of inconsistency in this post:
    – you say you’re generating 1408 random integers but then say blocks is set to 10
    – you say you’re looking for the 128 smallest integers but with defaultcomparator you’re polling the smallest one each time

    1. you say you’re generating 1408 random integers but then say blocks is set to 10

      There are (blocks + 1) * 128 values being generated. You can verify that 11 * 128 is indeed 1408.

    2. you say you’re looking for the 128 smallest integers but with defaultcomparator you’re polling the smallest one each time

      Yes, I have updated the blog post to reflect the fact that we seek the 128 largest values.

      I also corrected a critical mistake in the code samples: the sorting code was flat-out wrong (incredibly).