Daniel Lemire's blog

, 9 min read

External-Memory Sorting in Java

11 thoughts on “External-Memory Sorting in Java”

  1. Jon Elsas says:

    Changed instances of Vector to ArrayList. Vector is synchronized, which is unnecessary in this context.

    http://pastebin.com/zNKCYL5H

    (I also agree — nice simple use of PriorityQueue. Might have to steal this idiom 🙂 )

  2. Thierry Faure says:

    I don’t look deeply in your implementation

    There an open source (LGPL) implementation
    referenced in this thread
    http://forums.sun.com/thread.jspa?threadID=5310310 in colaboration with yahoo
    Witch seem to have “good” performance

    My 2 cents
    Thierry

  3. Philippe Beaudoin says:

    Why not put it on Google Code right away? They have a nice code review tool that works better than pastebin to comment code or ask a question on a specific line.

  4. Philippe Beaudoin says:

    I like your approach! Great use of PriorityQueue and the Comparable interface on your BinaryFileBuffer. Short and simple.

    I made a couple of changes on pastebin to get rid of a potential infinite loop. But beside that, it looks good to me. (Although I think 8 spaces is way too much for tabs. I go with 2. 😉 )

  5. Philippe Beaudoin says:

    Forgot the link:
    http://pastebin.com/yriK3tKf

  6. Francois Laflamme says:

    My command of Java is quite poor, but that’s beside the point: I did not originally understand the algorithm. If I have a list of, say, 100 numbers, split that list into 10 lists of 10 numbers each, sort each of them individually and then merge the result, I will still need to sort the new 100-number list. However, there are advantages to doing things that way – which are explained here: http://en.wikipedia.org/wiki/Mergesort

    “1- A small list will take fewer steps to sort than a large list.
    2- Fewer steps are required to construct a sorted list from two sorted lists than two unsorted lists. For example, you only have to traverse each list once if they’re already sorted […].”

    This old pony learnt a new trick today…

  7. Philippe Beaudoin says:

    I think the key is that you don’t have to load the smaller files entirely: just read the first number in each of the 10 files and you’ll know the smallest (remaining) entry. Pop the smallest entry out out of the file where you found it and send it directly to the output file. As a result, you never had to keep the 100 numbers in memory.

  8. I think you can make the BinaryFileBuffer class simpler by removing the ‘buf’ list and instead increasing the size of the buffer internal to the BufferedReader.

    The performance was about the same in my tests.

    Here is the change: http://pastebin.com/ene9CB8i

  9. Daniel Haran says:

    I propose https://gist.github.com/c87c3a3c92889a773d15 to make BinaryFileBuffer more cohesive

  10. Daniel Haran says:

    Reading through the code I think found one unneeded conditional:
    while(line != null) {

    When I created a big file to test, I ran out of memory. Increasing the
    heap size, java -Xmx128m or more, it only ever created one temp file.
    I believe that’s because you check free memory rather than what has
    been used so far:
    while((Runtime.getRuntime().freeMemory() > 2097152)
    &&( (line = fbr.readLine()) != null) ){ // as long as you have 2MB
    tmplist.add(line);

    If more than one file was created, I do not know the behaviour if this
    would overwrite:
    File newtmpfile = File.createTempFile(“sortInBatch”, “flatfile”);

    As it is, for a large file, the Java version is faster, however
    sometimes the result is not what I get on the command-line with sort
    (I get a smaller file!). I haven’t been able to properly isolate the bug. It
    only happens with fairly large files though.

    $ time java -Xmx512m ExternalSort to_sort10.txt to_sort10.txt.java.out

    real 0m24.436s
    user 0m24.937s
    sys 0m1.669s

    $ time sort to_sort10.txt > to_sort10.txt.sort.out

    real 2m41.615s
    user 2m35.640s
    sys 0m1.876s

    -rw-r–r– 1 danielharan danielharan 54288300 4 Apr 00:55 to_sort10.txt
    -rw-r–r– 1 danielharan danielharan 39720054 4 Apr 01:21
    to_sort10.txt.java.out #EEEK
    -rw-r–r– 1 danielharan danielharan 54288300 4 Apr 01:26
    to_sort10.txt.sort.out

  11. Thomas Jung says:

    I suppose there are some issues with your code:
    – The check of remaining memory (Runtime.getRuntime().freeMemory() > 2097152) is not valid. The Vector will douple its size if it ran out of space. This may allocate more than 2Mb.
    – All files to merge are kept open. The OS could run out of file handles if there are to many files to merge.
    – The files to merge are not closed savely. If an exception is thrown they may be still opened.

    Alternative approaches merge files in multiple runs until there are no files to merge left. This is probably only needed here if too many files have to be merged at once.