Why not put it on Google Code right away? They have a nice code review tool that works better than pastebin for commenting on code or asking a question about a specific line.
Philippe Beaudoin says:
I like your approach! Great use of PriorityQueue and the Comparable interface on your BinaryFileBuffer. Short and simple.
I made a couple of changes on pastebin to get rid of a potential infinite loop. But besides that, it looks good to me. (Although I think 8 spaces is way too much for tabs. I go with 2. 😉 )
Forgot the link:
http://pastebin.com/yriK3tKf
My command of Java is quite poor, but that’s beside the point: I did not originally understand the algorithm. If I have a list of, say, 100 numbers, split that list into 10 lists of 10 numbers each, sort each of them individually, and then merge the result, surely I still need to sort the new 100-number list? But there are advantages to doing things that way, which are explained here: http://en.wikipedia.org/wiki/Mergesort
“1- A small list will take fewer steps to sort than a large list.
2- Fewer steps are required to construct a sorted list from two sorted lists than two unsorted lists. For example, you only have to traverse each list once if they’re already sorted […].”
This old pony learnt a new trick today…
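Point 2 is the crux. A one-pass merge of two already-sorted lists, sketched below purely for illustration (not code from the post), visits each element exactly once and never re-sorts the combined result:

import java.util.ArrayList;
import java.util.List;

public class TwoWayMerge {
    // Merge two already-sorted lists in one pass over each:
    // O(n + m) comparisons, no re-sorting of the combined output.
    static List<Integer> merge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<Integer>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            // Always take the smaller head; each element is visited once.
            if (a.get(i) <= b.get(j)) out.add(a.get(i++));
            else out.add(b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++)); // drain leftovers
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }
}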
Philippe Beaudoin says:
I think the key is that you don’t have to load the smaller files entirely: just read the first number in each of the 10 files and you’ll know the smallest (remaining) entry. Pop the smallest entry out of the file where you found it and send it directly to the output file. As a result, you never have to keep all 100 numbers in memory.
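To make that concrete, here is a minimal, self-contained sketch of the idea (my own illustration with assumed names, not the post’s actual code): each sorted run gets a small peek-able buffer, and a PriorityQueue always surfaces the run whose current head line is smallest.

import java.io.*;
import java.util.*;

public class KWayMergeSketch {
    // Wraps one sorted run file; always holds that run's next unconsumed line.
    static class RunBuffer implements Comparable<RunBuffer> {
        final BufferedReader reader;
        String peek; // smallest remaining line of this run

        RunBuffer(File run) throws IOException {
            // A generous internal buffer; no explicit line cache needed.
            reader = new BufferedReader(new FileReader(run), 1 << 16);
            peek = reader.readLine();
        }

        boolean isEmpty() { return peek == null; }

        String pop() throws IOException {
            String answer = peek;
            peek = reader.readLine(); // advance to the run's next line
            return answer;
        }

        public int compareTo(RunBuffer other) {
            return peek.compareTo(other.peek);
        }
    }

    // Merge pre-sorted run files into a single sorted output file.
    static void merge(List<File> runs, File out) throws IOException {
        PriorityQueue<RunBuffer> queue = new PriorityQueue<RunBuffer>();
        for (File run : runs) {
            RunBuffer rb = new RunBuffer(run);
            if (!rb.isEmpty()) queue.add(rb); // only queue non-empty runs
        }
        BufferedWriter writer = new BufferedWriter(new FileWriter(out));
        try {
            while (!queue.isEmpty()) {
                RunBuffer smallest = queue.poll(); // run with the smallest head
                writer.write(smallest.pop());
                writer.newLine();
                if (!smallest.isEmpty()) queue.add(smallest); // re-queue the run
                else smallest.reader.close();                 // run exhausted
            }
        } finally {
            writer.close();
        }
    }
}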
I think you can make the BinaryFileBuffer class simpler by removing the ‘buf’ list and instead increasing the size of the buffer internal to the BufferedReader. The performance was about the same in my tests.
Here is the change: http://pastebin.com/ene9CB8i
I propose https://gist.github.com/c87c3a3c92889a773d15 to make BinaryFileBuffer more cohesive.
Reading through the code, I think I found one unneeded conditional:
while(line != null) {
When I created a big file to test, I ran out of memory. After increasing the heap size (java -Xmx128m or more), it only ever created one temp file. I believe that’s because you check free memory rather than what has been used so far:
while ((Runtime.getRuntime().freeMemory() > 2097152)    // as long as you have 2 MB
        && ((line = fbr.readLine()) != null)) {
    tmplist.add(line);
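A more robust check (my sketch, not the post’s code) would estimate the headroom left before the heap hits its -Xmx ceiling, since freeMemory() only reports the unused part of the heap the JVM has already allocated:

public final class MemoryCheck {
    // freeMemory() counts only the free part of the already-allocated heap;
    // the JVM can still grow the heap up to maxMemory() (the -Xmx limit).
    public static long estimateAvailableMemory() {
        Runtime r = Runtime.getRuntime();
        long used = r.totalMemory() - r.freeMemory(); // bytes actually in use
        return r.maxMemory() - used;                  // headroom up to -Xmx
    }
}

The batch loop would then test estimateAvailableMemory() > 2097152 instead of freeMemory() alone.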
If more than one file was created, I do not know whether this would overwrite:
File newtmpfile = File.createTempFile("sortInBatch", "flatfile");
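For what it’s worth, it should not overwrite anything: every File.createTempFile call generates a fresh, uniquely named file. A quick standalone check (my snippet, reusing the post’s prefix and suffix):

import java.io.File;
import java.io.IOException;

public class TempFileCheck {
    public static void main(String[] args) throws IOException {
        // Each call creates a new, empty, uniquely named file.
        File a = File.createTempFile("sortInBatch", "flatfile");
        File b = File.createTempFile("sortInBatch", "flatfile");
        System.out.println(a); // e.g. /tmp/sortInBatch812...flatfile
        System.out.println(b); // a different path; nothing is clobbered
        a.deleteOnExit();
        b.deleteOnExit();
    }
}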
As it is, for a large file, the Java version is faster; however, sometimes the result is not what I get on the command line with sort (I get a smaller file!). I haven’t been able to properly isolate the bug. It only happens with fairly large files, though.
$ time java -Xmx512m ExternalSort to_sort10.txt to_sort10.txt.java.out
real 0m24.436s
user 0m24.937s
sys 0m1.669s
$ time sort to_sort10.txt > to_sort10.txt.sort.out
real 2m41.615s
user 2m35.640s
sys 0m1.876s
-rw-r--r-- 1 danielharan danielharan 54288300 4 Apr 00:55 to_sort10.txt
-rw-r--r-- 1 danielharan danielharan 39720054 4 Apr 01:21 to_sort10.txt.java.out #EEEK
-rw-r--r-- 1 danielharan danielharan 54288300 4 Apr 01:26 to_sort10.txt.sort.out
I see a few issues with your code:
– The check of remaining memory (Runtime.getRuntime().freeMemory() > 2097152) is not valid: the Vector doubles its size when it runs out of space, which may allocate more than 2 MB at once.
– All files to merge are kept open. The OS could run out of file handles if there are too many files to merge.
– The files to merge are not closed safely. If an exception is thrown, they may be left open.
Alternative approaches merge files in multiple passes until no files are left to merge. That is probably only needed here if too many files have to be merged at once; the sketch below shows the idea.
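A minimal sketch of that multi-pass strategy (my illustration, not the post’s code; it reuses the k-way merge sketch from earlier, and maxOpenFiles is an assumed parameter):

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class MultiPassMerge {
    // Merge sorted runs in rounds of at most maxOpenFiles files each,
    // so the OS never has to keep too many handles open at once.
    // Assumes at least one run is supplied.
    static File mergeInPasses(List<File> runs, int maxOpenFiles) throws IOException {
        while (runs.size() > 1) {
            List<File> nextRound = new ArrayList<File>();
            for (int i = 0; i < runs.size(); i += maxOpenFiles) {
                List<File> batch =
                    runs.subList(i, Math.min(i + maxOpenFiles, runs.size()));
                File merged = File.createTempFile("mergePass", "flatfile");
                KWayMergeSketch.merge(batch, merged); // earlier sketch; closes drained readers
                for (File f : batch) f.delete();      // discard consumed runs
                nextRound.add(merged);
            }
            runs = nextRound;
        }
        return runs.get(0);
    }
}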
I changed instances of Vector to ArrayList: Vector is synchronized, which is unnecessary in this context.
http://pastebin.com/zNKCYL5H
(I also agree: nice, simple use of PriorityQueue. Might have to steal this idiom 🙂 )
I didn’t look deeply into your implementation, but there is an open-source (LGPL) implementation, written in collaboration with Yahoo, referenced in this thread:
http://forums.sun.com/thread.jspa?threadID=5310310
It seems to have “good” performance.
My 2 cents,
Thierry