Thanks Suresh. To implement a “Knuth shuffle”, don’t you need random access to the items to swap them? For my problem, I specify a “variable-length-record flat file”. If I am allowed to turn it into a fixed-length-record flat file with random access to the lines, I might as well throw the data in a DBMS. Hence, there is the idea of shuffling the numbers from 1 to n, which can be done in O(n) time using the Knuth shuffle [which languages like Java or Python conveniently provide], prepend them to the lines, and then sorting the lines…
Or maybe you mean that I can scale up the shuffling of the numbers from 1 to n to values of n much greater than 100,000,000? Well, maybe I was being a pessimist…
Sureshsays:
What’s the problem with a shuffle ? using the standard Knuth trick, you can generate a shuffle as easily as choosing random numbers.
Sureshsays:
True, but you’re only shuffling the range [1..n], not the actual numbers, no ? I’m guessing that it won’t scale beyond what you can fit in memory though.
I edited my post to remove the CRC64 idea since I believe that “sort –random-sort” does exactly this, though they don’t specify their hashing family.
Davidsays:
I like the CRC64 solution, although, why not go even crazier and use a SHA algorithm? The disadvantage of these approaches is that you need one pass to generate the random number (prepending it to the line) and another pass to sort the lines.
Unf. my sha1sum command doesn’t seem to accept input from the command line, making it difficult to answer with a nice one liner.
Thanks Suresh. To implement a “Knuth shuffle”, don’t you need random access to the items to swap them? For my problem, I specify a “variable-length-record flat file”. If I am allowed to turn it into a fixed-length-record flat file with random access to the lines, I might as well throw the data in a DBMS. Hence, there is the idea of shuffling the numbers from 1 to n, which can be done in O(n) time using the Knuth shuffle [which languages like Java or Python conveniently provide], prepend them to the lines, and then sorting the lines…
Or are you thinking about some other algorithm? Here is my reference:
http://en.wikipedia.org/wiki/Knuth_shuffle
Or maybe you mean that I can scale up the shuffling of the numbers from 1 to n to values of n much greater than 100,000,000? Well, maybe I was being a pessimist…
What’s the problem with a shuffle ? using the standard Knuth trick, you can generate a shuffle as easily as choosing random numbers.
True, but you’re only shuffling the range [1..n], not the actual numbers, no ? I’m guessing that it won’t scale beyond what you can fit in memory though.
Thanks David.
I edited my post to remove the CRC64 idea since I believe that “sort –random-sort” does exactly this, though they don’t specify their hashing family.
I like the CRC64 solution, although, why not go even crazier and use a SHA algorithm? The disadvantage of these approaches is that you need one pass to generate the random number (prepending it to the line) and another pass to sort the lines.
Unf. my sha1sum command doesn’t seem to accept input from the command line, making it difficult to answer with a nice one liner.