@Bannister

I’d be fine with quasi-shuffling, as long as it has nice properties.
My practical motivation is this: I’m annoyed that no standard Unix tool does random line shuffling over very large files.
My second motivation is that I’m annoyed people consider the shuffling problem “solved” by assuming that you have fixed-length records.
And finally, I think you cannot just “shuffle” locally. Think about a sorted file. If you shuffle it only locally, it will still be “almost sorted” (no line starting with the letter ‘z’ will appear in the first few lines).
@Bannister

The algorithm I propose is close to a classical merge-sort algorithm. It fails to be linear (I think) because picking blocks at random, with uneven probabilities, is hard.
Maybe this can be fixed.
Christopher Hardin says:
(Sorry to reply to something ancient.)
Your proposed algorithm (“A better solution?”) looks linear to me (assuming we treat generating a random integer from a range as O(1)). I don’t follow your point about uneven probabilities: You can assign each string uniformly to one of the temporary files. The result is still an unbiased shuffle. (Strictly speaking, there’s an astronomically small chance that one of the temp files gets so big you can’t shuffle it in memory. Start over or recurse if that happens; the algorithm is still unbiased and the expected extra cost from the possible reruns is very low.)
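To make that concrete, here is a minimal Python sketch of the scatter-then-shuffle scheme (uniform assignment to temporary files, then an in-memory shuffle of each); the function name, bucket count, and file naming are illustrative, every input line is assumed to end with a newline, and the rare oversized-bucket case is not handled.

```python
import os
import random

def external_shuffle(input_path, output_path, num_buckets=16, seed=None):
    # Scatter each line uniformly at random into one of num_buckets
    # temporary files, then shuffle each bucket in memory and append
    # the shuffled buckets to the output in order.
    rng = random.Random(seed)
    bucket_paths = ["%s.bucket%d" % (output_path, i) for i in range(num_buckets)]
    buckets = [open(p, "w") for p in bucket_paths]
    try:
        with open(input_path) as src:
            for line in src:
                buckets[rng.randrange(num_buckets)].write(line)
    finally:
        for f in buckets:
            f.close()
    with open(output_path, "w") as dst:
        for p in bucket_paths:
            with open(p) as f:
                lines = f.readlines()  # assumed to fit in memory
            rng.shuffle(lines)
            dst.writelines(lines)
            os.remove(p)
```

Because each line is assigned to a bucket independently and uniformly, and the order within each bucket is uniform, concatenating the buckets in a fixed order still yields an unbiased shuffle.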
Using a random-keyed sort to implement a shuffle is well known to give rise to biased distributions, so that method won’t work without some heroic effort.
If you can identify a record boundary by scanning backwards (always true for normal Unix text files), it’s simple to use Fisher-Yates shuffle with external storage and random access, in just the same way as you’d implement binary search. I’ve a suspicion that the time complexity isn’t theoretically linear, though.
John says:
I don’t think assigning a fixed random key to each record (or using a good hash function) is biased. Only using a random, record-independent result for the sort comparison function will create bias.
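For what it’s worth, a small sketch (in Python, with made-up names) of the unbiased variant John describes: give each record one fixed random key and sort by that key; bias only enters if the comparison function itself flips a coin on every call.

```python
import random

def shuffle_by_fixed_random_keys(records, seed=None):
    # Decorate each record with one fixed random key, sort by the key
    # (ties broken by position, never by comparing records), undecorate.
    rng = random.Random(seed)
    decorated = [(rng.random(), i, rec) for i, rec in enumerate(records)]
    decorated.sort()
    return [rec for _, _, rec in decorated]
```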
First question – what is the aim of this exercise?
You probably do not need a completely-random shuffle over the entire bulk of the file. This makes a large difference in the appropriate algorithms. What do you need to get meaningful results from your current exercise?
Fair enough. Note that I was thinking more along the lines of the classic sort-merge algorithm, except shuffling rather than sorting. Runtime could be effectively linear, equal to reading and writing all the file data twice (with minimal seeks). Theoretically there would be a non-linear component, but with a very much smaller constant multiplier.
If local shuffling were sufficient, the runtime could be reduced to a single read/write of the input.
Yes, this is the pragmatist talking to the theoretician. 🙂
The reason for the lack of a generic optimal random file-shuffle is that the underlying reason for wanting a shuffle is not, um … random. The base requirement changes which algorithms are most suitable.
Still not sure about your base need, but try this:
1. Create N files as output buckets.
2. For each line, use a hash to choose the output bucket.
3. For each line written to the final output, read one line from a random bucket.
4. Measure and adjust.
The hash could be a random number generator. Note you do not want:
* Too many buckets (OS limits on number of open files that can be efficiently handled).
* Too-small buckets (inefficient small I/O).
* Hashing that clusters with negative effect on your base purpose.
With careful use of buffering you can make this run at full disk read/write rates with minimal CPU usage.
(Why do I always notice needed edits right AFTER posting??)
J.D. Park says:
It looks to me like it isn’t possible. The survey paper “External Memory Algorithms and Data Structures: Dealing with Massive Data” by Jeffrey Scott Vitter says that permuting has the same I/O complexity as sorting for non-trivial block sizes (it’s in Section 5.5 if you are interested). That seems to imply that a truly linear I/O-complexity external shuffle isn’t possible even with fixed-size records.
Craig Kelly says:
You could create an “index file” for variable-length records where each fixed-size record has 2 fields: offset and length. This fixed-length file could be shuffled and then used to re-read the original file.
While perhaps not as elegant as Preston L. Bannister’s solution, it would be pretty simple to implement.
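A rough Python sketch of that index-file idea (the names are made up): record an (offset, length) pair per line, shuffle those fixed-size entries, then re-read the original file in the shuffled order. For simplicity the index is kept in memory here; for a truly huge file the index itself would have to be shuffled externally, and the one-seek-per-line reads are where the cost shows up.

```python
import random

def shuffle_lines_via_index(input_path, output_path, seed=None):
    rng = random.Random(seed)
    # Build a fixed-size index entry (offset, length) for each line.
    index = []
    with open(input_path, "rb") as src:
        offset = 0
        for line in src:
            index.append((offset, len(line)))
            offset += len(line)
    rng.shuffle(index)  # Fisher-Yates on the fixed-size index entries
    # Re-read the original file in shuffled order: one seek per line.
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        for offset, length in index:
            src.seek(offset)
            dst.write(src.read(length))
```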
@Craig Yes, it would be simple to implement, but it *might* be quite a bit slower. Even with fixed-length records, the Fisher-Yates algorithm might just be fundamentally slow when used with external memory. Indeed, how do you leverage your fast internal memory? Bannister’s algorithm is nice because a good part of the work is done in RAM. You don’t write all over the disk, all the time.
Nick Downing says:
If memory size is e.g. 100MB, I was thinking of reading 100MB at a time from the input, shuffling it and writing it out to a tempfile so that I have tempfile.1, tempfile.2, …, tempfile.n each 100MB in size.
Then I perform an n-way merge in such a way that if there are c1 items remaining in tempfile.1, c2 items remaining in tempfile.2 and so on up to cn, then the probability that an item is taken from the head of tempfile.i would be ci/(c1 + c2 + … + cn), is this correct?
The only real issue with this is that I want my shuffling to be repeatable based on the seed given by the user in the beginning, so if I do it this way then the output will depend on both the seed AND the memory size. I could set some fixed, conservative memory size such as 100MB and always shuffle on that basis, but this might get inefficient for really big files, say 10GB, since it’d have to create 100 tempfiles.

Any ideas?
Sadly, what you describe is maybe not a fair shuffle. You need to show that all N! possible permutations are equally likely. To do that, you can proceed a bit differently. Take each and every input value and move it to any one of n containers with equal probability. The catch here is that some of the containers might get quite a bit larger than others… you can’t help that. Shuffle each container in memory… then merge it all.
Ivan says:
Why wouldn’t Nick’s recipe be fair? It seems to me that at any given time, if there are N remaining items in total, the probability of picking any particular remaining item is 1/N.
Let’s say a particular remaining item is in file i. The probability of picking the right file is ci / (c1 + c2 + … + cn), using weighted random selection. (Picking files with equal probabilities doesn’t work.)
Then, if you picked the right file i, the probability that the item will be at the head of the file is 1/ci, because the file is shuffled.
Multiply those two together and you get 1/(c1 + c2 + … + cn) == 1/N.
Isn’t the Fisher-Yates algorithm basically “pick one of the remaining input items at random with equal probability (i.e., 1/N), remove it from the input, add it to the output, rinse and repeat”? The above seems equivalent to me.
FWIW I did a little brute-force experiment, doing the above shuffle many times with small inputs such that the permutations can be fully sampled, and using buckets of different sizes to see if that introduces a bias, and I didn’t see any. For example, with 6 elements and 720,000 shuffles, all 720 permutations are seen 1000 +/- 100 times, which seems reasonable enough (within about 3 std. devs), regardless of whether I start with two 3-item buckets, or a 1-item bucket and a 5-item bucket, or an empty bucket and a 6-item bucket.
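To make the recipe under discussion concrete, here is a small in-memory Python sketch (names are made up) of the merge step: repeatedly pick source i with probability ci over the total number of remaining items and take its head. In the external version each chunk would be a pre-shuffled temporary file read sequentially; Ivan’s brute-force check can be reproduced by running this on small inputs and counting permutations.

```python
import random

def weighted_merge(shuffled_chunks, rng=random):
    # Yield items by repeatedly drawing the next one from chunk i with
    # probability (items remaining in chunk i) / (items remaining overall),
    # then advancing that chunk's read position.
    positions = [0] * len(shuffled_chunks)
    remaining = sum(len(chunk) for chunk in shuffled_chunks)
    while remaining:
        r = rng.randrange(remaining)
        for i, chunk in enumerate(shuffled_chunks):
            left = len(chunk) - positions[i]
            if r < left:
                yield chunk[positions[i]]
                positions[i] += 1
                break
            r -= left
        remaining -= 1
```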
You might be correct.

An easy (lazy) way to prove that something is biased is just to calculate directly the probability for 2 or 3 elements. This is only a sufficient condition for bias, of course, not a necessary one, but it seems to work well in practice.
Note that doing weighted sampling quickly can be tricky.
In the worst case, you need to do a scan through a list of possible outcomes, and if this list of possible outcomes is even a bit large, it is going to be a bummer.
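As an illustration, the simple weighted pick is a linear scan over the outcomes; precomputing running sums and bisecting makes each pick O(log n), but the sums must be rebuilt whenever the weights change, which is exactly what happens during a merge. A sketch (the function name is made up):

```python
import bisect
import itertools
import random

def weighted_index(weights, rng=random):
    # Pick index i with probability weights[i] / sum(weights) by
    # bisecting over the running sums. Rebuilding the sums on every
    # call is O(n), so this only pays off when they can be cached
    # across many picks with unchanged weights.
    cumulative = list(itertools.accumulate(weights))
    r = rng.random() * cumulative[-1]
    i = bisect.bisect_right(cumulative, r)
    return min(i, len(weights) - 1)  # guard against floating-point edge cases
```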
Jon Degenhardt says:
I had this same need, but also for weighted shuffling. What I ended up doing was similar to the sort --random-sort Unix pipeline strategy in your earlier blog post. However, it uses GNU sort’s “numeric” sort option, which is much faster, combined with a sampling tool I wrote, tsv-sample. Performance will vary quite considerably between hardware and data sets; however, on a data set I used for testing, processing went from 4 hours 53 minutes for sort --random-sort to 14 minutes. There is a more detailed description here: Shuffling large files, in the eBay TSV Utilities GitHub repository.
It’d be very interesting of course to implement the shuffling algorithm described in your blog post here. It should work nicely for unweighted shuffling. I haven’t come up with a good approach for the weighted version, though.
How do you define “weighted shuffling”?

I just mean weighted random sampling (without replacement), except creating a full sampling order. The more common use case is to produce a subsample in the weighted order; weighted reservoir sampling usually works for this. The algorithm I used is based on approaches described by Pavlos Efraimidis and Paul Spirakis; see Weighted Random Sampling over Data Streams. However, it can also be useful to generate the weighted order for a full data set. Perhaps then it should be called something other than “shuffling”, I don’t know.
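For reference, a full weighted order along the lines described by Efraimidis and Spirakis can be produced by keying item i with -ln(ui)/wi (ui uniform) and sorting by the key; items with larger weights tend to get smaller keys and so come earlier. This is a generic sketch of that technique, not the actual tsv-sample implementation.

```python
import math
import random

def weighted_order(items, weights, seed=None):
    # Key item i with -ln(u_i) / w_i, u_i uniform in (0, 1]; sorting by
    # the key in increasing order yields a weighted sampling order
    # (equivalent to sorting u_i ** (1 / w_i) in decreasing order).
    rng = random.Random(seed)
    keyed = [(-math.log(1.0 - rng.random()) / w, item)
             for item, w in zip(items, weights)]
    keyed.sort(key=lambda kv: kv[0])
    return [item for _, item in keyed]
```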