FWIW for large data arrays I’d recommend a 3-pass radix sort (using 11, 11 and 10 bits of the input values). This requires a bit more space, with the histogram/offset array adding up to 20 KB but that still fits within L1 reasonably comfortably and as such usually wins significantly on moderately large inputs.
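To make the 11/11/10 split concrete, here is a hypothetical digit-extraction helper and the memory arithmetic behind the 20 KB figure (the function name `radix_digit` is illustrative, not from any real codebase):

```cpp
#include <cstddef>
#include <cstdint>

// The three histogram/offset arrays for an 11/11/10-bit split hold
// 2048 + 2048 + 1024 uint32_t counters, i.e. 20 KiB in total.
constexpr std::size_t kHistogramBytes = (2048 + 2048 + 1024) * sizeof(uint32_t);

// Digit of x for the given pass (pass 0 = lowest bits).
inline uint32_t radix_digit(uint32_t x, int pass) {
    switch (pass) {
        case 0: return x & 0x7FF;          // low 11 bits  (2048 buckets)
        case 1: return (x >> 11) & 0x7FF;  // middle 11 bits (2048 buckets)
        default: return x >> 22;           // top 10 bits  (1024 buckets)
    }
}
```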
Travis Downs says:
At least with a C++ LSD radix sort, I found 8 bits generally better than 11 bits: performance started dropping off heavily right past 8 bits, as there are too many buckets to fit comfortably in the L1 cache (typically 512 cache lines on x86) and the L1 TLB.
Of course, this depends a lot on the hardware, the size of the data being sorted, and other details of the implementation (e.g., if there is an intermediate buffering step where smaller buckets are accumulated before being copied out to the final buckets, larger radices may perform better).
Especially, it depends on the data distribution: if many fewer than the available 2^radix buckets are actually used, larger radices are better since the caching penalty is reduced or eliminated. In principle, one could look at the histogram and try to pick the radix dynamically based on what’s likely to be good for that distribution.
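For reference, a minimal sketch of the kind of 8-bit (4-pass) LSD radix sort being discussed, for `uint32_t` keys; this is just the shape of the idea, with none of the tuning (buffered writes, column skipping, etc.) mentioned in this thread:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// 4-pass LSD radix sort over 8-bit digits, ping-ponging between
// the input vector and a temporary buffer.
void radix_sort_u32(std::vector<uint32_t>& v) {
    std::vector<uint32_t> tmp(v.size());
    uint32_t* src = v.data();
    uint32_t* dst = tmp.data();
    for (int pass = 0; pass < 4; ++pass) {
        int shift = pass * 8;
        uint32_t count[256] = {};
        for (std::size_t i = 0; i < v.size(); ++i)
            count[(src[i] >> shift) & 0xFF]++;
        // Exclusive prefix sum turns counts into starting offsets.
        uint32_t sum = 0;
        for (int b = 0; b < 256; ++b) {
            uint32_t c = count[b];
            count[b] = sum;
            sum += c;
        }
        for (std::size_t i = 0; i < v.size(); ++i)
            dst[count[(src[i] >> shift) & 0xFF]++] = src[i];
        std::swap(src, dst);
    }
    // After an even number of passes the sorted data is back in v.
}
```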
Travis Downs says:
I want to retract my use of “generally better” in the first sentence. I should really say “I found 8 bits better in my specific scenario of sorting uniformly random values”.
You’re right; for arrays of random integers indeed 4-pass sort seems to often win by a bit. I need to update my priors; my data was mostly based on experience with in-order Cell systems, and I haven’t rebenchmarked this specific problem on modern Intel chips until now…
FWIW I’ve run the benchmark from https://gist.github.com/zeux/2e4edeef8a09ee0bfab3ec76858aaec1 on an AMD Ryzen 5900X and I’m getting different results than on Intel – on Intel the 4-pass sort wins by a fair margin, but on Ryzen the 3-pass sort is noticeably faster at sorting ints (and still a touch slower on floats, though the difference isn’t very significant).
George Spelvin says:
It’s possible to save some memory and only keep two levels’ histograms in memory at one time by creating the next level’s histogram as you’re distributing the current one.
But this probably isn’t worth it. Each histogram requires only 1KiB (it’s not hard to adjust the prefix sum loop to get rid of the 257th element), but you need 256 cache lines (16 KiB) for the active head of each bucket.
That’s probably why going to 11 bits is a disaster; a typical 64K L1D only has 1024 lines. So randomly writing to 2048 buffers is going to thrash the hell out of it.
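A sketch of the fused-histogram idea from this comment: while distributing pass k, also count the digits for pass k+1, so only two 256-entry histograms are live at a time, and the offsets are built with a running sum so no 257th element is needed. The names (`offset`, `next_count`) are illustrative, not from any real codebase:

```cpp
#include <cstdint>
#include <vector>

// Distribute src into dst by the byte at `shift`, and in the same loop
// build the histogram of the byte at `shift + 8` (the next pass's digit).
std::vector<uint32_t> distribute_and_count_next(const std::vector<uint32_t>& src,
                                                std::vector<uint32_t>& dst,
                                                int shift) {
    // Histogram for this pass, converted in place to starting offsets with a
    // running sum (no 257-entry array required).
    uint32_t offset[256] = {};
    for (uint32_t x : src) offset[(x >> shift) & 0xFF]++;
    uint32_t sum = 0;
    for (int b = 0; b < 256; ++b) {
        uint32_t c = offset[b];
        offset[b] = sum;
        sum += c;
    }

    std::vector<uint32_t> next_count(256, 0);
    for (uint32_t x : src) {
        dst[offset[(x >> shift) & 0xFF]++] = x;   // distribute pass k
        next_count[(x >> (shift + 8)) & 0xFF]++;  // histogram pass k+1
    }
    return next_count;
}
```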
This matches my experience with an 11-bit radix as well: it was surprisingly bad (on random input). I’m honestly surprised to see it recommended.
I think you’re better off using 8 bits and relying on column skipping to shave off passes (likely over the high-order bits) when the data isn’t random.
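A sketch of the column-skipping check mentioned above: if every key has the same byte in a given column, that pass’s histogram has at most one non-zero bucket and the pass is a no-op that can be skipped. This happens naturally with small integers, whose high bytes are all zero. (In a real sort you would reuse the histograms computed up front rather than rescanning like this.)

```cpp
#include <cstdint>
#include <vector>

// Returns true when sorting by the byte at `shift` would not move anything,
// i.e. at most one of the 256 buckets is occupied.
bool column_is_trivial(const std::vector<uint32_t>& v, int shift) {
    uint32_t count[256] = {};
    for (uint32_t x : v) count[(x >> shift) & 0xFF]++;
    int used = 0;
    for (int b = 0; b < 256; ++b) used += (count[b] != 0);
    return used <= 1;  // 0 or 1 buckets used: nothing to reorder, skip the pass
}
```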