Performance of ranged accesses into arrays: modulo, multiply-shift and masks
5 thoughts on “Performance of ranged accesses into arrays: modulo, multiply-shift and masks”
It seems like your modulo-approximation trick is very useful. I am certainly going to switch to using it at some point. It should make all the hashing tricks much faster.
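For context, a minimal C sketch of the multiply-shift reduction the post benchmarks (the function name fastrange32 is mine, not from the comment): it maps a 32-bit value x into [0, n) with one multiplication and one shift instead of a modulo.

#include <stdint.h>

// Map a 32-bit value x into [0, n) without a division:
// (x * n) >> 32 is approximately (x / 2^32) * n, which is a fair
// mapping when x is roughly uniform over the 32-bit range.
static inline uint32_t fastrange32(uint32_t x, uint32_t n) {
    return (uint32_t)(((uint64_t)x * (uint64_t)n) >> 32);
}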
Unless my memory is flawed, I think gather instructions are not particularly efficient microarchitecturally on current Core CPUs. (Intel has hinted that this might improve in the future, but under current architectural assumptions it would still be very hard to fetch data from more than two cache lines per cycle.) Anyway, this should be offset by vectorisation of the other operations (and the resulting reduction in instruction count, allowing many iterations to be reordered), if there are enough of them…
Gather is reasonably efficient on Skylake (and not terrible on Broadwell). It is definitely still limited to 2 loads per cycle, but it is now comparable to scalar loads, without much extra cost to get all the elements into the vector register. So if you actually want the loaded values in a vector register, it is “ideal”.
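For readers who have not used these instructions, here is a minimal AVX2 sketch of the gather being discussed (the helper name gather_eight is illustrative, not from the post): it loads the eight 32-bit elements array[indices[i]] into one vector register. Compile with AVX2 enabled (e.g. -mavx2).

#include <immintrin.h>
#include <stdint.h>

// Gather eight 32-bit elements array[indices[i]] into a single
// AVX2 register; the scale of 4 is the element size in bytes.
static inline __m256i gather_eight(const uint32_t *array,
                                   const uint32_t *indices) {
    __m256i idx = _mm256_loadu_si256((const __m256i *)indices);
    return _mm256_i32gather_epi32((const int *)array, idx, 4);
}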
There isn’t any optimization of overlapping or duplicate indices though, so if many of your indices are the same, you don’t get any speedup.
Interestingly, looking at Agner Fog’s instruction tables, Skylake client would seem to have (at least latency/µop-wise) gather instructions roughly as efficient as one might expect from a good implementation on that microarchitecture. At the same time, Skylake-X does not perform that well, and earlier architectures also do significantly worse than Skylake client. Anyway, on Skylake Intel has fulfilled their promise! (Then again… on AMD Ryzen, gather instructions are pretty awful.)
Reducing the number of cache accesses, especially for the hash table use case, doesn’t seem like a particularly attractive proposition. The custom logic would be pretty heavy and the benefit would materialise mostly for hash tables of a couple of kilobytes…
It got worse on Skylake-X? That’s weird…