It seems that the footnote is missing, and another sentence is truncated: “It could also discourage the”.
Thanks. These were additional edits that I forgot to discard.
Marcin Zukowski says:
Pretty cool!
Yeah, reading the tail of a stream is often tricky to get both fast and correct. I’ve seen a similar problem with hash functions for short strings and came up with a different approach (probably also applicable here), where we read “garbage” memory but can guarantee it’s “safe” by checking whether we’re close to a page boundary. I think the t1ha folks included that in their project. See here if you’re curious: https://marcinzukowski.github.io/hash-tailer/.
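Roughly, the check looks something like this (a quick sketch of the idea assuming 4 KiB pages, with a made-up load_tail helper, not the actual t1ha code):

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Load the last len (<= 16) bytes of a buffer with one SSE load when the
    // 16-byte window cannot cross into the next 4 KiB page; otherwise fall back
    // to a copy. Bytes beyond len are garbage and must be masked by the caller.
    inline __m128i load_tail(const std::uint8_t *p, std::size_t len) {
        std::uintptr_t offset_in_page = reinterpret_cast<std::uintptr_t>(p) & 4095;
        if (offset_in_page <= 4096 - 16) {
            return _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
        }
        std::uint8_t buf[16] = {0};   // near a page end: copy, then load safely
        std::memcpy(buf, p, len);
        return _mm_loadu_si128(reinterpret_cast<const __m128i *>(buf));
    }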
And thanks for your awesome blog posts in general! 🙂
Thanks Marcin. You may not remember, but we once spent a week together in Germany!
Of course, Dagstuhl, one of my fave places in the world 🙂
https://lemire.me/img/news/2018/dagstuhl.18251.04.jpg
I’ve mitigated this problem by manually calling resize() on vectors to enlarge them sufficiently beyond their size such that over-reads are guaranteed to not segfault, but that’s an expensive solution if it causes a large copy.
A better solution seems to be to align every allocation that might be accessed with vectorized loads to the largest vector size that will be used, because then aligned reads of that size are guaranteed not to cross a page boundary. This is easy to do in C with aligned_alloc, but unfortunately pretty un-ergonomic with vectors in C++. C++17’s aligned new helps, but it’s still messy: https://stackoverflow.com/questions/60169819/modern-approach-to-making-stdvector-allocate-aligned-memory
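Something along these lines, for example (a minimal C++17 sketch of such an allocator, my own illustration rather than a battle-tested one):

    #include <cstddef>
    #include <new>
    #include <vector>

    // Minimal allocator that over-aligns every allocation to 64 bytes, so a full
    // 512-bit load starting at a 64-byte-aligned address cannot cross a page.
    template <typename T, std::size_t Alignment = 64>
    struct aligned_allocator {
        using value_type = T;
        aligned_allocator() = default;
        template <typename U>
        aligned_allocator(const aligned_allocator<U, Alignment> &) noexcept {}
        T *allocate(std::size_t n) {
            return static_cast<T *>(
                ::operator new(n * sizeof(T), std::align_val_t(Alignment)));
        }
        void deallocate(T *p, std::size_t) noexcept {
            ::operator delete(p, std::align_val_t(Alignment));
        }
    };
    template <typename T, typename U, std::size_t A>
    bool operator==(const aligned_allocator<T, A> &, const aligned_allocator<U, A> &) { return true; }
    template <typename T, typename U, std::size_t A>
    bool operator!=(const aligned_allocator<T, A> &, const aligned_allocator<U, A> &) { return false; }

    // usage: std::vector<float, aligned_allocator<float>> data(1024);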
I have not considered any potential downsides of having most large allocations aligned to __m512i, if there are any.
Overallocation is certainly a viable approach but you cannot always rely on it being done (by libraries you are using).
Matthew Stoltenberg says:
What I’ve found works really well for me at my current job is to write the main part of the loop as an inline function taking the __m512/svfloat32_t registers and any masks/predicates as arguments. That way I don’t repeat myself between the main part of the loop and the remainder.
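Roughly this shape, for example (a hypothetical AVX-512 sketch of the pattern; process_block and its computation are placeholders):

    #include <immintrin.h>
    #include <cstddef>

    // The loop body lives in one inline function that takes the register and the
    // write mask; the full-width loop and the remainder both call it.
    static inline void process_block(float *out, __m512 x, __mmask16 m) {
        __m512 y = _mm512_mul_ps(x, _mm512_set1_ps(2.0f)); // placeholder computation
        _mm512_mask_storeu_ps(out, m, y);
    }

    void double_all(float *data, std::size_t n) {
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16) {                      // main loop: full mask
            process_block(data + i, _mm512_loadu_ps(data + i), __mmask16(0xFFFF));
        }
        if (i < n) {                                        // remainder: partial mask
            __mmask16 m = __mmask16((1u << (n - i)) - 1);
            process_block(data + i, _mm512_maskz_loadu_ps(m, data + i), m);
        }
    }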
I agree that it seems like a good design.
But doesn’t that assume that the versions with and without _maskz (on x86) are equivalent in performance? Perhaps they are, but I’d double-check 🙂
You should always measure but I believe it is true (so far) and that’s what I claim in the blog post. The computation of the masks is not free, however.
[A little late to the party…] Sometimes those masked loads/stores can also become hot when the number of processed elements is low (<16, so the main loop never executes). I solved such an issue simply by allocating 16 dummy bytes (for SSE) at the end to avoid access violations from crossing a page boundary.
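For instance, something like this (a sketch; make_padded_buffer is a made-up name, and the 16 bytes match the SSE register width):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Over-allocate by one SSE register so an unmasked 16-byte load starting at
    // any valid element stays inside the allocation, even when n < 16.
    std::vector<std::uint8_t> make_padded_buffer(std::size_t n) {
        std::vector<std::uint8_t> buf(n + 16, 0); // last 16 bytes are padding only
        return buf;                               // treat buf[0..n) as the real data
    }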