8th November 2022, 8 min read

Modern vector programming with masked loads and stores

12 thoughts on “Modern vector programming with masked loads and stores”

Markus Schaber says:

November 9, 2022 at 1:02 pm

It seems that the footnote is missing, and another sentence is truncated: “It could also discourage the”.
1. Daniel Lemire says:
  
  November 9, 2022 at 1:15 pm
  
  Thanks. These were additional edits that I forgot to discard.
Marcin Zukowski says:

November 9, 2022 at 8:31 pm

Pretty cool!

Yeah, reading the tail of a stream is often tricky to get both fast and correct. I’ve seen a similar problem in hash functions for short strings and came up with a different approach (probably also applicable here): we read “garbage” memory, but we can guarantee it’s “safe” by checking whether we’re close to the page boundary. I think the t1ha folks included that in their project. See here if you’re curious: https://marcinzukowski.github.io/hash-tailer/.
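
A minimal sketch of the page-boundary check described above, not the code from the linked page. The 4 KiB page size, the constant `kPageSize`, and the helper name `read_stays_in_page` are assumptions made for illustration; portable code would query the real page size at run time. Note that reading past the end of an object is still formally undefined behaviour in C++ and will be flagged by tools such as AddressSanitizer, even when the hardware cannot fault.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 4 KiB pages (the usual default on x86-64).
constexpr std::uintptr_t kPageSize = 4096;

// Returns true if the 'width' bytes starting at p all lie within one page.
// If that holds and the byte at p is readable, a full-width read cannot
// touch an unmapped page, even if it runs past the end of the buffer.
inline bool read_stays_in_page(const void *p, std::size_t width) {
  assert(width > 0 && width <= kPageSize);
  // Offset of p within its page (kPageSize is a power of two).
  const std::uintptr_t offset =
      reinterpret_cast<std::uintptr_t>(p) & (kPageSize - 1);
  return offset + width <= kPageSize;
}
```

A tail handler could take the fast full-width load when the check succeeds and fall back to a masked or byte-wise path otherwise.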

And thanks for your awesome blog posts in general! 🙂
1. Daniel Lemire says:
  
  November 9, 2022 at 8:36 pm
  
  Thanks, Marcin. You may not remember, but we once spent a week together in Germany!
  1. Marcin Zukowski says:
    
    November 9, 2022 at 8:38 pm
    
    Of course, Dagstuhl, one of my fave places in the world 🙂
    https://lemire.me/img/news/2018/dagstuhl.18251.04.jpg
Alex Petty says:

November 9, 2022 at 9:07 pm

I’ve mitigated this problem by manually calling resize() on vectors to enlarge them sufficiently beyond their size such that over-reads are guaranteed to not segfault, but that’s an expensive solution if it causes a large copy.

A better solution seems to be aligning every allocation that might be accessed with vectorized loads to the largest vector size it will be accessed with, because then aligned reads of that size are guaranteed not to cross a page boundary. This is easy to do in C with aligned_alloc, but unfortunately pretty un-ergonomic with vectors in C++. C++17’s aligned new helps, but it’s still messy: https://stackoverflow.com/questions/60169819/modern-approach-to-making-stdvector-allocate-aligned-memory
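
A rough sketch of the aligned-allocation idea, assuming a C++17 standard library that provides `std::aligned_alloc` (MSVC does not). The names `kVectorBytes`, `round_up_to_vector`, and `alloc_for_vectors` are hypothetical. Padding the size to a whole number of 64-byte blocks is an addition beyond what the comment proposes: it satisfies the requirement that the size be a multiple of the alignment, and it also keeps every aligned full-width load inside the allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t kVectorBytes = 64;  // width of one 512-bit register

// Rounds n up to the next multiple of kVectorBytes.
inline std::size_t round_up_to_vector(std::size_t n) {
  return (n + kVectorBytes - 1) / kVectorBytes * kVectorBytes;
}

// Allocates at least n bytes, aligned to 64 bytes and padded to whole
// 64-byte blocks. Release with std::free. Returns nullptr on failure.
inline void *alloc_for_vectors(std::size_t n) {
  const std::size_t padded = round_up_to_vector(n == 0 ? 1 : n);
  assert(padded % kVectorBytes == 0);
  return std::aligned_alloc(kVectorBytes, padded);
}
```

For example, a request for 100 bytes yields a 128-byte block, so a 64-byte aligned load of the second block stays within memory the program owns.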

I have not considered any potential downsides of having most large allocations aligned to __m512i (64 bytes), if there are any.
1. Daniel Lemire says:
  
  November 9, 2022 at 10:02 pm
  
  Overallocation is certainly a viable approach but you cannot always rely on it being done (by libraries you are using).
Matthew Stoltenberg says:

November 10, 2022 at 10:06 pm

What I’ve found works really well for me at my current job is to write the main part of the loop as an inline function taking the __m512/svfloat32_t registers and any mask/predicates as variables. This allows me to not repeat myself for the main part of the loop and the remainders.
1. Daniel Lemire says:
  
  November 10, 2022 at 10:09 pm
  
  I agree that it seems like a good design.
  1. Marcin Zukowski says:
    
    November 11, 2022 at 12:43 am
    
    But doesn’t that assume that the versions with and without _maskz (on x86) are equivalent in performance? Perhaps they are, but I’d double check 🙂
    1. Daniel Lemire says:
      
      November 11, 2022 at 1:00 am
      
      You should always measure but I believe it is true (so far) and that’s what I claim in the blog post. The computation of the masks is not free, however.
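
To make the cost of the mask computation concrete, here is one possible way to build the mask for a tail of fewer than 64 bytes. It is a sketch in plain integer arithmetic with a hypothetical helper name, `tail_mask`, and it is not necessarily the exact code used in the post. On a compiler targeting AVX-512BW, the resulting 64-bit value can serve as the `__mmask64` argument of a masked byte load such as `_mm512_maskz_loadu_epi8`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns a mask whose low 'remaining' bits are set: bit i is set
// exactly when byte i of the final, partial block is valid.
inline std::uint64_t tail_mask(std::size_t remaining) {
  assert(remaining <= 64);
  // Shifting a 64-bit value by 64 is undefined, so treat a full block
  // as a special case.
  return remaining >= 64 ? ~std::uint64_t{0}
                         : (std::uint64_t{1} << remaining) - 1;
}
```

The shift and subtraction are cheap but not free, and they sit on the dependency path of the final masked load.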
Denis Bakhvalov says:

November 28, 2022 at 4:54 pm

[A little late to the party…] Sometimes those masked loads/stores can also become hot if the number of processed elements is low (<16, so the main loop never executes). I solved such an issue simply by allocating 16 dummy bytes (for SSE) at the end of the buffer to avoid access violations from crossing page boundaries.