Daniel Lemire's blog

, 9 min read

Trimming spaces from strings faster with SVE on an Amazon Graviton 3 processor

13 thoughts on “Trimming spaces from strings faster with SVE on an Amazon Graviton 3 processor”

  1. Bill says:

    I wonder if it would speed up Unicode whitespace removal or could be twisted into removing all whitespace type of characters (VT, HT, NL, CR, …)?

    1. It can be generalized, yes. Note that the functions described in the blog post will work correctly with UTF-8 inputs.

  2. Evan says:

    The documentation I have seen shows that SVE does not need a drain operation, it can instead be predicated to do a partial operation for the remainder elements.

    See slides 32 and 33 of this SVE presentation for an example of what I mean.

    Is there a specific reason why you coded it do work in two steps? I could see this being a naive translation from SIMD (NEON) style to SVE, but I don’t think it is leveraging the full benefits of SVE predication.

    1. See slides 32 and 33 of this SVE presentation for an example of what I mean.

      If you write it in this manner, you will have an additional intrinsic function per loop (and the accompanying instruction) compared to my solution (focus on the main loop).

      Is there a specific reason why you coded it do work in two steps?

      For better performance. I have added the single-loop function to my benchmark, you can test it out and verify that it is slower.

      1. Evan says:

        Thanks for the response, I suspected that may be the reason why.

        What if you use the optimization given on slide 34 to elide the i < len comparison? Does it catch back up? Sorry I do not have access right now to test it myself 🙂

        1. I have added the one loop alternative you pointed to. The performance is slightly improved but still very much inferior to the two-loop approach.

          I should point out that my expectation is that my implementation with a loop and a trail is probably too simple and could be optimized further (at the cost of greater code complexity).

          scalar
          cycles/bytes 1.75
          

          sve one loop cycles/bytes 0.70

          sve one loop alt cycles/bytes 0.66

          sve cycles/bytes 0.47

          So, unfortunately, the pretty code that ARM is displaying on their slides is likely not optimal.

          Of course, this could change with future compilers and/or processors. For example, the compiler could do the unrolling for you.

      2. Evan says:

        Also sorry for saying it was a somehow bad or naive translation from neon, clearly not the case.

        Enjoyed the post, thanks for sharing.

        1. Also sorry for saying it was a somehow bad or naive translation from neon, clearly not the case.

          You might be interested in going back in time to how I started with SVE:
          https://lemire.me/blog/2022/06/23/filtering-numbers-quickly-with-sve-on-amazon-graviton-3-processors/

          You might notice that my early code matched the ARM slides more than my recent code.

          So, in some sense, what looked to you like “naive” code is, from my point of view, more sophisticated (optimized) code…i.e., it will run faster.

  3. Evan says:

    Also why does the second operation use svcmpeq while the loop uses svcmpne?

    1. A typographical error. Thanks for pointing it out.

  4. -.- says:

    Do you have any speed comparisons against NEON versions of the earlier despacer?

    1. My source code online includes the NEON version. The performance is comparable. However, the NEON version requires a large table.

      1. -.- says:

        Thanks for the info!