Daniel Lemire's blog

Avoid character-by-character processing when performance matters

20 thoughts on “Avoid character-by-character processing when performance matters”

  1. Marcin Buchwald says:

    You can eliminate the second loop: check if i > 0 and, if so, slice off the last 8 bytes and OR them into the result.

    1. Care to share your code for the “slice last 8 bytes” part?

      1. Guillaume Matte says:

        For the left-over string (smaller than 8 characters but greater than 0), still do a memcpy into a zeroed 64-bit unsigned integer, just using the left-over length as the memcpy length parameter. Your check against the 64-bit mask will still be right, because the memcpy won’t touch the zeroed bytes of the payload that lie beyond that length.
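
        A minimal sketch of that idea as a small helper (the name or_tail and the parameter i, the index where the 8-byte loop stopped, are assumptions, not from the comment):

        #include <cstdint>
        #include <cstring>
        #include <string_view>

        // Gather the last few (0 to 7) left-over bytes into a zeroed 64-bit word;
        // the untouched high bytes stay zero, so the usual 0x8080808080808080
        // mask test is unaffected.
        inline uint64_t or_tail(std::string_view v, size_t i) {
            uint64_t tail = 0;
            memcpy(&tail, v.data() + i, v.size() - i);
            return tail;
        }

        The caller would then do running |= or_tail(v, i); in place of the byte-by-byte second loop.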

        1. Travis Downs says:

          A variable-length memcpy isn’t much better than this short loop: internally it may use a loop or some sort of indirect branch dispatching to a copy routine specialized for the length.

          I think a better approach, if you expect the total input length to usually be at least 8, is to do a final length 8 memcpy, but aligned to the end of the string. If the length of the string wasn’t a multiple of 8, this checks some characters “twice” but this check is very fast in any case.

          With this change you can also change the primary loop condition to i + 8 < v.size(), since the final 1 to 8 bytes can be handled by this last check, and you’d rather do it unconditionally (at least for the v.size() >= 8 case) to avoid an extra branch misprediction.
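
          A sketch of that shape, assuming v.size() >= 8 (the function name is made up; shorter inputs still need the zeroed-word copy described above):

          #include <cstdint>
          #include <cstring>
          #include <string_view>

          // The main loop stops one word early; the unconditional final load is
          // aligned to the end of the string and may re-check up to 7 bytes,
          // which is harmless.
          inline bool is_ascii_overlapping_tail(std::string_view v) {
              uint64_t running = 0;
              uint64_t payload;
              size_t i = 0;
              for (; i + 8 < v.size(); i += 8) {            // strictly less than
                  memcpy(&payload, v.data() + i, 8);
                  running |= payload;
              }
              memcpy(&payload, v.data() + v.size() - 8, 8); // last 8 bytes, end-aligned
              running |= payload;
              return (running & 0x8080808080808080ULL) == 0;
          }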

  2. Stuart Dootson says:

    Something like this, I’d presume…

    if (v.size() > 8) {
        memcpy(&payload, v.data() + v.size() - 8, 8);
        running |= payload;
    }

    But then you’re still left with cases where v.size() < 8, where you’d need the second loop anyway…

    1. Nick Everitt says:

      I wonder if it would be an improvement to switch on the number of remaining characters rather than use a second loop (assuming that string length is evenly distributed).

      1. Travis Downs says:

        The way Stuart suggests is, I think, almost strictly better than the loop or the switch: both of those involve an unpredictable branch for unpredictable sizes. Between the two, the switch is probably somewhat faster, at the cost of larger code size.

        Stuart’s approach only involves an unpredictable branch in the case the size varies between less than 8 and greater than 8 unpredictably, which seems much rarer and as a binary predicate would be somewhat predictable in almost any non-contrived case. If you really want to get rid of this final source of unpredictability, you can use a masked load.
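
        For instance, with AVX-512 (BW/VL) a byte-granular masked load can fetch exactly the remaining bytes without a branch or a second loop. This is only a sketch of that option; the function name is made up and the instruction-set choice is an assumption, since the comment doesn’t name one:

        #include <immintrin.h>
        #include <cstddef>

        // Requires AVX-512BW + AVX-512VL; assumes len <= 16. Masked-off lanes are
        // never read from memory and come back as zero.
        inline bool tail_is_ascii_masked(const char* p, size_t len) {
            __mmask16 mask = (__mmask16)((1u << len) - 1);
            __m128i bytes = _mm_maskz_loadu_epi8(mask, p);
            return _mm_movemask_epi8(bytes) == 0;  // true if no byte has its high bit set
        }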

  3. Dr. Guy Gordon says:

    After each check for ASCII, double the number of reads before the next check.
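
    One way to read that suggestion, as a sketch built on the post’s 8-byte memcpy trick (the doubling schedule is the only new part, and the function name is made up):

    #include <cstdint>
    #include <cstring>
    #include <string_view>

    // OR one block, test it, then double the block size before the next test:
    // non-ASCII input near the front exits early, while a long all-ASCII input
    // is tested only a logarithmic number of times.
    inline bool is_ascii_doubling(std::string_view v) {
        uint64_t running = 0;
        uint64_t payload;
        size_t i = 0;
        size_t block = 8;                    // bytes to accumulate before the next test
        while (i + 8 <= v.size()) {
            size_t end = i + block;
            if (end > v.size()) end = v.size();
            for (; i + 8 <= end; i += 8) {
                memcpy(&payload, v.data() + i, 8);
                running |= payload;
            }
            if (running & 0x8080808080808080ULL) return false;
            block *= 2;
        }
        for (; i < v.size(); i++) {          // left-over bytes
            running |= (uint8_t)v[i];
        }
        return (running & 0x8080808080808080ULL) == 0;
    }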

  4. Alex says:

    This is a fun exercise, but does this ever occur in real life? I can’t say I’ve ever needed to test a large block of arbitrary bytes to determine if they were ASCII.

    1. This blog post was motivated by an actual programming problem that someone encountered while trying to optimize some code.

    2. RobIII says:

      I can’t say I’ve ever needed to test a large block of arbitrary bytes to determine if they were ASCII.

      I have. I do, or did, on a regular basis. You could consider it a ‘large block of arbitrary bytes’, but also lots and lots of small blocks (strings) of arbitrary bytes. They promise some field should be ASCII (it’s actually called “AsciiName”) for old(er) systems that don’t support anything but ASCII, yet non-ASCII values keep turning up in their exports, up to this day. That is exactly why I based my .Net benchmarks of this post’s methods on that exact data.

      1. Alex says:

        It sounds like the real problem you’re facing is that you’re trying to maintain a ‘database’ which has no schema enforcement (giant binary file) and no access control (unknown users can edit it at any time).

        Given those circumstances, writing a program to parse hundreds of megabytes to check for the high-bit being set can never be a complete or reliable solution. (No matter how fast you can check it, there will always be a race condition, as you note.) Wouldn’t a better solution be to create an interface to this data which enforced the desired schema during edits? You only need to check the diffs.

        No other system I use — and certainly none which cares about performance or security — uses the approach of letting anyone make arbitrary edits to raw data at any time, and then trying to verify the entire database after the fact. I don’t run a script every night to make sure all my JPEGs are still JPEGs, and an unknown user didn’t log in and put bogus data in them. The closest I can think of is Wikipedia, and even they don’t analyze hundreds of megabytes on every edit.

        I’d say: “Avoid doing O(n) work unnecessarily when performance matters”. You can justify all sorts of craziness if you begin by using the wrong data structure. Unfixable legacy system designs are one possible use of this, true, but I wouldn’t say they’re common.

        1. RobIII says:

          Except that I have no control over the data (it’s third-party data). I have told them on several occasions that their “ASCII” input validation is broken, but there’s nothing I can do to fix it; all you can do is download dumps.

          And of course you can create (or download) diffs, and of course you can do all kinds of smart optimisations. That’s all beside the point. If you have to do this, for whatever reason, then it’s nice to know what works. Hell, even if it’s just for fun. Because we can. Just because it’s beyond your imagination doesn’t mean it’s not real.

    3. Peter Dimov says:

      Yes, it does. If you want to validate UTF-8, and your strings are almost always ASCII, it’s much faster to check for ASCII first and skip validation if so.
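
      That fast path is just a guard in front of a full validator; a minimal sketch (both function names here are placeholders, not any particular library’s API):

      #include <string_view>

      bool is_ascii(std::string_view v);            // a fast check such as the one in the post
      bool validate_utf8_full(std::string_view v);  // placeholder for a real UTF-8 validator

      bool validate_utf8(std::string_view v) {
          if (is_ascii(v)) return true;    // pure ASCII is always valid UTF-8
          return validate_utf8_full(v);    // fall back to the full check otherwise
      }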

  5. Mike Brown says:

    Yes, treating the last 8 bytes as a single load and overlapping a few already-checked bytes is quicker in my tests. Here is a version that also handles the < 8 bytes case.

    #include <cstdint>
    #include <cstring>
    #include <string_view>

    inline bool is_ascii_branchless2(const std::string_view v) {
        uint64_t running = 0;
        uint64_t payload = 0;

        if (v.size() < 8) {
            // Short string: copy it into a zeroed word; the untouched high bytes stay zero.
            memcpy(&payload, v.data(), v.size());
            running = payload;
        } else {
            size_t d = (v.size() >> 3); // number of full 8-byte words
            // OR together the full 8-byte words (memcpy sidesteps unaligned reads).
            for (size_t i = 0; i < d; i++) {
                memcpy(&payload, v.data() + i * 8, 8);
                running |= payload;
            }
            // Final 8 bytes, aligned to the end of the string; this may overlap
            // bytes that were already checked, which is harmless.
            memcpy(&payload, v.data() + v.size() - 8, 8);
            running |= payload;
        }

        return ((running & 0x8080808080808080ULL) == 0);
    }

  6. RobIII says:

    I have created a .Net version to be able to compare and benchmark the methods in .Net.

  7. Sven Kautlenbach says:

    That line where the character data is memcpy-ed:

    memcpy(&payload, v.data() + i, 8);

    The ‘payload’ variable might pick up whatever data comes after the buffer if the memcpy transfers 8 bytes every time. In your test examples you used an input of 1000000 chars, whose length % 8 == 0.

    1. Alex says:

      That’s why it’s inside a loop with condition “i + 8 <= v.size()”, to protect from ever over-reading the buffer.

      1. Sven Kautlenbach says:

        Yep, my bad. I was thinking of two things at once, over-reading and under-reading, which is why I misunderstood it at first. Thanks.

  8. Kristian Meyer says:

    Thanks for this post; I enjoy the examples of optimisations on this blog.

    If you try a mix of ASCII and non-ASCII strings, the branchless approach still performs well when a high percentage of strings are ASCII. I found the crossover between the branchless and nearly branchless approaches to be at about 83% ASCII strings for this benchmark.

    To test that, I replaced the following on line 107:

    buffer[i] = (rand() % 128);

    With the following, to get a mix of non-ASCII characters within the generated strings:

    if (rand() % 160 == 0)
      buffer[i] = (rand() % 256);
    else
      buffer[i] = (rand() % 128);

    I get the following output on an i7-3770 with gcc 9.3.0:

    Creating 1000000 strings
    branchy:1.56017 GB/s
    827420
    branchless:3.32739 GB/s
    827420
    nearly branchless:3.32863 GB/s
    827420