Daniel Lemire's blog

, 18 min read

Serializing IPs quickly in C++

22 thoughts on “Serializing IPs quickly in C++”

  1. pg says:

    There’s a copy/paste bug on lines 261 and 262 in str.cpp.

  2. pg says:

    Faster C code:

    std::string ipv53(const uint64_t address) noexcept {
    std::string output(4 * 3 + 3, ‘\0’);
    char *p = output.data();

    uint32_t a = __builtin_bswap32(address);

    for (int i = 0; i >= 8;
    }
    –p;

    output.resize(p – output.data());
    return output;
    }

  3. Sean Hoffman says:

    The uglier version might be faster still if you reduced the calls to output.data() and stored them in a local pointer rather than make 3 separate calls to the same function in scope.

  4. Marcin Zukowski says:

    Hi Daniel, As always, you take on interesting challenges. First, a minor fix, your code for ipv41 is incorrect, check this out – output.append(std::to_string((address >> (i * 8)) % 256) + “.”); + output.append(“.” + std::to_string((address >> (i * 8)) % 256) ); Now, I took a quick look, and another idea came to me. All “parts” of the IP are at most 4 bytes, so instead of working with strings, you can work with integers. The code below on my notebook is 3x faster than the fastest previous version by Dimov from your repo. Hopefully useful!

    std::string ipv52(const uint64_t address) noexcept {
    std::string output(4 * 3 + 3, ‘\0’);
    char *p = output.data();
    p = to_chars_52(p, uint8_t(address >> 24));
    *p++ = ‘.’;
    p = to_chars_52(p, uint8_t(address >> 16));
    *p++ = ‘.’;
    p = to_chars_52(p, uint8_t(address >> 8));
    *p++ = ‘.’;
    p = to_chars_52(p, uint8_t(address >> 0));
    output.resize(p – output.data());
    return output;
    }
    class Foo {
    public:
    Foo() {
    for (int i = 0; i < 256; i++) {
    std::string s0 = std::to_string(i);
    std::string s1 = std::string(“.”) + s0;
    v0[i] = *(int*)s0.data();
    v1[i] = *(int*)s1.data();
    l0[i] = (i < 10) ? 1 : (i < 100 ? 2 : 3);
    l1[i] = l0[i] + 1;
    }
    }
    int v0[256]; // value for the 0th byte
    int v1[256]; // value for the 1+ byte
    int l0[256]; // length for the 0th byte
    int l1[256]; // length for the 1+ byte
    };
    Foo foo;
    std::string ipv61(const uint64_t address) noexcept {
    std::string output(4 * 3 + 3, ‘\0’);
    char *point = output.data();
    uint8_t by;
    by = address >> 24;
    *(int*)point = foo.v0[by];
    point += foo.l0[by];
    by = address >> 16;
    *(int*)point = foo.v1[by];
    point += foo.l1[by];
    by = address >> 8;
    *(int*)point = foo.v1[by];
    point += foo.l1[by];
    by = address >> 0;
    *(int*)point = foo.v1[by];
    point += foo.l1[by];
    output.resize(point – output.data());
    return output;
    }

  5. Marcin Zukowski says:

    Something went wrong in copy/paste, trying again

    std::string ipv52(const uint64_t address) noexcept {
    std::string output(4 * 3 + 3, ‘\0’);
    char *p = output.data();
    p = to_chars_52(p, uint8_t(address >> 24));
    *p++ = ‘.’;
    p = to_chars_52(p, uint8_t(address >> 16));
    *p++ = ‘.’;
    p = to_chars_52(p, uint8_t(address >> 8));
    *p++ = ‘.’;
    p = to_chars_52(p, uint8_t(address >> 0));
    output.resize(p – output.data());
    return output;
    }
    class Foo {
    public:
    Foo() {
    for (int i = 0; i < 256; i++) {
    std::string s0 = std::to_string(i);
    std::string s1 = std::string(“.”) + s0;
    v0[i] = *(int*)s0.data();
    v1[i] = *(int*)s1.data();
    l0[i] = (i < 10) ? 1 : (i < 100 ? 2 : 3);
    l1[i] = l0[i] + 1;
    }
    }
    int v0[256]; // value for the 0th byte
    int v1[256]; // value for the 1+ byte
    int l0[256]; // length for the 0th byte
    int l1[256]; // length for the 1+ byte
    };
    Foo foo;
    std::string ipv61(const uint64_t address) noexcept {
    std::string output(4 * 3 + 3, ‘\0’);
    char *point = output.data();
    uint8_t by;
    by = address >> 24;
    *(int*)point = foo.v0[by];
    point += foo.l0[by];
    by = address >> 16;
    *(int*)point = foo.v1[by];
    point += foo.l1[by];
    by = address >> 8;
    *(int*)point = foo.v1[by];
    point += foo.l1[by];
    by = address >> 0;
    *(int*)point = foo.v1[by];
    point += foo.l1[by];
    output.resize(point – output.data());
    return output;
    }

  6. Marcin Zukowski says:

    Argh, copy-pasting somehow doesn’t work.
    Created a PR here: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/pull/70

    BTW, you can extend my idea probably by combining two integers into a long and one write, not sure if it’s worth optimizing further though.

    1. WordPress insists on stripping codes. Sorry about that. I have had several fixes over the year for this issue, but it stops working with new WordPress updates.

    2. pg says:

      Hi Marcin,

      I made a vectorized version that uses similar ideas to yours: stick the bytes in a u32x4 vector, and do a gather_epi32 from a table. Then remove the 0 bytes with _mm_shuffle_epi8 (like in https://lemire.me/blog/2017/01/20/how-quickly-can-you-remove-spaces-from-a-string/).

      In the end it’s pretty much a wash with your version.

      std::string output(4 * 3 + 3, '\0');
      char *p = output.data();

      __m128i bytes = _mm_cvtepu8_epi32(_mm_set1_epi32(__builtin_bswap32(address)));
      __m128i strs = _mm_i32gather_epi32(table, bytes, 4);
      __m128i zeroes = _mm_cmpeq_epi8(strs, _mm_set1_epi32(0));
      int mask16 = _mm_movemask_epi8(zeroes);
      __m128i x = _mm_shuffle_epi8(strs, _mm_loadu_si128((const __m128i *)despace_mask16 + (mask16 & 0x7fff)));

      _mm_storeu_si128((__m128i *)p, x);

      output.resize(16 - _mm_popcnt_u32(mask16) - 1);

      return output;

      1. Marcin Zukowski says:

        Cool. For such small tasks SIMD might be an overkill, but it’s great to see that it works. You should send a PR to Daniel. Unfortunately, don’t have an x86 machine around to test it.

        Another interesting fun task would be to write a non-scalar function (“vectorized” in database terms, not like SIMD), which would convert many ints into many strings. Wonder if there are some opportunities there, and if that could be much faster than N * time_of_scalar

  7. Anonymous says:

    What about std::stringstream? Is it too slow and thus not worth considering?

    1. Added to the benchmark:

      Time per string in ns.
      stringstream        350.224
      std::to_string (1)  78.8603
      std::to_string (2)  79.012
      blog post (similar) 16.5528
      blog post           16.2065
      Dimov 1             10.8528
      Dimov 2             16.7799
      Zukowski            3.86899
      Zukowski by oc      3.89452
      thin table          4.16718
      combined table      2.99737
      
    2. Marcin Zukowski says:

      About the std::stringstream suggestion – I learned the hard way a while back that it should not be used anywhere performance critical. It looks at locale settings, which might result in things like environment varialbe lookups, mutexes etc. Terrible? Yes. Since then I am on a mission to educate people to NOT USE std::stringstream 😀

      You can work around this with static variables etc, but it’s not worth the complexity.

  8. foobar says:

    It would also be plausible to pack the string (including the ending/preceding period) in a 32-bit integer and one should be able to extract the string with an AND, and the length with an unsigned fixed shift right, which would allow using only one load operation per input byte. Two top bits of all string characters are zero; just put length bits on two topmost bits of the integer…

    With high ALU parallelism it might competitive with two-loads-and-no-arithmetic versions – also, it should be vectorisable. I suspect I’m too lazy to implement it, at least today.

    1. foobar says:

      It should be plausible to pack both the string and the length in 32-bit integer is what I meant…

    2. foobar says:

      Merge request is in, and this “combined table” beats others on Mac M1; no idea how it fares on x86.

      $ ./str
      Running tests.
      Functions are ok.
      Time per string in ns.
      std::to_string (1) 85.8071
      std::to_string (2) 90.9182
      blog post (similar) 18.6132
      blog post 18.5872
      Dimov 1 11.8821
      Dimov 2 18.5636
      Zukowski 4.15277
      Zukowski by oc 4.22096
      thin table 4.54168
      combined table 3.78396

      1. foobar says:

        And with a minor modification, I was able to push this to 3.3 ns on Mac M1. I wonder how large a portion of that is actually the call to string allocation routine?

        1. It is a good question.

          1. foobar says:

            I believe the std::string calls (inlined or not) are a significant source of overhead when single-iteration run time of a function, like ipv81, is around 3 nanoseconds. Just eliminating single store saved 6% in run time, and in comparison to hypothetical plain C string variant, the code zeroes the string unnecessarily (optimally one or two stores, depending on microarchitectural details), and does whatever the constructor does, at best (if worse, it could do a lot more), which is pretty certainly more than just a little, since this code doesn’t get inlined.

            In theory compilers could do better, but apparently not in practice…

          2. foobar says:

            I rewrote ipv81 with plain C, writing a zero-terminated string to a fixed, pre-allocated buffer (and forcing compiler to run all iterations properly, I did check the assembler output). My M1 succeeded running this routine at 1.19 ns/iteration, while earlier variations were 3.04-3.3 ns depending on slight variations in code. So, vast majority of benchmarked time on the fastest variations is probably “overhead.”

            1. foobar says:

              … and that can be still pushed further, to 1.02 ns, of which probably quarter a nanosecond goes to looping. If lookup table entries are made 64-bit the masking of the string can be avoided, and this allows dropping per-iteration time to 0.88 ns, but I highly doubt that makes much sense in the real world.

              So, with hot caches one can achieve roughly one-nanosecond (3.25 cycle) throughput on IPv4 string generation, at least on amenable microarchitectures.

              1. David Fetter says:

                Did you happen to keep this code around?

                1. foobar says:

                  I don’t think I have it on version control, but there wasn’t really anything particularly fancy; just handle strings as static C char arrays when possible, and avoid dynamic allocation in inner loops. My general, C++ strings oriented changes are in the repo, though.

                  One thing I noticed was that these routines didn’t really translate that well to x86. Certain idioms of ARM assembler such as barrel shifters integrated to other instructions made the implementation significantly more efficient on ARM than x86.