The uglier version might be faster still if you reduced the calls to output.data() and stored the result in a local pointer rather than making three separate calls to the same function in the same scope.
Marcin Zukowski says:
Hi Daniel, as always, you take on interesting challenges. First, a minor fix: your code for ipv41 is incorrect. Check this out:
- output.append(std::to_string((address >> (i * 8)) % 256) + ".");
+ output.append("." + std::to_string((address >> (i * 8)) % 256));
Now, I took a quick look, and another idea came to me. All "parts" of the IP fit in at most 4 bytes, so instead of working with strings, you can work with integers. The code below is, on my notebook, 3x faster than the fastest previous version by Dimov from your repo. Hopefully useful!
std::string ipv52(const uint64_t address) noexcept {
  std::string output(4 * 3 + 3, '\0');
  char *p = output.data();
  p = to_chars_52(p, uint8_t(address >> 24));
  *p++ = '.';
  p = to_chars_52(p, uint8_t(address >> 16));
  *p++ = '.';
  p = to_chars_52(p, uint8_t(address >> 8));
  *p++ = '.';
  p = to_chars_52(p, uint8_t(address >> 0));
  output.resize(p - output.data());
  return output;
}
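ipv52 relies on a to_chars_52 helper from the repo that is not shown in this comment. A minimal stand-in (my sketch, not necessarily the repo's implementation) writes a byte in decimal and returns the advanced pointer:

```cpp
#include <cstdint>

// Hypothetical stand-in for the repo's to_chars_52: writes v in decimal
// at p and returns a pointer just past the last digit written.
char *to_chars_52(char *p, uint8_t v) {
  if (v >= 100) { *p++ = char('0' + v / 100); }
  if (v >= 10)  { *p++ = char('0' + (v / 10) % 10); }
  *p++ = char('0' + v % 10);
  return p;
}
```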
class Foo {
public:
  Foo() {
    for (int i = 0; i < 256; i++) {
      std::string s0 = std::to_string(i);
      std::string s1 = std::string(".") + s0;
      v0[i] = *(int *)s0.data();
      v1[i] = *(int *)s1.data();
      l0[i] = (i < 10) ? 1 : (i < 100 ? 2 : 3);
      l1[i] = l0[i] + 1;
    }
  }
  int v0[256]; // value for the 0th byte
  int v1[256]; // value for the 1+ byte
  int l0[256]; // length for the 0th byte
  int l1[256]; // length for the 1+ byte
};
Foo foo;
std::string ipv61(const uint64_t address) noexcept {
  std::string output(4 * 3 + 3, '\0');
  char *point = output.data();
  uint8_t by;
  by = address >> 24;
  *(int *)point = foo.v0[by];
  point += foo.l0[by];
  by = address >> 16;
  *(int *)point = foo.v1[by];
  point += foo.l1[by];
  by = address >> 8;
  *(int *)point = foo.v1[by];
  point += foo.l1[by];
  by = address >> 0;
  *(int *)point = foo.v1[by];
  point += foo.l1[by];
  output.resize(point - output.data());
  return output;
}
WordPress insists on stripping code. Sorry about that. I have had several fixes over the years for this issue, but they stop working with new WordPress updates.
Cool. For such small tasks SIMD might be overkill, but it’s great to see that it works. You should send a PR to Daniel. Unfortunately, I don’t have an x86 machine around to test it.
Another interesting fun task would be to write a non-scalar function (“vectorized” in database terms, not SIMD) that converts many ints into many strings. I wonder if there are opportunities there, and whether it could be much faster than N * time_of_scalar.
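As a rough illustration of what such a batched interface could look like, here is a sketch that amortizes allocation by writing every string into one flat arena; ipv_batch is a made-up name, and snprintf is used only to keep the sketch short, not for speed:

```cpp
#include <cstdint>
#include <cstdio>
#include <string_view>
#include <vector>

// Convert a whole batch of addresses into one flat character arena,
// returning views into it; one allocation serves the entire batch.
std::vector<std::string_view> ipv_batch(const std::vector<uint32_t> &in,
                                        std::vector<char> &arena) {
  arena.resize(in.size() * 16); // worst case: 15 chars + terminator
  std::vector<std::string_view> out;
  out.reserve(in.size());
  char *p = arena.data();
  for (uint32_t a : in) {
    int n = std::snprintf(p, 16, "%u.%u.%u.%u", unsigned(a >> 24),
                          unsigned((a >> 16) & 0xFF), unsigned((a >> 8) & 0xFF),
                          unsigned(a & 0xFF));
    out.emplace_back(p, size_t(n));
    p += 16; // fixed stride: every entry gets its own 16-byte slot
  }
  return out;
}
```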
Anonymous says:
What about std::stringstream? Is it too slow and thus not worth considering?
Added to the benchmark:
Time per string in ns.
stringstream 350.224
std::to_string (1) 78.8603
std::to_string (2) 79.012
blog post (similar) 16.5528
blog post 16.2065
Dimov 1 10.8528
Dimov 2 16.7799
Zukowski 3.86899
Zukowski by oc 3.89452
thin table 4.16718
combined table 2.99737
Marcin Zukowski says:
About the std::stringstream suggestion – I learned the hard way a while back that it should not be used anywhere performance-critical. It looks at locale settings, which can result in things like environment variable lookups, mutexes, etc. Terrible? Yes. Since then I am on a mission to educate people to NOT USE std::stringstream 😀
You can work around this with static variables etc., but it’s not worth the complexity.
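For context, the std::stringstream variant being benchmarked presumably looks something like this minimal sketch (my reconstruction, not the actual benchmark code; ipv_stream is a made-up name):

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Naive stringstream-based conversion: the slow variant under discussion.
std::string ipv_stream(uint32_t address) {
  std::ostringstream ss;
  ss << (address >> 24) << '.' << ((address >> 16) & 0xFF) << '.'
     << ((address >> 8) & 0xFF) << '.' << (address & 0xFF);
  return ss.str();
}
```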
foobar says:
It would also be plausible to pack the string (including the ending/preceding period) into a 32-bit integer; one should be able to extract the string with an AND and the length with a fixed unsigned shift right, which would allow using only one load operation per input byte. The two top bits of all string characters are zero, so just put the length bits in the two topmost bits of the integer…
With high ALU parallelism it might be competitive with the two-loads-and-no-arithmetic versions – also, it should be vectorisable. I suspect I’m too lazy to implement it, at least today.
foobar says:
It should be plausible to pack both the string and the length into a 32-bit integer, is what I meant…
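A sketch of this packed idea, assuming little-endian byte order; the table layout, the names packed and ipv_packed, and the exact bit placement are my guesses at the described scheme, not the eventual PR code. Each entry stores '.' plus up to three digits in the low 30 bits, and the length (minus 2) in the top two bits, which are free because '.', '0'..'9' are all below 0x40:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Packed table: for byte value i, the text ".<i>" sits in the low bytes
// and (length - 2) occupies the two topmost bits of the 32-bit entry.
static uint32_t packed[256];

static void init_packed() {
  for (int i = 0; i < 256; i++) {
    char buf[8] = {0};
    int len = std::snprintf(buf, sizeof(buf), ".%d", i); // 2..4 chars
    uint32_t v;
    std::memcpy(&v, buf, 4);                 // little-endian layout assumed
    packed[i] = (v & 0x3FFFFFFFu) | (uint32_t(len - 2) << 30);
  }
}

std::string ipv_packed(uint32_t address) {
  char buf[20];
  char *p = buf;
  for (int shift = 24; shift >= 0; shift -= 8) {
    uint32_t e = packed[(address >> shift) & 0xFF];
    uint32_t chars = e & 0x3FFFFFFFu; // one AND recovers the characters
    std::memcpy(p, &chars, 4);        // always store 4 bytes; tail gets overwritten
    p += (e >> 30) + 2;               // one shift recovers the length
  }
  return std::string(buf + 1, size_t(p - buf - 1)); // drop the leading '.'
}
```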
foobar says:
Merge request is in, and this “combined table” beats others on Mac M1; no idea how it fares on x86.
$ ./str
Running tests.
Functions are ok.
Time per string in ns.
std::to_string (1) 85.8071
std::to_string (2) 90.9182
blog post (similar) 18.6132
blog post 18.5872
Dimov 1 11.8821
Dimov 2 18.5636
Zukowski 4.15277
Zukowski by oc 4.22096
thin table 4.54168
combined table 3.78396
foobar says:
And with a minor modification, I was able to push this to 3.3 ns on Mac M1. I wonder how large a portion of that is actually the call to the string allocation routine?
I believe the std::string calls (inlined or not) are a significant source of overhead when the single-iteration run time of a function like ipv81 is around 3 nanoseconds. Just eliminating a single store saved 6% of the run time. Compared to a hypothetical plain-C-string variant, this code zeroes the string unnecessarily (optimally one or two stores, depending on microarchitectural details) and, at best, does whatever the constructor does (if worse, it could do a lot more), which is pretty certainly more than just a little, since this code doesn’t get inlined.
In theory compilers could do better, but apparently not in practice…
foobar says:
I rewrote ipv81 in plain C, writing a zero-terminated string to a fixed, pre-allocated buffer (and forcing the compiler to run all iterations properly – I did check the assembler output). My M1 ran this routine at 1.19 ns/iteration, while the earlier variations took 3.04–3.3 ns depending on slight variations in the code. So the vast majority of the benchmarked time in the fastest variations is probably “overhead.”
foobar says:
… and that can still be pushed further, to 1.02 ns, of which probably a quarter of a nanosecond goes to looping. If the lookup table entries are made 64-bit, the masking of the string can be avoided, which drops per-iteration time to 0.88 ns, but I highly doubt that makes much sense in the real world.
So, with hot caches one can achieve roughly one-nanosecond (3.25 cycle) throughput on IPv4 string generation, at least on amenable microarchitectures.
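A guess at what the plain-C shape with 64-bit table entries might look like (the names wide, init_wide, and ipv_c and the exact layout are mine, and little-endian byte order is assumed; this is not foobar's actual code). The 8-byte store needs no masking because whatever lands past the segment is overwritten by the next segment or cut off by the terminator:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// 64-bit entries: ".<i>" in the low four bytes, segment length in byte 4.
static uint64_t wide[256];

static void init_wide(void) {
  for (int i = 0; i < 256; i++) {
    char t[8] = {0};
    int len = std::snprintf(t, sizeof(t), ".%d", i); // 2..4 chars incl. the dot
    uint64_t v;
    std::memcpy(&v, t, 8);              // little-endian layout assumed
    wide[i] = v | (uint64_t)len << 32;  // byte 4 of v is zero, so this is clean
  }
}

// Writes the dotted quad into buf (needs >= 24 bytes of slack) and
// returns a pointer to the zero-terminated string inside it.
const char *ipv_c(uint32_t address, char *buf) {
  char *p = buf;
  for (int shift = 24; shift >= 0; shift -= 8) {
    uint64_t e = wide[(address >> shift) & 0xFF];
    std::memcpy(p, &e, 8);              // unmasked 8-byte store
    p += (e >> 32) & 0xFF;              // length lives in byte 4
  }
  *p = '\0';                            // cut off the trailing junk
  return buf + 1;                       // skip the leading '.'
}
```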
David Fetter says:
Did you happen to keep this code around?
foobar says:
I don’t think I have it in version control, but there wasn’t really anything particularly fancy; just handle strings as static C char arrays when possible, and avoid dynamic allocation in inner loops. My general C++-strings-oriented changes are in the repo, though.
One thing I noticed was that these routines didn’t translate that well to x86. Certain idioms of ARM assembly, such as barrel shifters integrated into other instructions, made the implementation significantly more efficient on ARM than on x86.
There’s a copy/paste bug on lines 261 and 262 in str.cpp.
Argh, copy-pasting somehow doesn’t work.
Created a PR here: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/pull/70
BTW, you can extend my idea probably by combining two integers into a long and one write, not sure if it’s worth optimizing further though.
Hi Marcin,
I made a vectorized version that uses similar ideas to yours: stick the bytes in a u32x4 vector, and do a gather_epi32 from a table. Then remove the 0 bytes with _mm_shuffle_epi8 (like in https://lemire.me/blog/2017/01/20/how-quickly-can-you-remove-spaces-from-a-string/).
In the end it’s pretty much a wash with your version.
std::string output(4 * 3 + 3, '\0');
char *p = output.data();
__m128i bytes = _mm_cvtepu8_epi32(_mm_set1_epi32(__builtin_bswap32(address)));
__m128i strs = _mm_i32gather_epi32(table, bytes, 4);
__m128i zeroes = _mm_cmpeq_epi8(strs, _mm_set1_epi32(0));
int mask16 = _mm_movemask_epi8(zeroes);
__m128i x = _mm_shuffle_epi8(strs, _mm_loadu_si128((const __m128i *)despace_mask16 + (mask16 & 0x7fff)));
_mm_storeu_si128((__m128i *)p, x);
output.resize(16 - _mm_popcnt_u32(mask16) - 1);
return output;