The uglier version might be faster still if you reduced the calls to output.data() and stored the result in a local pointer rather than making three separate calls to the same function in the same scope.
Marcin Zukowski says:
Hi Daniel, as always, you take on interesting challenges. First, a minor fix: your code for ipv41 is incorrect. Check this out:
- output.append(std::to_string((address >> (i * 8)) % 256) + ".");
+ output.append("." + std::to_string((address >> (i * 8)) % 256));
Now, I took a quick look, and another idea came to me. All "parts" of the IP fit in at most 4 bytes, so instead of working with strings, you can work with integers. The code below is, on my notebook, 3x faster than the fastest previous version by Dimov from your repo. Hopefully useful!
std::string ipv52(const uint64_t address) noexcept {
  std::string output(4 * 3 + 3, '\0');
  char *p = output.data();
  p = to_chars_52(p, uint8_t(address >> 24));
  *p++ = '.';
  p = to_chars_52(p, uint8_t(address >> 16));
  *p++ = '.';
  p = to_chars_52(p, uint8_t(address >> 8));
  *p++ = '.';
  p = to_chars_52(p, uint8_t(address >> 0));
  output.resize(p - output.data());
  return output;
}
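ipv52 relies on a to_chars_52 helper from the repo that is not shown in this comment. A minimal stand-in (my sketch, not necessarily the repo's implementation) writes a byte in decimal and returns the advanced pointer:

```cpp
#include <cstdint>

// Hypothetical stand-in for the repo's to_chars_52: writes v in decimal
// at p and returns a pointer just past the last digit written.
char *to_chars_52(char *p, uint8_t v) {
  if (v >= 100) { *p++ = char('0' + v / 100); }
  if (v >= 10)  { *p++ = char('0' + (v / 10) % 10); }
  *p++ = char('0' + v % 10);
  return p;
}
```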
class Foo {
public:
  Foo() {
    for (int i = 0; i < 256; i++) {
      std::string s0 = std::to_string(i);
      std::string s1 = std::string(".") + s0;
      v0[i] = *(int *)s0.data();
      v1[i] = *(int *)s1.data();
      l0[i] = (i < 10) ? 1 : (i < 100 ? 2 : 3);
      l1[i] = l0[i] + 1;
    }
  }
  int v0[256]; // value for the 0th byte
  int v1[256]; // value for the 1+ byte
  int l0[256]; // length for the 0th byte
  int l1[256]; // length for the 1+ byte
};
Foo foo;
std::string ipv61(const uint64_t address) noexcept {
  std::string output(4 * 3 + 3, '\0');
  char *point = output.data();
  uint8_t by;
  by = address >> 24;
  *(int *)point = foo.v0[by];
  point += foo.l0[by];
  by = address >> 16;
  *(int *)point = foo.v1[by];
  point += foo.l1[by];
  by = address >> 8;
  *(int *)point = foo.v1[by];
  point += foo.l1[by];
  by = address >> 0;
  *(int *)point = foo.v1[by];
  point += foo.l1[by];
  output.resize(point - output.data());
  return output;
}
WordPress insists on stripping code. Sorry about that. I have had several fixes over the years for this issue, but they stop working with new WordPress updates.
Cool. For such small tasks SIMD might be overkill, but it’s great to see that it works. You should send a PR to Daniel. Unfortunately, I don’t have an x86 machine around to test it.
Another interesting fun task would be to write a non-scalar function (“vectorized” in database terms, not SIMD) that converts many ints into many strings. I wonder if there are opportunities there, and whether it could be much faster than N * time_of_scalar.
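As a rough illustration of what such a batched interface could look like, here is a sketch that amortizes allocation by writing every string into one flat arena; ipv_batch is a made-up name, and snprintf is used only to keep the sketch short, not for speed:

```cpp
#include <cstdint>
#include <cstdio>
#include <string_view>
#include <vector>

// Convert a whole batch of addresses into one flat character arena,
// returning views into it; one allocation serves the entire batch.
std::vector<std::string_view> ipv_batch(const std::vector<uint32_t> &in,
                                        std::vector<char> &arena) {
  arena.resize(in.size() * 16); // worst case: 15 chars + terminator
  std::vector<std::string_view> out;
  out.reserve(in.size());
  char *p = arena.data();
  for (uint32_t a : in) {
    int n = std::snprintf(p, 16, "%u.%u.%u.%u", unsigned(a >> 24),
                          unsigned((a >> 16) & 0xFF), unsigned((a >> 8) & 0xFF),
                          unsigned(a & 0xFF));
    out.emplace_back(p, size_t(n));
    p += 16; // fixed stride: every entry gets its own 16-byte slot
  }
  return out;
}
```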
Anonymous says:
What about std::stringstream? Is it too slow and thus not worth considering?
Added to the benchmark:
Time per string in ns.
stringstream 350.224
std::to_string (1) 78.8603
std::to_string (2) 79.012
blog post (similar) 16.5528
blog post 16.2065
Dimov 1 10.8528
Dimov 2 16.7799
Zukowski 3.86899
Zukowski by oc 3.89452
thin table 4.16718
combined table 2.99737
Marcin Zukowski says:
About the std::stringstream suggestion – I learned the hard way a while back that it should not be used anywhere performance-critical. It looks at locale settings, which can result in things like environment variable lookups, mutexes, etc. Terrible? Yes. Since then I am on a mission to educate people to NOT USE std::stringstream 😀
You can work around this with static variables etc., but it’s not worth the complexity.
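For context, the std::stringstream variant being benchmarked presumably looks something like this minimal sketch (my reconstruction, not the actual benchmark code; ipv_stream is a made-up name):

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Naive stringstream-based conversion: the slow variant under discussion.
std::string ipv_stream(uint32_t address) {
  std::ostringstream ss;
  ss << (address >> 24) << '.' << ((address >> 16) & 0xFF) << '.'
     << ((address >> 8) & 0xFF) << '.' << (address & 0xFF);
  return ss.str();
}
```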
foobar says:
It would also be plausible to pack the string (including the ending/preceding period) into a 32-bit integer; one should be able to extract the string with an AND and the length with a fixed unsigned shift right, which would allow using only one load operation per input byte. The two top bits of all string characters are zero, so just put the length bits in the two topmost bits of the integer…
With high ALU parallelism it might be competitive with the two-loads-and-no-arithmetic versions – also, it should be vectorisable. I suspect I’m too lazy to implement it, at least today.
foobar says:
It should be plausible to pack both the string and the length into a 32-bit integer, is what I meant…
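A sketch of this packed idea, assuming little-endian byte order; the table layout, the names packed and ipv_packed, and the exact bit placement are my guesses at the described scheme, not the eventual PR code. Each entry stores '.' plus up to three digits in the low 30 bits, and the length (minus 2) in the top two bits, which are free because '.', '0'..'9' are all below 0x40:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Packed table: for byte value i, the text ".<i>" sits in the low bytes
// and (length - 2) occupies the two topmost bits of the 32-bit entry.
static uint32_t packed[256];

static void init_packed() {
  for (int i = 0; i < 256; i++) {
    char buf[8] = {0};
    int len = std::snprintf(buf, sizeof(buf), ".%d", i); // 2..4 chars
    uint32_t v;
    std::memcpy(&v, buf, 4);                 // little-endian layout assumed
    packed[i] = (v & 0x3FFFFFFFu) | (uint32_t(len - 2) << 30);
  }
}

std::string ipv_packed(uint32_t address) {
  char buf[20];
  char *p = buf;
  for (int shift = 24; shift >= 0; shift -= 8) {
    uint32_t e = packed[(address >> shift) & 0xFF];
    uint32_t chars = e & 0x3FFFFFFFu; // one AND recovers the characters
    std::memcpy(p, &chars, 4);        // always store 4 bytes; tail gets overwritten
    p += (e >> 30) + 2;               // one shift recovers the length
  }
  return std::string(buf + 1, size_t(p - buf - 1)); // drop the leading '.'
}
```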
foobar says:
Merge request is in, and this “combined table” beats others on Mac M1; no idea how it fares on x86.
$ ./str
Running tests.
Functions are ok.
Time per string in ns.
std::to_string (1) 85.8071
std::to_string (2) 90.9182
blog post (similar) 18.6132
blog post 18.5872
Dimov 1 11.8821
Dimov 2 18.5636
Zukowski 4.15277
Zukowski by oc 4.22096
thin table 4.54168
combined table 3.78396
foobar says:
And with a minor modification, I was able to push this to 3.3 ns on Mac M1. I wonder how large a portion of that is actually the call to the string allocation routine?
I believe the std::string calls (inlined or not) are a significant source of overhead when the single-iteration run time of a function like ipv81 is around 3 nanoseconds. Just eliminating a single store saved 6% of the run time. Compared to a hypothetical plain-C-string variant, this code zeroes the string unnecessarily (optimally one or two stores, depending on microarchitectural details) and, at best, does whatever the constructor does (if worse, it could do a lot more), which is pretty certainly more than just a little, since this code doesn’t get inlined.
In theory compilers could do better, but apparently not in practice…
foobar says:
I rewrote ipv81 in plain C, writing a zero-terminated string to a fixed, pre-allocated buffer (and forcing the compiler to run all iterations properly – I did check the assembler output). My M1 ran this routine at 1.19 ns/iteration, while the earlier variations took 3.04–3.3 ns depending on slight variations in the code. So the vast majority of the benchmarked time in the fastest variations is probably “overhead.”
foobar says:
… and that can still be pushed further, to 1.02 ns, of which probably a quarter of a nanosecond goes to looping. If the lookup table entries are made 64-bit, the masking of the string can be avoided, which drops per-iteration time to 0.88 ns, but I highly doubt that makes much sense in the real world.
So, with hot caches one can achieve roughly one-nanosecond (3.25 cycle) throughput on IPv4 string generation, at least on amenable microarchitectures.
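A guess at what the plain-C shape with 64-bit table entries might look like (the names wide, init_wide, and ipv_c and the exact layout are mine, and little-endian byte order is assumed; this is not foobar's actual code). The 8-byte store needs no masking because whatever lands past the segment is overwritten by the next segment or cut off by the terminator:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// 64-bit entries: ".<i>" in the low four bytes, segment length in byte 4.
static uint64_t wide[256];

static void init_wide(void) {
  for (int i = 0; i < 256; i++) {
    char t[8] = {0};
    int len = std::snprintf(t, sizeof(t), ".%d", i); // 2..4 chars incl. the dot
    uint64_t v;
    std::memcpy(&v, t, 8);              // little-endian layout assumed
    wide[i] = v | (uint64_t)len << 32;  // byte 4 of v is zero, so this is clean
  }
}

// Writes the dotted quad into buf (needs >= 24 bytes of slack) and
// returns a pointer to the zero-terminated string inside it.
const char *ipv_c(uint32_t address, char *buf) {
  char *p = buf;
  for (int shift = 24; shift >= 0; shift -= 8) {
    uint64_t e = wide[(address >> shift) & 0xFF];
    std::memcpy(p, &e, 8);              // unmasked 8-byte store
    p += (e >> 32) & 0xFF;              // length lives in byte 4
  }
  *p = '\0';                            // cut off the trailing junk
  return buf + 1;                       // skip the leading '.'
}
```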
David Fetter says:
Did you happen to keep this code around?
foobar says:
I don’t think I have it in version control, but there wasn’t really anything particularly fancy; just handle strings as static C char arrays when possible, and avoid dynamic allocation in inner loops. My general C++-strings-oriented changes are in the repo, though.
One thing I noticed was that these routines didn’t translate that well to x86. Certain idioms of ARM assembly, such as barrel shifters integrated into other instructions, made the implementation significantly more efficient on ARM than on x86.
There’s a copy/paste bug on lines 261 and 262 in str.cpp.
Argh, copy-pasting somehow doesn’t work.
Created a PR here: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/pull/70
BTW, you can extend my idea probably by combining two integers into a long and one write, not sure if it’s worth optimizing further though.
Hi Marcin,
I made a vectorized version that uses similar ideas to yours: stick the bytes in a u32x4 vector, and do a gather_epi32 from a table. Then remove the 0 bytes with _mm_shuffle_epi8 (like in https://lemire.me/blog/2017/01/20/how-quickly-can-you-remove-spaces-from-a-string/).
In the end it’s pretty much a wash with your version.
std::string output(4 * 3 + 3, '\0');
char *p = output.data();
__m128i bytes = _mm_cvtepu8_epi32(_mm_set1_epi32(__builtin_bswap32(address)));
__m128i strs = _mm_i32gather_epi32(table, bytes, 4);
__m128i zeroes = _mm_cmpeq_epi8(strs, _mm_set1_epi32(0));
int mask16 = _mm_movemask_epi8(zeroes);
__m128i x = _mm_shuffle_epi8(strs, _mm_loadu_si128((const __m128i *)despace_mask16 + (mask16 & 0x7fff)));
_mm_storeu_si128((__m128i *)p, x);
output.resize(16 - _mm_popcnt_u32(mask16) - 1);
return output;