Out of curiosity, which compiler & version did you use for the C++? (The small string optimization in recent C++ makes them very close to fixed length strings for small strings, at the cost of a few more instructions to decide which version is being used.)
My results are much the same under clang. With the Intel compiler, fixed-length strings are no longer any faster, but that’s because they get to be as slow as the variable-length ones.
I don’t have easy access to old compilers on this exact machine.
davetweed says:
I’ve had a look at your code, and I wouldn’t be surprised if the biggest contribution to the speed difference is that the fixed-length, 8-byte padded strings allow you to use a very optimized comparison function. (I don’t have a profiler set up on my new machine yet.)
This is an interesting result, and I agree that fixed-length strings are interesting; I’m just trying to figure out what precisely is happening on the machine.
It is still not clear whether you benchmarked the C++11 std::string or the C++98 std::string: http://developers.redhat.com/blog/2015/02/05/gcc5-and-the-c11-abi/
clang can use the GNU C++ standard library, so the next question is which C++ standard library you benchmarked with when using clang.
The command line I used to compile my small program is in the first line of the source code:
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2016/10/05/pointersort.cpp
This is a fresh Ubuntu install (see previous comment for version number).
I encourage you to run your own experiments.
Memory alignment could be playing a role here.
A useful test might be to allocate the memory for std::string using aligned_alloc (alignment = 8).
I’d be interested in any number you have but I am skeptical regarding memory alignment in this case. I doubt memory access is the bottleneck.
I do not have Ubuntu, but I would still like to know which std::string was used in your benchmarks.
I suspect that a real application spends very little overall time sorting strings, so optimizing that part will have little impact on the overall performance.
I recently profiled some data acquisition middleware and discovered that it spent 93% of its time waiting on the driver to deliver raw data. Optimizing the remaining 7% would be largely a waste of my time.
Looking more closely at the driver, I further uncovered that I was operating at about 70% of the underlying transport mechanism’s capacity, and improving this would require better firmware on the device side (reducing gaps between bytes and strings).
So, even though I was adding arrays of integers together, and I could see ways to optimize these additions, it would make no appreciable difference overall.
@Dominic
You’ll get no argument from me.
Many systems have their performance bounded by I/O, the network, or other architectural constraints. That’s why we build massive systems using JavaScript (which does not even support parallelism!).
Abhimanyu Rawat says:
As you said, it’s application dependent: in MySQL, where a fixed length is required, the performance should be better. Going further, in C, padding strings so that they fit in a machine word introduces some architecture dependency as well, right, when tuning to avoid pointers?
in MySQL where a fixed length is required the performance should be better
I expect so but I’d be interested in seeing numbers.
GCC’s strings are notoriously bad because they are copy-on-write. They even used to involve atomic operations in simple constructors, not to mention that GNU couldn’t produce a bug-free string implementation for multi-core environments for many years. Copy-on-read strings are probably better (unfortunately, STLport died).
That said, even with copy-on-read strings, accessing the string requires a dereference operation, which slows things down. Try a variant of the string that keeps the counter followed by the data buffer; that one could be fast, IMHO.
davetweed says:
I think that used to be the case prior to the changes for C++11. (Not that g++’s implementation is brilliant, but it does use the SSO now.)
Good to know, thank you! Are 1024 chars a short string?
To clarify: in the test, my strings are short (much less than 8 bytes), but I sort 1024 of them.
BTW, Java’s strings are immutable for a good reason: in a multi-threaded environment, it’s often much cheaper to copy a small string than to block a gazillion threads on trivial string operations.
That’s a good point.
Rust and Swift offer both mutable and immutable strings.
KWillets says:
We’ve unfortunately hit the point where most dictionary words fit into a single machine word; the classic benchmark of sorting the dictionary turns out to be slower with classic algorithms than sorting it fixed-length.
Some current algorithms use the fixed-length method by caching 4-8 bytes in an int next to the string pointer, for instance here: http://rd.springer.com/chapter/10.1007%2F978-3-540-89097-3_3#page-1. It is indeed faster.
It seems like adding length to the type in certain cases, similar to std::array, ultimately gives the compiler/optimizer more information to optimize with.