Although your compiler should sort this out for you, it would be preferable to use the pre-increment operator unless you explicitly want the post-increment.
Can you demonstrate an example in C where there is any performance difference between pre-increment versus post-increment? (I do mean “C”, as in C99. Not C++.)
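For reference, the two forms under discussion differ only in the increment operator; when the value of the expression is discarded, as it is for a loop counter, any mainstream C compiler emits identical code for both. A minimal illustration (not code from the article):

```c
#include <stddef.h>

/* Two loops that differ only in the increment operator.  With the
   expression's value unused, i++ and ++i compile to the same machine
   code for scalar types. */
long sum_post(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)   /* post-increment */
        s += a[i];
    return s;
}

long sum_pre(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; ++i)   /* pre-increment */
        s += a[i];
    return s;
}
```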
wmu says:
Daniel, did you see what a decent compiler does with the scalar code? GCC 5.4 (and newer) makes an attempt to vectorize the code: https://godbolt.org/g/uXl6g1
The result is not the best code possible, so the performance isn’t good. Still, I’m impressed.
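For context, the scalar routine being discussed is essentially a byte-comparison reduction of the following shape (a sketch of the general pattern, not necessarily the post’s exact basiccount), which recent compilers will try to auto-vectorize at -O3:

```c
#include <stddef.h>
#include <stdint.h>

/* Plain scalar count: one branch-free comparison per byte.  Compilers
   recognize this reduction pattern and attempt to vectorize it. */
size_t scalar_count(const uint8_t *data, size_t n, uint8_t target) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (data[i] == target);
    return count;
}
```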
wmu says:
BTW, I rewrote your basiccount method using some SWAR tricks. The procedure in the gist [1] runs at 0.45 cycles per byte on Skylake.
[1] https://gist.github.com/WojciechMula/6200d3991c366b7d6b53c2dd35b785dc
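For readers unfamiliar with the term, SWAR (“SIMD within a register”) processes eight bytes at a time using only 64-bit integer operations. The sketch below shows the general idea with an exact zero-byte mask; it is not the optimized procedure from the gist, which uses additional tricks to reach its reported speed:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count bytes equal to `c`, eight bytes per iteration, without SIMD
   intrinsics.  Illustration of the SWAR idea only. */
size_t swar_count(const uint8_t *data, size_t n, uint8_t c) {
    const uint64_t lo7  = 0x7F7F7F7F7F7F7F7FULL;
    const uint64_t high = 0x8080808080808080ULL;
    const uint64_t pat  = 0x0101010101010101ULL * c;   /* c broadcast to 8 bytes */
    size_t count = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, data + i, 8);              /* unaligned-safe load */
        uint64_t x = w ^ pat;                 /* matching bytes become 0x00 */
        uint64_t y = ((x & lo7) + lo7) | x;   /* bit 7 of each byte set iff byte != 0 */
        count += (size_t)__builtin_popcountll(~y & high);  /* one bit per match */
    }
    for (; i < n; i++)                        /* scalar tail */
        count += (data[i] == c);
    return count;
}
```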
It seems that icc and clang also vectorize this problem. It is not surprising considering how common such code is likely to be. What is slightly disappointing is how they all seem to get it wrong, performance-wise.
Cristi C says:
According to https://github.com/vks/bytecounttest, llogiq’s solution runs in about 29ms against your 40ms in the `avxu` version (if I read that benchmark correctly). I’m a little worried about differences around inlining (since the bench code is written in Rust, linking with your C code), but I’d say that the perf is very much in the same ballpark.
Hmmm… the numbers I see reported are 29 ns vs. 40 ns. From my naive look at the Rust code, this seems to be measuring the time to process a whole string.
Cristi C says:
Sorry, I meant `ns`, just typo’d `ms` vs `ns`. I kept the same ratio though 😛
Yes, but I cannot trust these numbers. Did you manage to compile this code and run it for yourself?
Sorry, I was not able to run the benchmarks so far due to a build script error that I don’t have sufficient time to debug at the moment.
In case it helps, I personally have met vks at multiple meetups and trust his results are accurate (on his machine, YMMV).
To help explain this, bytecount in its current version also uses AVX, and still does the intermediate reduction, which enables it to count more bytes with one 512-bit counter.
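The intermediate reduction referred to here is the usual way to keep 8-bit lanes from overflowing: the inner loop accumulates at most 255 comparison results per byte lane, then a `_mm256_sad_epu8` step folds those lanes into 64-bit counters so the outer loop can keep going. A minimal sketch of that technique (a generic illustration, compiled with -mavx2; it is neither bytecount’s Rust code nor the article’s avxu routine):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Generic AVX2 byte count with an intermediate reduction.  The inner
   loop counts in 8-bit lanes (safe for up to 255 iterations); the
   _mm256_sad_epu8 step then folds those lanes into four 64-bit
   counters before they can overflow. */
size_t avx2_count(const uint8_t *data, size_t n, uint8_t c) {
    const __m256i target = _mm256_set1_epi8((char)c);
    __m256i totals = _mm256_setzero_si256();     /* 4 x 64-bit counters */
    size_t count = 0, i = 0;
    while (i + 32 <= n) {
        __m256i acc = _mm256_setzero_si256();    /* 32 x 8-bit counters */
        size_t blocks = (n - i) / 32;
        if (blocks > 255) blocks = 255;          /* avoid 8-bit overflow */
        for (size_t j = 0; j < blocks; j++, i += 32) {
            __m256i chunk = _mm256_loadu_si256((const __m256i *)(data + i));
            __m256i eq = _mm256_cmpeq_epi8(chunk, target); /* 0xFF on match */
            acc = _mm256_sub_epi8(acc, eq);      /* subtracting -1 adds 1 */
        }
        /* intermediate reduction: 8-bit lanes -> 64-bit lanes */
        totals = _mm256_add_epi64(totals,
                                  _mm256_sad_epu8(acc, _mm256_setzero_si256()));
    }
    uint64_t t[4];
    _mm256_storeu_si256((__m256i *)t, totals);
    count = t[0] + t[1] + t[2] + t[3];
    for (; i < n; i++)                           /* scalar tail */
        count += (data[i] == c);
    return count;
}
```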
“In case it helps, I personally have met vks at multiple meetups and trust his results are accurate”
It is not a matter of putting into question the integrity of the individual. Rather it is a matter of having numbers that say what you think they say.
I report processing input bytes at a rate of 0.04 cycles per byte.
What should I infer from the 29 ns vs. 40 ns number? That he tested a faster approach that can process input bytes at a rate of 0.03 cycles per byte?
That’s not an impossible number… but is that what is being claimed?
You see my problem?
When I write that I do not trust these numbers, I don’t mean that the author made up numbers or cheated in some way… I mean that I am not confident people know what they mean exactly.
Something else bothers me. He reports gains of two orders of magnitude over the basic code. I report a difference of a single order of magnitude. What explains this gap? Possibly the inefficiency of the Rust compiler, but once we start factoring in the possibility that the Rust compiler is deficient, all sorts of considerations come into play.
To put it bluntly, we need a lot more analysis to understand what is going on.
At this point, we do not have an analysis (e.g., how many CPU cycles are used per input byte) and it is not straightforward to reproduce the benchmark.
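For what it’s worth, turning a raw timing into cycles per byte requires two numbers that the benchmark output does not state: the size of the input buffer and the clock frequency of the test machine. The values below are placeholders chosen only to show the arithmetic; they are not figures from vks’s benchmark:

```c
#include <stdio.h>

/* cycles per byte = (elapsed seconds * clock frequency) / bytes processed */
int main(void) {
    double elapsed_ns = 29.0;    /* reported time, interpreted as whole-buffer time */
    double freq_ghz   = 3.4;     /* assumed clock frequency, NOT from the benchmark */
    double bytes      = 3000.0;  /* assumed buffer size, NOT from the benchmark */

    /* ns * GHz gives cycles directly (1e-9 s * 1e9 cycles/s) */
    double cycles_per_byte = (elapsed_ns * freq_ghz) / bytes;
    printf("%.3f cycles per byte\n", cycles_per_byte);  /* ~0.033 with these guesses */
    return 0;
}
```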
I’ve upgraded a similar SSE2 implementation to AVX2:
https://gist.github.com/powturbo
see also:
https://www.reddit.com/r/coding/comments/44ri8a/beating_the_optimizer/