Daniel Lemire's blog


Is C++ worth it?

106 thoughts on “Is C++ worth it?”

  1. Siah (@siah) says:

    Hi Daniel,

    You start by highlighting the role of efficient languages in energy consumption of devices. So let me ask you this. Don’t you think that languages like C++ can help in lowering the energy usage?

  2. mars says:

    Wow it’s surprising how fast Java is. Who would have guessed?!

    If we are doing scientific computation, and sending off the code to clusters, would Java perform similarly good?

  3. @Siah

    Don’t you think that languages like C++ can help in lowering the energy usage?

    Yes, I think so.

  4. @mars

    If we are doing scientific computation, and sending the code off to clusters, would Java perform similarly well?

    A lot of cluster computing is done in Java and it seems to work well.

  5. Max Lybbert says:

    Is there any reason std::partial_sum isn’t in the test ( http://www.cplusplus.com/reference/std/numeric/partial_sum/ )? Sure, most C++ programmers probably do write the loop out as shown, but personally, I reach for STL algorithms before I try optimizing by hand.

  6. @Max

    Good point.

    Is partial_sum fine for in-place computation?

  7. Donny says:

    The amount of work that has gone into the JVM is amazing, and the first time I saw the Computer Language Benchmarks site, I was shocked at how tiny the margin over C++ was (the other surprise being Haskell!).

    However, there's a warm-up/start-up cost to be paid, and I wonder how much control you get in cases where CPU cache issues affect your performance.

  8. Ben L. Titzer says:

    Please stop posting misleading microbenchmarks like these. Your methodology is not rigorous and your benchmark is 3 lines of code. The whole loop might be replaced with a nop if the array doesn't escape.

  9. Max Lybbert says:

    I believe partial_sum is OK for in-place computation: see Note 1 in SGI's documentation ( http://www.sgi.com/tech/stl/partial_sum.html ) and the first sentence under “Remarks” in Microsoft's ( http://msdn.microsoft.com/en-us/library/et8wc8tt.aspx ).
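
    For example, the in-place use is just this (a minimal sketch):

    #include <numeric>
    #include <vector>

    // The output range may coincide with the input range, so this
    // computes the cumulative sum in place.
    void cumulative(std::vector<int>& data) {
        std::partial_sum(data.begin(), data.end(), data.begin());
    }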

  10. Pierre Barbier de Reuille says:

    Interestingly, here is what I get for the basic sum implementation:

    Java: ~1470
    gcc 4.7 with -O3: ~960
    clang 3.1 with -O3: ~1350

    optimized C++ version:
    gcc 4.7: ~1350
    clang: ~1470

    Also, if you use the std::partial_sum function instead, you get ~1100 with gcc 4.7, but it gets slower with clang…

  11. @Ben

    The whole loop might be replaced with a nop (…)

    It could but it is not.

    Your methodology is not rigorous

    Fair enough, but are you saying that my conclusions are wrong?

    your benchmark is 3 lines of code.

    These could be 3 very important lines of code in an application.
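
    For the record, a standard way to make sure a benchmarked loop cannot be replaced with a nop is to consume its result; a minimal sketch (not the exact benchmark harness):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    void run(std::vector<int>& data) {
        for (std::size_t i = 1; i < data.size(); ++i)
            data[i] += data[i - 1];
        // Using the final value keeps the compiler from discarding
        // the whole loop as dead code.
        if (!data.empty())
            std::printf("%d\n", data.back());
    }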

  12. And says:

    Why do you think Java is a higher-level language than C++? Higher because it has one more abstraction level?

    I think that is not the usual definition.

  13. @And

    I was going to add a disclaimer saying that some people would argue that Java is not a higher level language.

    Just accept for the sake of this discussion that Java is a higher level language.

  14. Itman says:

    Ben L. Titzer,

    it is a well-known fact that compilers and runtimes for higher-level languages can sometimes optimize code better than C++ compilers. Given that Java is equipped with a just-in-time compiler/profiler, the results here are not surprising at all.

    Next, there are scenarios where a garbage collector can be much faster than malloc (though, in general, performance is comparable).

    Last but not least, the copy-on-read behavior favored in many (especially functional) languages can be beneficial in multi-threaded applications. Few people know it, but the current reference-counted implementation of strings in GCC can kill your performance! I know of examples where using the STL leads to a 2-fold increase in performance (for the application as a whole, not only for a small part).

  15. Christan says:

    The STL is tough to optimize. The vanilla array version beats the fast-sum std::vector version, about 1560 to 1390.

    int data_array[50*1000*1000];
    for (size_t i = 0; i != 50*1000*1000; ++i) {
        data_array[i] = (i + i/3);
    }

    for (size_t i = 1; i != 50*1000*1000; ++i) {
        data_array[i] += data_array[i - 1];
    }

    But your point is well taken.

  16. Andy says:

    But if Java is not a higher-level language, then you make no point.
    The numbers you provide are just for Java. Looking at any scripting language in the benchmarks game reveals big differences between C++ and Java on the one hand and languages like Ruby, Python, Lua, Clojure, … on the other.

    If you want to argue “Java is sometimes as fast as C”, OK. Ben gave some good arguments why this can be the case.

    But this is not really connected to “higher level”, IMHO.

  17. J. Andrew Rogers says:

    Java has long been competitive with C/C++ for tight numerical loops.

    Where C/C++ is consistently faster than Java, by integer factors, is in memory-intensive code. Hence, for example, C/C++ is still the preferred language for database engines and high-performance server code. Even if you use ugly, non-idiomatic Java for performance purposes, it is not uncommon for C++ to be 2x faster for applications that abuse the memory subsystem.

    First, *optimal* C++ code might be faster or more efficient than Java (or other higher-level code). Most programmers do not write optimal code most of the time. Average programmers tend to turn out fairly horrible code, and C++ code can be hideously dangerous in careless hands. These are not casually interchangeable choices.

    Second, the simple example given here is meant to raise a valid point, not to provide comprehensive proof. Some sorts of code optimize to near the same level of performance across languages.

    Third, your choice of algorithm can make a greater difference than a more efficient compiler. If a higher-level language allows you to write the first version in less time, and eases later refactoring, then you have the opportunity to fold in better algorithms when you understand the problem better. So for the same programmer effort, a higher-level language can yield a more efficient end product.

  19. Gregory says:

    I also believe such microbenchmarks lead people to draw wrong conclusions, yes.

    You didn’t even look at the generated asm code did you?
    You’re missing all the steps of:
    1) benchmark something real, with a wall clock (the time command is useful)
    2) profile and identify hot spots
    3) look at generated assembly

    Here are the results on my linux, g++ 4.7.1, x86
    basic sum 409.836
    smarter sum 403.226
    fast sum 409.836

    and here are the results on my mac, g++ 4.2, x86_64
    basic sum 1086.96
    smarter sum 724.638
    fast sum 847.458

    what do my numbers tell you? and then, how smart are the smart implementations?

    here is the modified code:
    http://pastebin.com/F67R8pUK

    depending on the compiler, data.size() calls may be issued PER ITERATION
    same goes for operator[]'s implementation, which may dereference this->_m_data (or whatever it's called in your STL implementation), again PER ITERATION

    why is it so? the C++ optimizer might not realize size() hasn't changed, or that this->_m_data is the same as in the previous iteration, etc… there are many reasons for that; one of them can be pointer aliasing
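
    to illustrate, hoisting those loads by hand (a sketch, not code from the post) removes the per-iteration reloads:

    #include <cstddef>
    #include <vector>

    void cumulative_hoisted(std::vector<int>& data) {
        const std::size_t n = data.size(); // evaluated once, not per iteration
        int* p = n ? &data[0] : 0;         // one dereference of the backing array
        for (std::size_t i = 1; i < n; ++i)
            p[i] += p[i - 1];
    }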

    interestingly enough, JIT compilers/optimizers likely realize this and perform well!

    likely, you can come up with a microbenchmark involving virtual functions: the C++ compiler would generate indirect vtbl calls per iteration, while at some point the JIT compiler might realize there's nothing virtual and it's always the same leaf method invoked again and again, and save the indirect call. get the idea?

    anyways, very capable people explain that better than me and once in a while someone on the internet pops out with a naive (excuse me but it is) microbenchmark and draws big conclusions about languages

    cheers,
    g

  20. @Andy

    Look at JavaScript in the benchmarks. It fares nearly as well as Go.

  21. Curious mind says:

    Your test shows you have little understanding of how a processor works. The way the code is written leaves everything to be optimized in the processor's registers; your findings have little to no real significance. Try writing code that will force data to be copied to and from registers/RAM and run your comparison again. Only then may you find something meaningful… then revise your perspective.

  22. @Gregory

    while someone on the internet pops out with a naive (excuse me but it is) microbenchmark and draws big conclusions about languages

    Here’s what I wrote: “Of course, from a sample of a 3 compilers on single problem, I only provide an anecdote, but there are extensive and well established benchmarks supporting my points”.

    You didn’t even look at the generated asm code did you?

    Yes I did.

    You’re missing all the steps (…)

    How do you know I did not benchmark something real, identify the hot spot, read the assembly, and try different compilers? Because this is exactly what I did, and I've got witnesses.

    what do my numbers tell you? and then, how smart are the smart implementations?

    The version I do report in the blog post is a default implementation in C++, something many programmers would use. Yes, you can do better, but I also acknowledge this.

    For the record, I did test with GCC 4.2, but using different machines… here are the results…

    My GCC 4.2 tests using my code on my macbook air (I keep only best results):

    $ ./cumulsum
    reporting speed in million of integers per second
    test # 0
    basic sum 120.773
    smarter sum 352.113
    fast sum 847.458

    Same test, same machine, with GCC 4.7 (still on the macbook, again best results):

    $ ./cumulsum
    reporting speed in million of integers per second
    test # 0
    basic sum 1111.11
    smarter sum 1388.89
    fast sum 1388.89

    Best Java result: 1315.

    Here are the results on an older iMac… first with GCC 4.2:

    $ ./cumulsum
    reporting speed in million of integers per second
    test # 0
    basic sum 537.634
    smarter sum 549.451
    fast sum 617.284

    next with GCC 4.7:

    $ ./cumulsum
    reporting speed in million of integers per second
    test # 0
    basic sum 1041.67
    smarter sum 1351.35
    fast sum 1351.35

    Best result using Java: 1219

    Of course, all my tests are done using a core i7 processor. Results will vary if you have a different processor.

    Still, in both cases, Java beats GCC 4.2 by a wide margin and GCC 4.7 is not much faster.

    BTW your numbers appear low. How old is your hardware? My iMac dates back to 2009 and it can easily do over 1000 million integers per second.

  23. @Curious mind

    Your test shows you have little understanding of how a processor works. The way the code is written leaves everything to be optimized in the processor's registers; your findings have little to no real significance. Try writing code that will force data to be copied to and from registers/RAM and run your comparison again.

    Why on Earth would I introduce extra copying?

  24. bad_sheep says:

    This is exactly my feeling. I am in a field where performance has real importance (video), and we are probably one of the very few companies in it working in Java.

    Even though some of our products are not supposed to be well suited to Java (always reading the same kind of data very fast, but without handling the allocations ourselves), we reach the same performance as our competitors (most of the time, it is even better) because we simply spend less time on micro-optimization and can work on real algorithmic optimizations instead.

    But the key point is that our team can do it in less than half the time the competitor would need to build the same product. I am not even talking about parallelization, which is sometimes very difficult in C/C++, while Java (while not the best at that game) offers far more tools.

  25. sumodds says:

    I can see what you are saying, to some extent.

    Since many have been complaining about the lack of a realistic benchmark, here is a decent one. As you and many others mention, Java might be off by a factor of two.
    http://shootout.alioth.debian.org/

    I think this kind of goes in line with Paul Graham's thoughts, where he explains ITA's use of Lisp for leverage on a specific class of problem.
    http://www.paulgraham.com/icad.html

  26. @sumodds

    Thanks. I do link to the shootout in the post, even though my critics failed to notice. 😉

  27. Max Lybbert says:

    @Christian beat me to it: the fact that the original code uses a std::vector places a limit on how much can be handled at compile time. Assuming that you have the stack space, using a stack-allocated array would give the compiler enough information (where every element of the array is allocated, and what that value will be) to actually do the entire calculation at compile time and allocate the array with the final results. It’s hard to imagine a more optimized version than that.

    I’m not sure if any compilers would go that far, but it would be theoretically possible.

  28. Peter Ashford says:

    I’m a Java programmer and I’ve done lots of very high performance code in Java (I used to be a C/C++ programmer, so I know what the difference is in optimising both platforms also). I agree with your core point: Java can be very fast but also fast at much less development cost than C or C++ programs, but one area where C and C++ always win over Java (at least in my benchmarks) is memory usage – if use of memory must be minimised then C/C++ programs can often be made that use half the memory of an equivalent Java program. Personally, I’m ok with that trade-off: Memory is cheap and abundant for most purposes whereas development time is a big deal.

  29. Chris Nahr says:

    Interesting experiment. On a Core i7 920 machine with Windows 7 x64, the 64-bit Oracle JVM produces ~1470 mi/s, similar to other results posted here.

    Then I felt like being mean and tested Microsoft Visual C++ and C# (both VS2010 SP1, max optimization). The results were only ~450, roughly the same for both compilers and all tests, even when I used a C++ array instead of vector. That matches my anecdotal experience that Microsoft’s compilers are rather terrible at optimizing numerical code.

    Overall I think the results posted here are illuminating, not because Java beats C++ in some benchmark but because such a simple Java algorithm beats so many C++ compilers. If you need a specific combination of C++ compiler and hand-optimized C++ algorithm to narrowly beat a dead simple Java algorithm, that’s a very compelling argument for Java and against C++.

  30. Jimmy Smith says:

    I find this sort of example a little frustrating. It is similar to what a Harvard grad must feel when hearing, “My school's Energy Economics department is just as good as Harvard's.”

    For almost any language implementation, you can find an example where it performs better than the similar C program. The real performance issue is: what do you do when your code is off from the expected run time by a factor of 10 or 100? With C and some experience, you can almost always coax the compiler into doing the right thing. With Java, there can be some limitation in the JVM implementation which you cannot work around. In that case, you are out of luck (or you use JNI).

    This, along with malloc/free (I cannot afford to wait for the garbage collector to do its job), is why C and C++ (and Fortran) are still being used, and will continue to be used.

    That said, unless you are one of the small number of people writing HPC code for multi-million dollar computers, C is probably not the best language for you. I agree with the author; Java is a great language. However, I find this speed argument does Java a disservice.

  31. Pierre Barbier de Reuille says:

    Just a few other things: as you know, the new C++ standard introduced features intended to make code more efficient. And indeed, if you add the function:

    void partialSum(std::vector<int>& array)
    {
        int acc = 0;
        for (int& i : array)
        {
            i = (acc += i);
        }
    }

    then with gcc on my machine you get ~1350 (~1470 for Java).

    And, at last, if you use float instead of int, suddenly Java is slower than gcc:
    java ~730
    gcc 4.7 ~960
    clang 3.1 ~960

    I find it interesting that their optimisation doesn't trigger for floats.

  32. garbage says:

    Well, Java is a fast language… until GC kicks in and everything freezes for several seconds. 😛 I heard this again recently from developers of a social network site written in Java.

    I think your benchmark is flawed because it is a microbenchmark with a very clear and small (not much code) “bottleneck”. Sure, it's easy for the JIT to identify and optimize it. However, in real applications there is no 3-line bottleneck. Most of the time is spent in hundreds and thousands of lines of code, and the JIT adds overhead itself while trying to find spots to optimize.

    Code produced by a mediocre Java programmer might perform better than code written by a mediocre C++ programmer because of tool support/JIT. However, with the recent C++11 standard, C++ has become much better, and it's less verbose than Java.

  33. Christoph says:

    The C++ program produces undefined behaviour because it creates integer overflows. 🙂

  34. Kumar says:

    Whoa! No need to reinvent the wheel with WallClockTimer. The standard way to benchmark is to use time.h (ctime in C++).

    Example (http://www.cplusplus.com/reference/clibrary/ctime/difftime/):

    time_t start, end;
    double overhead, elapsed;
    time(&start);
    time(&end);
    overhead = difftime(end, start);
    time(&start);
    // code to benchmark goes here…
    time(&end);
    elapsed = difftime(end, start) - overhead;

    With these changes, I get the elapsed time as 0.0!

    Meaning, the problem is how you are measuring! My code is here; it uses another method (using time_t):

    https://gist.github.com/3168607

  35. Dominic Amann says:

    I find that in real-world programs, the apparent speed of Java seems to diminish as size and memory use increase; so for complex, numerically intense real-world applications, a core library of C/C++ for number crunching will outperform Java by more than a factor of 2 (and sometimes an order of magnitude). Factor in precise control of memory handling (for large arrays), and C++ is definitely the way to go.

  36. bdsatish says:

    I used C++11’s high_resolution_timer to measure accurately to milliseconds. The C++ version and my results are here:

    https://github.com/bdsatish/cumulsum
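
    For reference, such a timer takes only a few lines with std::chrono (a minimal sketch, not necessarily the exact code in the repository above):

    #include <chrono>
    #include <cstdio>

    int main() {
        using namespace std::chrono;
        const auto start = high_resolution_clock::now();
        // ... code to benchmark goes here ...
        // Unlike time(), which has one-second resolution,
        // this resolves microseconds.
        const auto us = duration_cast<microseconds>(
            high_resolution_clock::now() - start).count();
        std::printf("elapsed: %lld us\n", static_cast<long long>(us));
        return 0;
    }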

  37. Gregory says:

    @Daniel

    linux: intel core 2 6600 @ 2.4ghz
    mac: 2.66 GHz Intel Core i7

    compiled with g++ -O3 -o cumulsum cumulsum.cpp


    ok you did look at asm; but you didn’t tell us you did, and a reader of your post surely doesn’t notice the steps I mentioned are in fact mandatory

    the point is the post conveys the false idea a small microbenchmark is enough to draw conclusions about relative language performance

  38. @Gregory

    ok you did look at asm; but you didn’t tell us you did, and a reader of your post surely doesn’t notice the steps I mentioned are in fact mandatory

    Maybe, but I was not trying to say anything about optimizing code. I am just stressing that Java (and also JavaScript) is faster than you might think. If this were an obvious and well-known fact, I don't think this post would have gotten so much reaction (including angry reactions).

    the point is the post conveys the false idea a small microbenchmark is enough to draw conclusions about relative language performance

    Perhaps it can give that impression, but my post says no such thing, and as a scientist I would never draw such conclusions so easily. I do, however, rely on and link to existing benchmarks.

    It should be a known fact by now that Java and JavaScript are very fast languages. There are good reasons to favor C++ over Java, but raw speed is probably not a good reason in most cases.

    This looks like a political issue. I think a lot of people have been saying “we can't write this in Java because it will be slower, we need C++”. I think a lot of these folks did not run benchmarks to justify their choices. They work from prejudice.

  39. @Christoph

    The C++ program produces undefined behaviour because it creates integer overflows.

    The code in the blog post does, granted, but the actual code being benchmarked does not because we check whether the size is zero. I simplified the code when I copied and pasted it in the blog post.

    So I don’t think that’s the issue.

  40. Gregory says:

    @Daniel,

    (I’m not angry, sorry if I gave that impression — blame media and english not being my mother tongue?)

    I get your point.

    “If your sole reason for using C++ is speed, and you lack the budget for intensive optimization, you might be misguided.”

    It’s even more than that. If you’re sole reason for using C++ is speed, and you apply OOP like in Java then you’re very misguided. Pointer chasing, scattered heap allocations, lack of data locality, cache trashed by vcalls, etc. All those come from doing OOP by the book and go against program efficiency

  41. GDR! says:

    Yeah, right. I too tried to avoid low-level programming when writing a 3-D Kirchhoff migration implementation (n^5 complexity). I executed the C# version only a few times, each run taking about two days to complete, before moving to carefully optimized C. It now takes only a few hours to complete, and I'm still going to rewrite it in OpenCL (which is – guess what – a C variant) to make it faster.

    It depends what you’re doing, of course, but for computation intensive applications C is still unbeatable.

  42. @garbage

    Well, Java is a fast language… until GC kicks in and everything freezes for several seconds. 😛 I heard this again recently from developers of a social network site written in Java.

    Garbage collection can indeed be a problem, but this is perhaps overstated. Good Java programmers reuse objects, for example, to avoid stressing the garbage collector.

    This being said, given the choice between writing a social network in C++, JavaScript or Java, I would go with JavaScript.

    The C++ version could be faster if you have the kind of budget that Facebook has. Otherwise, it is almost certainly going to be a nightmare.

  43. Pierre Barbier de Reuille says:

    It is sad to note that what could have been a nice little test turned quickly into C++ bashing, and an unjustified one at that! Try your program with any type besides int and the performance drops to well behind the simplest C++ solution.
    You could almost think the Java compiler is “cheating” by hyper-optimizing this microbenchmark. Otherwise, how do you explain that adding short integers is twice as slow as adding integers?
    As usual for any benchmark, stick to what the benchmark says; don't extrapolate, ever!

  44. @Pierre Barbier de Reuille

    It is sad to note that what could have been a nice little test turned quickly into C++ bashing, and an unjustified one at that!

    I’m not bashing C++. I write C++ almost every single day, by choice.

    how do you explain that adding short integers is twice as slow as adding integers?

    I just checked in a new test with shorts. I find that Java is 25% slower with shorts than with ints. I get nothing close to “twice as slow”. And even then, Java can still easily beat C++ under GCC 4.5.

    I don't think this is a case where Java optimizes the int case and fails to optimize shorts. As a rule, you should expect shorts to be slower, unless you have lots of cache misses. (Compilers might be able to turn the use of short to their benefit in specific cases, but failure to do so is not a surprise to me.)

  45. Christoph says:

    @Daniel I wrote about integer overflow, not about invalid memory accesses. The program overflows int at index 56756 on my machine, and data[56756] becomes negative.
    This can certainly be ignored here and will not affect the results, but it shows the perils of C++ programming.

  46. Isaac Gouy says:

    >>Really clever and hard working C/C++ programmers can beat higher-level languages by a wide margin given enough time. However, their code will typically become less portable and harder to maintain.<<

    Maybe the Java and JavaScript programs that perform well in the benchmarks game were contributed by clever Java and JavaScript programmers who put a lot of effort into those programs?

    Maybe those Java and JavaScript programs have also become less portable and harder to maintain?

    (Less portable in the sense that the same performance is not maintained – performance is brittle).

  47. @Isaac Gouy

    The shootout benchmark is open source and you can inspect their source code. It looks good to me.

    There is no doubt that you can make JavaScript unreadable, but my point is that it is unfair to compare a heavily optimized C++ program with a straight JavaScript program. C++ offers you more room for improving your performance, so if you have a lot of time on your hands, then go for C++. But in real life, people have to ship software at some point. So an honest comparison is one where you just take nice looking code in both languages.

  48. sj says:

    Java uses JVM optimization on the loop.

    So it is silly to compare the Java loop and the C++ loop just on the source file. You need to compare the assembly-language output, and try to match the source code in both languages, to get a fair comparison.

    Try running your iteration 20 times and see the difference. Why is looping a million times important? If that is important, then just use a linear algebra library.

  49. Keck says:

    Have you tried a good C or C++ implementation using intrinsics?

    Java 1785
    GCC 4.7 (with -funroll-loops) 2000
    Is that not more than 10%? Are you saying loop unrolling is not a “fair optimization” to compare or use? If you're doing this type of operation, you'd be silly not to take advantage of it.

    I agree with you that unless you have the time/ability/compiler/etc. to make a very optimized hand-tuned version, it's probably not worth it, except in the cases where it is. The point I feel is better expressed is: each language has its purpose, but sometimes, for some problems, you get more for free in different languages.

    I've worked on some absolutely, incredibly stringent projects where I couldn't use, say, the Math Kernel Library matrix multiplication, even though it is super optimized, because it wasn't optimized as well as it could be for some obscure matrix sizes, e.g. m = 28×16384 and n = 16384×856. So I got to write a hand version, and sometimes template metaprogramming won for some sizes on smaller multiplications; sometimes (most of the time for me) intrinsics won when you get to this level of optimization. It depends almost completely on what you're trying to accomplish.

    Now, the more recent C++ compilers catch the easy cases by planting SSE2 or better where they can (which is why your GCC 4.7 is so much faster than GCC 4.5), and loop unrolling can save a lot too, especially as your iteration count gets larger. So I don't consider that an unrealistic optimization. Java's VM most definitely attempts loop unrolling.

    In this example, you are masking some of the power of what the C/C++ compiler, or hand optimization in C/C++ period, can do by how you're choosing to do your sum. Because you are always doing “i - 1”, you are forcing the intrinsics to use an unaligned load for the 128-bit blocks of integers: never on a 16-byte boundary, meaning a more costly load even if the compiler does catch the SSE optimization. If you could spare the memory up front to keep two copies, your sum might be faster. Also, depending on how many registers I have, I can decide to do things like four 128-bit loads at a time, accumulate partial sums, and then add up the sub-blocks. Or, if I know my values are small, I can saturate a 128-bit block and do addition on shorts, eight at a time. The point is, it's one thing to say “there are examples where other languages get you pretty great results out of the gate,” but it's another thing to take an algorithm that is not written the optimal way for one language and then say, “look, it's not much better than language X.”

    const size_t C_SIZE = 100000000; // or whatever sufficiently large value; if it's small you wouldn't use intrinsics anyway
    // remainder: the few trailing elements not covered by the 4-wide SSE loop
    for (size_t i = 1; i != C_SIZE - remainder; i += 4)
    {
        __m128i left  = _mm_loadu_si128( (__m128i *)&data[i] );
        __m128i right = _mm_loadu_si128( (__m128i *)&data[i - 1] );
        _mm_storeu_si128( (__m128i *)&data[i], _mm_add_epi32(left, right) );
    }

    for (size_t i = C_SIZE - remainder; i != C_SIZE; ++i)
    {
        data[i] += data[i - 1];
    }

    That’s one thing, but thta’s a lot slower than if I have a second in memory that is 16byte aligned and “one ahead” of the other.
    for (size_t i = 1; i != C_SIZE - remainder; i += 4)
    {
        __m128i left  = _mm_load_si128( (__m128i *)&data[i] );
        __m128i right = _mm_load_si128( (__m128i *)&data2[i] );
        _mm_store_si128( (__m128i *)&data[i], _mm_add_epi32(left, right) );
    }

    for (size_t i = C_SIZE - remainder; i != C_SIZE; ++i)
    {
        data[i] += data[i - 1];
    }

    I wouldn’t write either that way though. I have too many architecture questions to worry about. This code is not portable. It’s fragile, but it has it’s place.
    That aside, the Intel Compiler will catch if I just have two data structures, one replicated and “one ahead” and both 16 byte aligned, and turn:

    for (size_t i = 1; i != data.size(); ++i) {
        data[i] += data[i - 1];
    }

    into something beautiful (most of the time)

    So, it’s a “use the right tool for the job” argument, in my mind. Each language has it’s place, sometimes you get more stuff for free in other languages. BUT, I think your example is slightly faulty because you don’t provide an avenue with how you write your algorithm – if you really are trying for speed of cumulative sum – for the C/C++ compilers to do all they can for you, for free.

  50. @Keck

    Have you tried a good C or C++ implementation using intrinsics?

    No. I also did not look at a solution using assembly code.

    Though I did not examine all of the assembly code for all the compilers, my impression is that SSE is not used to optimize the main loops. As evidence, the numbers don't change if I add “-mno-sse2” to the compiler flags.

    Are you saying loop unrolling is not a “fair optimization” to compare or use?

    No but it is not a default switch because it can worsen performance. Anyhow, I do provide the results with -funroll-loops in my blog post.

    Is that not more than 10%

    That’s 10.8%. Literally speaking higher than 10%.

    it’s another thing to have an algorithm that is not the optimal way of doing things in one language only to say, “look it’s not much better than language X.”

    These three lines of code are how the majority of C++ programmers would write it. The use of intrinsics you describe is not common C++.

  51. Keck says:

    @Daniel Lemire

    No doubt it’s not normal C++, that doesn’t mean the compiler can’t do it or there aren’t developers that can’t. ICC and even VS 2010, and definitely GCC 4.7 will optimize that loop for such a simple sum operation to some SSE2 if the architecture supports it, anything since 2000 pretty much. ICC will do a better job than either out of the box. BUT, I know for a fact that GCC 4.7 optimizes that loop very differently than GCC 4.5, and that’s most likely for two reasons: GCC 4.5 was released before his your processor, also GCC put in a lot of SSE and vectorization in GCC 4.6/GCC 4.7.

    So, it’s not “normal” but it’s possible. So it’s still a “pick the best tool for the job,” and as the C++ compilers get better, yes people work on that one too to make it better all the time not just Java, it’ll catch more and more of these types of optimizations. Heck, even microsoft from vc2008 to vc2010 did a ton of work in this.

    My point was this particular example isn’t necessarily the best algorithm to prove what the compiler can do by itself to optimize your code, because of how you’ve decided to do the sum. For instance, the Intel Compiler even picks up your version as being aliased differently and writes better assembly that uses some shuffling to align as it goes in a different register that it determined it didn’t need to use to do the sum operation.

    1785 to 2000 is not 10.8% difference, or am I missing something? I do think when I’m benchmarking things in a loop that loop optimizations for the benchmark do matter. ALSO, I can pragma in the optimization just for the loop I want and not have everything subject to the unrolling.

    Loop unrolling isn’t a standard optimization, most definitely because it can worsen performance. Pick a low prime number(I say prime because I’ll assume the Duff’s device is going to try 4, 8, 16, 32 and I’m just covering my bases) and watch it really make it worse. But, weren’t you comparing this in general over an extreme example? Unless you’re in the science field, when are you writing a program that is going to do a million cumulative sums? Then, you’re more likely to pick your language based on the libraries available to it. I might even go fortran if I’m doing Digital Signal Processing and my client needs/uses Column Major Format. It’s almost always going to be “best tool for the job,” or what libraries do I have for this that make my life easier.

    My point is the example you picked has a hole with the compilers you used that hides some of the performance difference. I’d try a test with all languages/compilers you care to that does something that all should be more equally good at. Otherwise, it’ll always come down to the problem you are solving.

  52. @Keck

    1785 to 2000 is not a 10.8% difference,

    (2000-1785)/2000 = 0.1075.

    ALSO, I can pragma in the optimization just for the loop I want and not have everything subject to the unrolling.

    I tried it in various ways but it failed to work. Can you provide the code that does it?

    (…)when are you writing a program that is going to do a million cumulative sums?

    Image processing, data compression, numerical analysis, data mining, machine learning, business intelligence… Basically, most areas where performance matters.

    My point is the example you picked has a hole with the compilers you used that hides some of the performance difference.

    This is an example to illustrate my point, but not my evidence.

  53. Keck says:

    @Daniel Lemire

    I’m sorry, I typed one thing and thought another. yes, 1785 is 10.8% difference from 2000. I lined up those values wrong 🙂

  54. As I’m learning Go at the moment I converted the code here:

    https://gist.github.com/3172562

    Interestingly it runs at almost identical speed to the “basic sum (C++-like)” using gcc 4.6.3 and g++ -funroll-loops -O3 -o cumuls cumuls.cpp

    Go’s compiler isn’t particularly well optimised at the moment but I thought it did OK here.

  55. Keck says:

    “when are you writing a program that is going to do a million cumulative sums?”
    “Image processing, data compression, numerical analysis, data mining, machine learning, business intelligence… Basically, most areas where performance matters.”

    Perhaps my question is better put this way: “When are you going to write a program that does a million cumulative sums where you care enough to benchmark but not enough to use the best language for it?”

    These are the fields I work in… I've done DSP and image processing/manipulation for my entire career. So yes, I do it daily, but most people don't. I feel your point is better proven on more common tasks. On many non-scientific projects, the gap between the languages is probably even closer!

    I don’t even disagree with your point. When I sit down and write prototypes, I use numpy and Python… I’ve delivered things using this because it performed well enough.

    I’m just saying on a purely mathematical example, the way the code is written keeps some of the better compilers from optimizing this to a degree where the difference is larger out of the box. I just wanted to make you, possibly you already were aware, that this is actually where the most improvement is being seen/focused on in C/C++ compilers at the moment. Make this floating point math, and do the 10 * log(x^2 + y^2) (Amplitude in dB) in java and C/C++ and show me the speed difference …

    The loop unroll pragma is just “#pragma unroll.” It's in extensions to GCC, and in MSVC and ICC by default.

  56. Keck says:

    I feel C++ is worth it when it's worth it. It's not worth it when it's not. The same is true of Java over Scala at times, of C over Ada, or of C over Fortran. The same is true of “how well do I know this language, and how hard is my problem in it?” over…

    It's completely based on the problem, and on knowing which language is better for which type of problem. I mean, C/C++ is worth it for me daily. I don't have a good Java FFT library yet … (If I do, it's a C library wrapped for Java.) All these languages have their place! “Languages of the World, unite! (for the job you are best at)”

  57. @Keck

    Java is used extensively in data mining, machine learning, business intelligence, information retrieval and so on. In another life, I even wrote a high-performance Java CODEC for wavelet-based compression.

    The loop unroll pragma is just “#pragma unroll.” It's in extensions to GCC

    Yes, or you can do it with function attributes, but it does not work. It never worked for me, and I could never find documented examples on the web of pragmas being used to selectively apply loop unrolling. I use pragmas for other purposes, such as disabling warnings, but selectively tweaking optimization this way does not work in C++, in my experience.

  58. @Keck

    I don’t have a good Java FFT library yet …

    So your C++ FFT library is consistently much more than twice as fast as JTransforms ( https://sites.google.com/site/piotrwendykier/software/jtransforms )? You should let the JTransforms guys know, because they think otherwise.

  59. Keck says:

    Actually, I use CUDA for FFTs when I can anyway; for instance, we have one system with 5 GPUs where we use CUFFT 4 and split the work up on our 3211264-point FFTs, because we have a 10 ms processing window for that particular project. And CUDA 2.0 vs CUDA 4.0 is night and day for FFT and matrix multiplication performance. This is a bad comparison anyway, since it's not a very good CUDA card next to what's available now for FFTs that large. The difference has probably only gotten worse.

    They also used GCC 4.2.3 for their comparisons? Why?

    Wait, GCC 4.2.3, but a newer Xeon processor, in a benchmark of baseline optimizations? (OK, let's cook the books from a generic-optimization standpoint by throwing a better CPU into the mix, but not a more recent version of GCC to go with it.) Oh, and let's pick big sizes and use 64-bit Java too.

    … yes, I would bet the latest MKL FFT library – if I have to stay entirely in the CPU world – would beat it by more than 2x. Even with GCC 4.2.3 it beat it by 2x most of the time, and it especially beat it by 2x at the FFT sizes that matter most for image/audio processing.

    Especially if I have a version that takes advantage of Sandy Bridge/Ivy Bridge stream processors…

    I do, I absolutely, most definitely disagree with JTransforms and their benchmarks from last year, which used a compiler that was three years old at the time. The gap in this specific realm has gotten much wider recently. This is not a good example for your case…

  60. David Stocking says:

    You don’t have a iterator based C++ version even though it basically compiles to the C version … but its still important.

    http://pastebin.com/Z1UT0pEA

    That being said I think it will be an amazing day when we don’t use C/C++ because of speed.

  61. @Keck

    They also used GCC 4.2.3 for their comparisons? Why?

    So you think it is unfair to compare a version of Java released in 2006 with a version of GCC released in 2008?

    Well, yes. I would say it is unfair to Java.

    Note that in all my tests here, I have used Java 6, an *old* version of Java (I did not test Java 7), compared with the latest and neatest C++ compilers (clang 3.1 and gcc 4.7).

    Java 7 is substantially faster (on some benchmarks) than the version of Java I have used for my tests:

    http://geeknizer.com/java-7-whats-new-performance-benchmark-1-5-1-6-1-7/

    Strange that you did not call me out on this unfairness?

  62. @David Stocking

    Great idea. The iterator-based approach is actually faster than the C version in my tests under GCC 4.7, but much slower under clang.

    I’ve added it.

  63. Vivek Haldar says:
  64. Keck says:

    @Daniel Lemire

    http://www.readwriteweb.com/hack/2011/06/cpp-go-java-scala-performance-benchmark.php

    https://days2011.scala-lang.org/sites/days2011/files/ws3-1-Hundt.pdf

    It matters much more what's in a version than how old it is, especially when one version specifically addresses something that changes this particular discussion a lot. I'm aware Java 1.7 has had a lot of performance increases. I'm positive that compared to newer versions of FFTW, with a much newer compiler, the margin will be even worse against JTransforms, even with Java 1.7. Also probably even worse against MKL. I won't pit it against CUDA, because that's not a fair test at all.

    But you seem very sure that JTransforms will not be more than 2x as slow… In things like this where performance matters, 2x as slow is probably unacceptable for real-time analysis. Heck, sometimes 10% is unacceptable, or even 1%. I'm saying there is a purpose to all these things, and that your specific example does not prove your point well in my mind.

    I'm not even arguing against your point overall. I often use a variety of languages, based purely on productivity. But I'm saying the example you used is not a great one. But you seem pretty certain.

    So I'm wondering: if I show you that JTransforms on Java 1.7 is more than 2x slower than, say, a more modern C++ compiler on either FFTW or MKL (which can use FFTW as its implementation; I just have access to it under a commercial license), what would that show you?

  65. @Keck

    I’m saying your example you used is not a great one. But, you seem pretty certain.

    I’m not sure how my blog post could be clearer as to how I view my example: “Of course, from a sample of a 3 compilers on single problem, I only provide an anecdote (…)”.

    I have used this example because it is simple. Everyone can understand it and tweak it.

    Implementing FFT efficiently is hard work. Not everyone can see what is going on. Benchmarking entire applications, for the scope of a blog post, is even crazier… nobody would ever go through my blog post.

    So, I’m wondering, if I show you that JTransforms on java 1.7 is more than 2x slower than say a more modern C++ compiler on either FFTW (…), what would that show you?

    It will support the following statement I made: “Really clever and hard working C/C++ programmers can beat higher-level languages by a wide margin given enough time. ”

    I’ll offer you a test that could falsify the claim of my blog post. We take ten undergraduates. We ask all of them to implement, from scratch, the FFT in both Java and C++. Then we benchmark. Are you still willing to bet that the C++ versions will be more than 2x faster?

    I actually predict that some of the fastest implementations will be in Java. I predict that among the flawed implementations, most of them will be C++ ones. (I also predict that if you benchmark memory usage, the Java versions will be dead.)

  66. @Keck

    And the paper you link to makes my point in a different way:

    We find that in regards to performance, C++ wins out by a large margin. However, it also required the most extensive tuning efforts, many of which were done at a level of sophistication that would not be available to the average programmer.

    Compare with what I wrote:

    Conclusion: If your sole reason for using C++ is speed, and you lack the budget for intensive optimization, you might be misguided.

    That is, I am saying the same thing.

    If you have a limitless budget, C++ can be much faster. No question about it. But people, in real life, don't have limitless budgets. So either you improve the algorithm (possibly gaining 10x or 100x), or you hack low-level. In real life, you can't do both.

    Sure, people can spend 20 years optimizing the FFT, and that’s important, but few people will have the luxury to optimize their code for 20 years.

  67. Keck says:

    @Daniel Lemire

    I think somehow we are disconnecting while saying the same thing in some respects. I don't disagree with your overall comments. I disagree with that specific example, for a very specific reason.

    Now, I do fervently believe in this quote about optimization:
    1) Don't optimize.
    2) For experts: don't optimize yet.

    I'm paid to optimize, and I don't want to do it half the time. Sometimes you have to bleed for minuscule gains. My point is more that if you have to, your options are less limited in C/C++ for getting that extra bit, and if you're near the wall to begin with, then Java might not be the best choice. That doesn't mean I don't like Java, or that they haven't done a lot with the VM, or that it's not a great language that can most definitely be the best choice for certain tasks, based on developer strengths.

    I believe that probably nearly every Java FFT program by an undergrad will be faster than the C/C++ version. That is especially true if they are not allowed to use Google – the Cooley-Tukey examples out there are “okay”, and by okay I really mean bad. Or implementations of the radix-2 butterfly FFT for C/C++ that, if a parallel algorithm were written in Java, would likely be a very non-optimal solution.

    That aside, I just have a problem with this specific example. Google found their findings to be very much in line with mine as well. I just feel like, yeah, it's good, it's getting closer in some things. C++ is still worth it, in some cases.

  68. @Keck

    For the record, I hate Java as a language. I much prefer C++: I love templates. Java is overengineered and too low-level. I only program in Java because it is a lingua franca.

    I say this to make it clear that I am not a Java fan boy.

  69. Keck says:

    @Daniel Lemire

    I agree with it being the lingua franca. I'll also agree that at times it's tough for me to see the forest for the trees, as I live in a world where I cannot have anything slow, ever. I, like you, have Mathematics and Computer Science degrees – I just ventured out of academia after my Masters because a lot of money got thrown at me… sellout!

    I love templates, especially C++11 style, and template metaprogramming for FFTs is delicious for small sizes. Your compile times are horrendous, but it's definitely been a blessing on a few implementations where I couldn't use any third-party code and had a very tight timeline.

    I'm not a Java fanboy, clearly. I'm actually a pretty “this language works for this” fanboy. I just get paid to use C/C++ as my dominant language – usually with no choice, as for some things I can't even use other languages, since the language is part of the specification. I don't disagree with your findings. It's that particular example: recent compilers have made a leap on auto-vectorization. So, at this stage in the game, it's back to the VM languages to make similar leaps and close that gap. Things that leverage AVX and double-precision FFTs would only strengthen the FFT argument.

  70. Branimir says:

    Seems that gcc 4.7, gcc 4.8 and clang 3.1 have a bug in the optimizer for 64-bit versions, as on my machine they get around 550,
    while gcc 4.6.3 gets around 2000, and
    Java around 2200 (it seems that Java unrolls a bit).
    That is for the plain array loop, but
    if a vector is used, 64-bit clang cannot optimize it, while 32-bit can. No version of gcc can optimize the vector loop.
    If the loop is written this way:

    static void sum(std::vector<int>& data) {
        if (data.empty()) return;
        for (unsigned i = 0; i != data.size() - 1; ++i)
        {
            data[i + 1] += data[i];
        }
    }

    all gccs can optimize it, but they insert redundant code at the beginning of the loop, slowing it down to 1600.

    I have an i5 3570K @ 4 GHz

  71. Yann says:

    Hi,

    Found the post very interesting: remove all the complexity and see what we can conclude from what remains.

    So I took the opportunity to see what Julia, an LLVM-based language, could achieve. It turns out it runs at 25% of the speed of C++ when coded directly, and 70% when adding some types. I didn't try to optimize any further than that.

    The code is also much easier to read, and so much faster to write.

    Check it out here:

    https://sites.google.com/a/hpu4science.org/hpu4science/projects/brownianCoding/integercrunchingwithjulia

    Yann

  72. Vinicius Miranda says:

    Dear,

    First, I would like to say that this is a very good discussion. Now, may I add 3 important points about your code.

    (1) size() is an expensive operation. For example, Scott Meyers says in one of his books that it is a good idea to replace size() by empty() when we just want to check whether a vector has size() > 0.

    (2) operator[] on vectors has an implementation that makes loop vectorization by the compiler difficult. According to Georg Hager (Introduction to High Performance Computing for Scientists and Engineers), we should always use iterators (which, for vectors, amount to C pointers) in loops to allow better loop vectorization by the compiler.

    (3) loops with this kind of dependence (where element i depends on element i-1) carry a huge performance penalty according to Georg Hager, because the compiler cannot do loop unrolling and vectorization.

    Using these 3 ideas, why don't you try using an iterator (the STL always uses iterators) and a temporary? Example:

    double temp(0.0);

    for (std::vector<double>::iterator ia = data.begin(); ia != data.end(); ++ia)
    {
        temp += (*ia);
    }

    or something analogous.
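
    For the cumulative sum itself (the loop above only accumulates a running total without writing it back), a sketch along the same lines:

    #include <vector>

    void cumulativeSum(std::vector<double>& data)
    {
        double acc = 0.0;
        for (std::vector<double>::iterator it = data.begin(); it != data.end(); ++it)
        {
            acc += *it;
            *it = acc; // write the running total back in place
        }
    }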

    Best
    vinicius

  73. Wilma says:

    Surfing around stumbleupon.com I noticed your site bookmarked as “Is C++ worth it?”. Now I'm assuming you bookmarked it yourself and wanted to ask if social bookmarking gets you a good deal of targeted visitors? I've been thinking about doing some bookmarking for a few of my websites but wasn't certain if it would produce any positive results. Appreciate it.

  74. @Wilma

    I do not “bookmark” my own blog posts though I do share them through my Twitter and Google+ accounts.

    I don’t think there is much point trying to spread your content through several systems. It is best to focus on writing good content.

I love Java … But if you want to see C++ smoke Java's ass, try compiling the same code with the Intel compiler:

    http://software.intel.com/en-us/intel-parallel-studio-xe/

  76. @Edward

    Do you have any benchmark results with the Intel compiler?

  77. Itman says:

    @Daniel,

    In my experience it’s 10-20% better. I would not call it exactly ass burning.

  78. Branimir says:

    Intel Fortran is pretty good, but Intel C++ is not better than g++ 4.7.
    I have tried some benchmarks from the language benchmarks site and found
    that Intel is slower on average than gcc.
    Perhaps Intel is optimized for special
    cases, but in the general case I found it
    slower.
    In this particular case, for example:

    static void sum(std::vector<int>& data)
    {
        if (data.empty()) return;
        for (unsigned i = 0; i != data.size() - 1; ++i)
        {
            data[i + 1] += data[i];
        }
    }

    bmaxa@maxa:~/examples$ icc -fast Cumul.cpp -o Cumul
    bmaxa@maxa:~/examples$ ./Cumul
    last 262486571 time 534.530682061
    last 262486571 time 526.421073688
    last 262486571 time 527.175918604
    last 262486571 time 534.159500027
    last 262486571 time 533.942739981
    last 262486571 time 531.575590049
    last 262486571 time 534.456404391
    last 262486571 time 533.122927485
    last 262486571 time 534.222279206
    last 262486571 time 535.011128231
    bmaxa@maxa:~/examples$ g++ -O3 Cumul.cpp -o Cumul
    bmaxa@maxa:~/examples$ ./Cumul
    last 262486571 time 1600.051201638
    last 262486571 time 1523.600572874
    last 262486571 time 1561.377759735
    last 262486571 time 1568.479829349
    last 262486571 time 1525.413387028
    last 262486571 time 1579.080343608
    last 262486571 time 1590.684948939
    last 262486571 time 1592.305977517
    last 262486571 time 1587.856076725
    last 262486571 time 1578.880889226

    bmaxa@maxa:~/examples$ java Cumul
    last 262486571 time 2083.3333333333335
    last 262486571 time 2272.7272727272725
    last 262486571 time 2272.7272727272725
    last 262486571 time 2272.7272727272725
    last 262486571 time 2272.7272727272725
    last 262486571 time 2272.7272727272725
    last 262486571 time 2173.913043478261
    last 262486571 time 2272.7272727272725
    last 262486571 time 2173.913043478261
    last 262486571 time 2173.913043478261

    Java blows both icc and g++ out of the water…

@Daniel … Personally, I do not. However, I must admit I am by no means an expert in the field of compiler optimization.

    However, there is an extremely good post on the subject of branch prediction that provides some insight into how the ICC compiler optimizes code. (See user Mysticial's answer at http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array)

    I'd love to hear your opinion on that post as well, as apparently there are specific use cases where the ICC compiler is on par with or slower than GCC, as @Branimir has shown…

@Itman … I admit, using ‘smokes Java's ass' was a poor choice of words. Nevertheless, if we were in a ten-mile marathon, a ten percent win would probably qualify as being ‘smoked' 😉

  81. Itman says:

    In a marathon, yes. But if you are in a triathlon, being 10% faster at running does not automatically bring you victory.

@Itman … True – there are a lot of different use cases to consider, e.g. the function of the application, as in zero-tolerance systems (air-traffic control, emergency services, etc.) vs. a generic user-space desktop app…

  83. Oleg says:

    try this one (I got the best time with it):

    void pointerStraightsum(short * data, size_t size)
    {
        if (size == 0) return;
        const short* const end = data + size;
        short* it = data;
        short s = *it++;

        const short* const end8 = data + ((size - 1) / 8) * 8;
        while (end8 > it)
        {
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
        }
        while (end != it)
        {
            s += *it;
            *it++ = s;
        }
    }

  84. Oleg says:

    #include <QCoreApplication>
    /**
    * Fast computations of cumulative sums.
    * D. Lemire, July 2012
    *
    * Best results with:
    *
    * $ g++-4.7 -funroll-loops -O3 -o cumulsum cumulsum.cpp
    * $ ./unrolldeltas
    *
    * but we want to avoid the -funroll-loops flag.
    */

    // This may require GCC 4.4.
    // (The header names were eaten by the blog engine; the list below is
    // reconstructed from what the code actually uses.)
    #include <algorithm>
    #include <cstddef>
    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <numeric>
    #include <vector>

    using namespace std;

    class WallClockTimer {
    public:
        clock_t start;
        WallClockTimer()
        {
        }
        void reset()
        {
            start = clock();
        }
        int elapsed()
        {
            return (clock() - start);
        }
        int split()
        {
            return elapsed();
        }
    };

    short* givemeanarray_simple(size_t N) {
        short* bigarray;
        bigarray = new short[N];
        for (unsigned int k = 0; k < N; k++)
            bigarray[k] = (k + k/3);
        return bigarray;
    }

    vector<short> givemeanarray(size_t N) {
        vector<short> bigarray;
        bigarray.reserve(N);
        for (unsigned int k = 0; k < N; ++k)
            bigarray.push_back(k + k/3);
        return bigarray;
    }

    // Reconstructed: the blog engine swallowed this function's body; it is
    // the std::partial_sum variant reported by test() below.
    void stdPartialSum(short * data, size_t size) {
        partial_sum(data, data + size, data);
    }

    void pointerStraightsum(short * data, size_t size) {
        if (size == 0) return;
        const short* const end = data + size;
        short* it = data;
        short s = *it++;
        const short* const end8 = data + ((size - 1) / 8) * 8;
        while (end8 > it)
        {
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
            s += *it; *it++ = s;
        }
        while (end != it)
        {
            s += *it;
            *it++ = s;
        }
    }

    void straightsum(short * data, size_t size) {
        if (size == 0) return;
        for (size_t i = 1; i != size; ++i) {
            data[i] += data[i - 1];
        }
    }

    void slowishSum(vector<short> & data) {
        if (data.size() == 0) return;
        for (size_t i = 1; i != data.size(); ++i) {
            data[i] += data[i - 1];
        }
    }

    // By Vasily Volkov, improved by Daniel Lemire
    void fastSum(vector<short> & data) {
        if (data.size() == 0) return;

        const size_t UnrollQty = 4;
        size_t sz0 = (data.size() / UnrollQty) * UnrollQty; // equal to 0, if data.size() < UnrollQty
        size_t i = 1;
        if (data.size() >= UnrollQty) {
            short a = data[0];
            for (; i < sz0 - UnrollQty; i += UnrollQty) {
                a = (data[i] += a);
                a = (data[i + 1] += a);
                a = (data[i + 2] += a);
                a = (data[i + 3] += a);
            }
        }
        for (; i != data.size(); ++i) {
            data[i] += data[i - 1];
        }
    }

    // loop unrolling helps avoid the need for -funroll-loops
    void sum(vector<short> & data) {
        if (data.size() == 0) return;
        size_t i = 1;
        for (; i < data.size() - 1; i += 2) {
            data[i] += data[i - 1];
            data[i + 1] += data[i];
        }
        for (; i != data.size(); ++i) {
            data[i] += data[i - 1];
        }
    }

    void test(size_t N, const int experiments) {
        WallClockTimer time;
        vector<short> data = givemeanarray(N);
        vector<short> copydata(data);

        time.reset();
        for (int i = 0; i < experiments; ++i)
            stdPartialSum(&data[0], N);
        cout << "std::partial_sum " << N * experiments / (1000.0 * time.split()) << endl;

        time.reset();
        for (int i = 0; i < experiments; ++i)
            pointerStraightsum(&data[0], N);
        cout << "pointer straight sum (C-like) with pointers " << N * experiments / (1000.0 * time.split()) << endl;

        time.reset();
        for (int i = 0; i < experiments; ++i)
            slowishSum(data);
        cout << "basic sum (C++-like) " << N * experiments / (1000.0 * time.split()) << endl;

        data = copydata;

        time.reset();
        for (int i = 0; i < experiments; ++i)
            sum(data);
        cout << "smarter sum " << N * experiments / (1000.0 * time.split()) << endl;

        data = copydata;

        time.reset();
        for (int i = 0; i < experiments; ++i)
            fastSum(data);
        cout << "fast sum " << N * experiments / (1000.0 * time.split()) << endl;
    }

    int main(int argc, char *argv[])
    {
        QCoreApplication a(argc, argv);
        cout << "reporting speed in million of integers per second " << endl;
        size_t N = 50 * 1000 * 100;
        const int experiments = 100;
        test(N, experiments);
        return a.exec();
    }

  85. Oleg says:

    /*
    * To change this template, choose Tools | Templates
    * and open the template in the editor.
    */
    package cumulativesum;

    /**
    *
    * @author Oleg
    */
    public class CumulativeSum {

        public static void sum(int[] data) {
            if(data.length == 0) return;
            for(int i = 1; i != data.length; ++i)
                data[i] += data[i-1];
        }

        public static int[] givemeanarray(int N) {
            int[] bigarray = new int[N];
            for(int k = 0; k < N; ++k)
                bigarray[k] = k + k/3;
            return bigarray;
        }

        public static void test(int N, int experiments) {
            int[] data = givemeanarray(N);
            long bef = System.currentTimeMillis();
            for (int i = 0; i < experiments; ++i)
                sum(data);
            long aft = System.currentTimeMillis();
            System.out.println( N * experiments / ( 1000.0 * (aft-bef) ) );
        }

        public static void main(String[] args) {
            final int experiments = 100;
            test( 50 * 1000 * 100, experiments );
        }
    }

  86. Oleg says:

    In the last two comments I changed your code a little (N is 10 times smaller, but there are 100 experiments, and I report the average performance).

    I tested the C++ code with MinGW 4.4.

    Java result = 411.
    C++ results:
    1779
    2475
    410
    781
    2008

  87. Oleg says:

    faster sum in Java:

    public static void sum(int[] data) {
        if(data.length == 0) return;
        int s = data[0];
        for(int i = 0, end = data.length - 1; i != end; ) {
            s = (data[++i] += s);
        }
    }

  88. Oleg says:

    faster C++ sum:

    typedef short NUM; // assumption: the comment never showed the NUM definition

    void pointerUnrolledSum(NUM * data, size_t size)
    {
        if(size == 0) return;
        const NUM* const end = data + size - 1;
        NUM* it = data;
        NUM s = *it;

        const NUM* const end8 = data + ( (size-1) / 8 ) * 8;
        while ( end8 > it )
        {
            s = (*(++it) += s);
            s = (*(++it) += s);
            s = (*(++it) += s);
            s = (*(++it) += s);
            s = (*(++it) += s);
            s = (*(++it) += s);
            s = (*(++it) += s);
            s = (*(++it) += s);
        }
        while ( end != it )
        {
            s = (*(++it) += s);
        }
    }

  90. Professor says:

    (2000 – 1785)/1785 gives approx 12% improved performance

  91. Dmitry Chichkov says:

    Straighter/faster C sum:

    void straightersum(int * data, size_t size) {
        if(size == 0) return;
        int *end = data + size;
        int *prev = data;
        for(int *p = data + 1; p < end; p++)
        {
            *p += *prev;
            prev = p;
        }
    }
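
    (For comparison, a minimal sketch of the same in-place cumulative sum via the standard library; std::partial_sum explicitly permits the output range to coincide with the input:)

    #include <cstddef>
    #include <numeric>

    // in-place cumulative sum: the destination iterator may equal the source
    void partialSum(int * data, size_t size) {
        std::partial_sum(data, data + size, data);
    }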

  92. Abdurrahim says:

    The C++ way of doing this should be like this (because you don’t modify the size of data, right?):
    for (size_t i = 1, end = data.size(); i != end; ++i)
    {
        data[i] += data[i - 1];
    }
    and even more:
    ######################################
    // (headers reconstructed; the originals were lost in posting)
    #include <windows.h>   // DWORD, GetTickCount, ZeroMemory
    #include <iostream>

    using namespace std;

    int main()
    {
        DWORD Last, Now;

        volatile int * volatile data = new volatile int[100000000];
        ZeroMemory((void*)data, 100000000);

        Last = GetTickCount();

        for(int n = 0; n != 500000000; ++n)
        {
            for(volatile int *i = data + 1, *old = data, *end = data + sizeof(data) / sizeof(data[1]);
                i != end; ++i, ++old)
            {
                *i = *old + 100;
            }
        }

        Now = GetTickCount();

        delete data;

        std::cout << "Milliseconds passed = " << Now - Last << std::endl;

        return 0;
    }

    in vs2008 this gave me: 422 milliseconds.

    #######
    and the Java here:
    ################
    /*
    * To change this template, choose Tools | Templates
    * and open the template in the editor.
    */
    package javaapplication1;

    import java.util.Date;

    /**
    *
    * @author Abdurrahim
    */
    public class JavaApplication1 {

        /**
        * @param args the command line arguments
        */
        public static void main(String[] args) {
            // TODO code application logic here

            int[] data = new int[100000000];

            long Last = System.currentTimeMillis();

            for(int n = 0; n != 500000000; ++n)
            {
                for (int i = 1; i != data.length; ++i) {
                    data[i] += data[i - 1];
                }
            }

            long Now = System.currentTimeMillis();

            System.out.println(String.format("Milliseconds passed = %d",
                Now - Last));
        }
    }
    ####
    Still runs :)!

    If I set n to 50000 and the data size to 20000, Java gave me a 502 ms run.
    Now I ask you, ‘is-cc-worth-it’?

  93. Dominic Amann says:

    You are all comparing trivial operations that can easily be optimized in either C++ or Java.

    Doing real-world work that involves serious math and memory allocation on both the stack and the heap would be a much more useful test – such as an FFT on a 2D array of amplitude data.

  94. Dominic Amann says:

    I know it was discussed – I was just responding to the latest post. I also mentioned memory allocation as a vital factor in comparing the languages.

    Plus, no one has shown me that some Java library is not actually just a wrapper for an underlying C library. Here I am just asking.

  95. @Dominic Amann

    Please read the thread of comments, we do discuss the FFT specifically.

  96. Itman says:

    @Abdurrahim, there is apparently a bug in your code. Try printing:

    sizeof(data) / sizeof(data[1])

    Your code would have been even faster if you had a 32-bit CPU :-))

    This is why a language like Java is so much better than C++! It is harder to shoot yourself in the foot.
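
    (A minimal sketch of the pitfall, for anyone following along:)

    #include <iostream>

    int main() {
        int * data = new int[100000000];
        // data is a pointer, not an array: sizeof(data) is the pointer size
        // (8 bytes on a typical 64-bit system), so this prints 2, not 100000000.
        // On a 32-bit system it would print 1 - hence the joke above.
        std::cout << sizeof(data) / sizeof(data[1]) << std::endl;
        delete[] data;
        return 0;
    }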

  97. Abdurrahim says:

    @Itman
    You are right! ‘Shoot yourself in the foot’ – yes, I agree. I use Java too, but when it comes to performance, Java is not such a good choice. In C++, merely adding ‘#pragma omp …’ can make your loop multi-threaded, and that is only one aspect of performance… Anyway, C++ is a minefield, and I hate it on that point.
    One more thing:
    ‘Most C++ programmers would implement it as follows:’
    memmove (data, data+1, size - 1);
    Thanks.
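
    (For what it’s worth, a minimal sketch of what that pragma buys you on a loop without a carried dependence; the cumulative sum itself reads what the previous iteration wrote, so it would need a parallel scan rather than this one-liner:)

    // compile with: g++ -fopenmp ... (or /openmp with MSVC)
    #include <cstddef>

    long long total(const int * data, size_t size) {
        long long s = 0;
        // each thread sums a chunk; the reduction combines the partial sums
        #pragma omp parallel for reduction(+:s)
        for (long long i = 0; i < (long long)size; ++i)
            s += data[i];
        return s;
    }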

  98. I ran every function 150 times, parsed the measurement output, and got the following plot: http://dobrokot.ru/pics/i2013-01-20__16-15-15_53kb.png

    As you can see, there is a strange clustering of times.

    The second strange thing: “basic sum (C++)” has the same performance as the smartest hand-unrolled implementations. g++ 4.3.4 was used.

    Plot code: https://gist.github.com/4578192

    …..
    for (int i = 0; i < 10; ++i) {
        cout << "============" << endl;
        test(N);
    }
    …..
    void test(size_t N) {
        for(int t = 0; t < 15; ++t) {
            cout << ":straight sum (C-like)  // prefix each label with a colon ":"
    …..

  99. Median is used for estimation (not mean).
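
    (A minimal sketch of that estimator over repeated timings – the median is much less sensitive than the mean to outliers such as a background process or a paging hiccup:)

    #include <algorithm>
    #include <vector>

    // median of repeated timing measurements, taken by value so the
    // caller's vector is left untouched (upper median for even counts)
    double median(std::vector<double> timings) {
        std::nth_element(timings.begin(),
                         timings.begin() + timings.size() / 2,
                         timings.end());
        return timings[timings.size() / 2];
    }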

  100. Beware of cout<<"X" << time() – you are possibly measuring the console output itself!!!

  101. 7 says:

    Holy hell, a language war going on for months now, over a narrow scope benchmark that is anything but conclusive.

    The Computer Language Benchmarks Game shows a much more conclusive result – in terms of CPU time, Java is anywhere from a tad slower to twice as slow as GCC-compiled C++. And GCC, by the way, is far from the most aggressively optimizing compiler: Intel’s compiler is about 10-20% faster, with much better loop unrolling and vectorization – which is also why it is the slowest C++ compiler in terms of compile time; it does a lot of compile-time work to ensure the best possible runtime performance.

    But then again, we have memory efficiency – where Java is simply awful. It often uses 20-30 times the memory that C++ uses. In a benchmark scenario this might not be an issue, but in a real-world multi-process scenario that memory consumption can easily stack up, exceed the capacity of the system, and force the machine to swap, which will cause Java performance to plummet, bottlenecked by the dreadfully slow HDD.

    So we have slightly better performance, a huge memory-efficiency advantage, and better platform support – C++ can compile to a native binary for almost any processor out there, while there are plenty of platforms for which Java has no VM and runtime implementation. Also, C++ doesn’t enforce OOP, which I happen to like.

    So regard the facts and stop this pointless arguing over a pointless benchmark.

  102. @7

    Holy hell, a language war going on for months now, over a narrow scope benchmark that is anything but conclusive.

    The Computer Language Benchmarks Game shows a much more conclusive result – in terms of CPU time Java is anywhere from a tad slower to twice as slow (…)

    To be fair to my post, I state that the example is nothing but an anecdote – I even repeat it later (I do not, nor have I ever, based my conclusions on 3 lines of code) – and then I link to the Benchmarks Game.

  103. Abdurrahim says:

    Sorry about one thing: I did not see the ‘+’ sign, so I went on about using ‘memlame’. What a shame 😀 I apologise!

  104. Foobar says:

    If it weren’t for manual memory management, C++ would just be another shit Java mock-up. Its main point is to solve a lot of problems for business apps; unfortunately, no one uses it for anything other than OOP.

    Java is what you get when you strap a cement tow to a forklift.

    C++ is what you get when you strap an old geezer into a rocket car.

    .Net is just Pagan shit!

  105. Max says:

    I’m very late to the discussion, but I wrote a C++ version that’s twice as fast as anything here, so I thought it was worth sharing.

    void maxSum(int * data, size_t size) {
        if(size == 0) return;
        int tmp = 0;
        for (size_t i = 0; i != size; ++i) {
            tmp += data[i];
        }
        data[size-1] = tmp;
    }

    Your compiler just can’t vectorize your loops if you’re changing the value you’re going to access on the next iteration. G++ seems to have a much easier time figuring out what you’re trying to do if you store the sum outside the loop.

    With -O3 -funroll-loops, I get about 3500 million operations per second, versus about 1700 for the C-ish version.
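
    (A minimal sketch of the distinction; note that the second loop, like maxSum, produces only the grand total rather than the running sums:)

    #include <cstddef>

    // loop-carried dependence: each iteration reads what the previous
    // iteration wrote, which blocks straightforward auto-vectorization
    void cumulative(int * data, size_t size) {
        for (size_t i = 1; i < size; ++i)
            data[i] += data[i - 1];
    }

    // independent accumulation into a register: easy to vectorize,
    // but it only produces the total
    long long totalOnly(const int * data, size_t size) {
        long long s = 0;
        for (size_t i = 0; i < size; ++i)
            s += data[i];
        return s;
    }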

    1. I don’t think your code solves the same problem.