Daniel Lemire's blog


The language interpreters are the new machines

32 thoughts on “The language interpreters are the new machines”

  1. @Jib

    I suspect your experience is common. I also get into this sort of trouble in C++: the optimizer beats my hard work routinely.

    My point is that this is even more true in Python or JavaScript.

  2. @Dan I have updated my blog post with a link to a C++ test which shows that a single pass through the data is, as expected, about twice as fast.

  3. @mr foo it

    It is much better if you try to reproduce my results, which you can do in seconds if you have a Mac or a Linux box. If not, you just need to install Python, which takes minutes.

  4. @Will I need to learn Go.

  5. Jib says:

    I actually ran into this over 15 years ago with C++ compilers. Optimizers were relatively new and improving rapidly from version to version. My really cool, highly efficient roll-your-own algorithms often ‘broke’ the optimizer, in that it could not figure out what I was trying to do and could not help me. Over one two-year period, I watched my code go from very fast compared to ‘dumb’ implementations to slower than ‘dumb’ implementations.

    The key was that the optimizers were taking advantage of the full instruction set of what were then state-of-the-art processors, whereas my algorithms assumed a much simpler architecture.

    Damn you Knuth!

  6. @Rafael

    Thanks, but your function, according to my tests, is several times slower than array.index(max(array)). In fact, it is slower than any of the alternatives on my machine. Can you publish your numbers and your code?

  7. @Hoffman

    I did not know about the array package: thanks! Unfortunately, I don’t get any speed gain by using a Python array instead of a list.

  8. Iain says:

    An aside not relevant to your main point: when using numpy, arrays have an argmax() method, which is much faster than any of these solutions. Actually I think numpy reinforces your point: I’m often doing multiple passes over arrays when not strictly necessary because it vectorizes better in numpy or Matlab.
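
    For illustration, a minimal sketch of the numpy route (the data here is arbitrary). np.argmax scans the buffer in compiled code, with no per-element interpreter dispatch:

    import numpy as np

    # Build a large array of random integers and locate its maximum.
    arr = np.random.randint(0, 10000000, size=10000000)
    print(arr.argmax())  # index of the first occurrence of the maximum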

  9. Itman says:

    @Jib, if a compiler can use SSE operations, then it can beat you. However, if you use SSE operations explicitly, you are likely to beat the compiler.

    @Daniel, this is a rather common issue nowadays. Loops are often prohibitively slow in scripting languages, because a built-in function may be orders of magnitude faster.

  10. Itman says:

    PS: However, this is not a new issue. Consider, e.g., SQL with procedural extensions.

  11. Dan says:

    Two reasons why this could be the case:

    First, if array.index() and max() are built-in functions, rather than interpreted ones, in a non-JIT language (like the Python 3 interpreter), then they will likely run much faster than the interpreted equivalent, even though the “algorithm” used by the others is more efficient. Presumably, this gap would close as the language interpreter gets faster or is JITed.
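
    To make this first point concrete, a small illustration (the function is my own, not from the post): disassembling an explicit loop shows the bytecodes the interpreter dispatches on every iteration, work that the builtin max() does inside a single C-level call.

    import dis

    def explicit_max(arr):
        m = arr[0]
        for x in arr:
            if x > m:
                m = x
        return m

    # Every opcode printed here is dispatched once per element by the
    # interpreter loop; max(arr) does the same comparisons in C.
    dis.dis(explicit_max)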

    Second, even in lower-level languages like C, this “may” be the case. If the array.index() and max() functions run in a way that is very predictable for CPU branch prediction and make efficient use of the processor cache (the cache is properly prefetched), while the alternative has lots of sporadic branches and/or accesses data in unpredictable ways (so that there is little or no prefetching), then the code which technically does more work would still be faster. I suspect this is the case with list comprehension and traversal, compared to simply scanning an array whose elements sit beside each other; lists are probably linked lists, a notoriously cache-unfriendly data structure.

    tl;dr: modern computers and languages are complex!

  12. mr foo it says:

    You only ran each pass once!? Or, even if you ran it multiple times, you only posted averages!?
    In order for the data to be useful, we also need you to post the individual samples so we can run them through statistical tests.
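
    For example, a minimal harness (the setup is illustrative) that reports raw samples rather than an average:

    import random
    import timeit

    arr = list(range(1000000))
    random.shuffle(arr)

    # timeit.repeat returns one timing per run, so the individual
    # samples can be published and fed to statistical tests.
    samples = timeit.repeat(lambda: arr.index(max(arr)), number=1, repeat=10)
    print(samples)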

  13. Will Fitzgerald says:

    package main

    // ArgMax returns the index and value of the largest element of arr.
    func ArgMax(arr []int) (index int, max int) {
        for i, n := range arr {
            if i == 0 || n > max {
                index = i
                max = n
            }
        }
        return
    }

    func main() {
        array := []int{1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 11113, 2, 1}
        i, max := ArgMax(array)
        println(i)
        println(max)
    }

    Accumulating the maxarg as the array is populated and transformed gives you an O(1) query later. Of course, it doesn’t work if you remove items, but if maxarg is your bottleneck, that might be a valuable constraint to impose.

    Alternatively, if maxarg is truly your bottleneck and you do remove elements from the array, then consider spending more memory and maintaining a heap of indices to array elements, sorted by their array element value, for a constant-time maxarg query at the cost of O(log n) add/remove.
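
    A minimal sketch of that idea in Python (the class and its API are my own invention; heapq has no delete, so this uses lazy deletion rather than a true O(log n) remove):

    import heapq

    class MaxargTracker:
        def __init__(self):
            self.values = {}   # index -> value for live elements
            self.heap = []     # (-value, index) pairs: a max-heap via negation
            self.next_index = 0

        def add(self, value):
            idx = self.next_index
            self.next_index += 1
            self.values[idx] = value
            heapq.heappush(self.heap, (-value, idx))  # O(log n)
            return idx

        def remove(self, idx):
            del self.values[idx]  # the heap entry is discarded lazily

        def maxarg(self):
            # Pop stale entries until the top of the heap is live.
            while self.heap:
                neg_value, idx = self.heap[0]
                if self.values.get(idx) == -neg_value:
                    return idx
                heapq.heappop(self.heap)
            raise ValueError("no elements")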

    Holistic algorithm enhancement like that can often trump all other solutions: the ideal number of times to scan the array is 0, right? 🙂

    Put another way, if you’re optimizing the wrong aspect of the problem, who cares which approach is faster?

  15. Michael Hoffman says:

    Python has some easy-to-use objects, but the overhead of dealing with them (dictionary accesses, function dispatch overhead, object creation) will often overwhelm any potential algorithmic advantage. To speed things up, you usually want to avoid these as much as possible. That’s why the array.index(max(array)) solution works so well. I don’t call my lists “array”, however, because there is also an array type in Python (in the array module). If you make “array” an array.array instead of a list, this is even faster on my benchmark.
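
    A minimal sketch of that variant (the sizes are arbitrary): array.array('i') stores C ints contiguously rather than as boxed Python objects, so max() and index() scan a tighter block of memory.

    import array
    import random

    values = list(range(10000000))
    random.shuffle(values)
    arr = array.array('i', values)  # typed, contiguous storage
    print(arr.index(max(arr)))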

    In the real world, NumPy is often the best way to eliminate this problem. If you convert array to a NumPy array and use its argmax() function, it is quite fast indeed.

  16. Rafael says:

    I think you left out the most obvious definition of it:

    def maxarg(arr):
        counter = 0
        arg = 0
        m = arr[0]
        for x in arr:
            if x > m:
                m = x
                arg = counter
            counter += 1
        return arg

    which runs about as fast as array.index(max(array)) (on Python 2.6, at least). Sometimes it’s a bit slower, sometimes a bit faster. That is indeed surprising, although understandable: it’s the language overhead. Your solutions have a huge amount of overhead (some of them may even be running at O(2n)); that’s why they are that much slower.

    BTW, that code that runs on JavaScript: does it run in the client? If so, it’s pretty obvious why the overhead isn’t the issue. Otherwise, it seems to me like a silly choice, and it would really impress me if the JavaScript were not the bottleneck, assuming they do anything other than read and write files (very light on processing) with it.

  17. Michael Hoffman says:

    I don’t get that result at all, Rafael. On Python 2.6.5, using the script at https://gist.github.com/1026313:

    The index/max version takes 1.62 s, but the maxarg() version takes 5.09 s.

  18. Will Fitzgerald says:

    Daniel, here’s a better-formatted sketch of ArgMax in Go: both a ‘generic’ version and one specific to ints that also returns the max value.

    https://gist.github.com/8856a9ce3b18c7550db7

  19. Dan says:

    @Daniel: Of course that code will be much faster: it’s not doing the same thing as what I suspect the Python version (even if Python were generating equally optimal code) is doing.

    Your code is running once through an array (whose elements are beside each other), which can be easily prefetched. Your index(max()) version runs through the array twice, which means that the memory (since N elements won’t all fit into the cache at once) must be loaded at least a second time. Of course the single traversal will be about twice as fast.

    The Python version, however, is performing a list comprehension, which presumably (though I could be wrong) is doing more work than simply traversing the list (i.e., building a second list as it traverses the first), much more work than your C++ implementation!

    The real reason there’s such a speed difference in Python, though, is builtins versus non-builtins.
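
    For instance, one comprehension-style maxarg (an illustrative variant, not necessarily the one from the post) really does materialize a second list before reducing it:

    def maxarg_comprehension(arr):
        # Builds a full list of (value, index) pairs, then reduces it:
        # two structures are traversed where one would do. Ties on the
        # value are broken by the larger index.
        return max([(x, i) for i, x in enumerate(arr)])[1]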

    I think that the new, higher-level languages still require good knowledge of how the machine works at a lower level. The difference is that you should always know your language and runtime as well, something cheerfully omitted in CS courses and language-learning books.

    Because of that, I’ve been suggesting to my professors that they make assembly part of the required courses. Not because we will use it to program, but because it opens a way to understand just what happens inside the runtime, instead of treating it like magic in the hands of a new apprentice or novice.

  21. @Rafael

    Thanks for posting your code and timings. I now realize I was wrong to compare a random shuffle (C++) with actual random bits (Python). Therefore, I have updated my blog post with a random shuffle like yours.

    As you can see from my updated numbers, your function is still slower. However, if I revert to Python 2.6, the gap is not so large. But it is still faster (on average) to do array.index(max(array)).

  22. @Itman

    I will come back to this topic. Yes, I think that comparing hash tables with linear scan is interesting.

  23. @Dan: the cost of traversing an array depends on where it is in the memory hierarchy, which is not accounted for in typical algorithmic analysis.

    For example, you could tile the search for the max and its index by iterating over segments of the array that fit in cache, finding the max and its index within each segment, and replacing the old values if necessary.

    This approach does go over the array twice, but only in cache; the array is read from main memory only once.

    It would be very interesting to figure out at which problem size this optimization ceases to outweigh the O() behaviour of a theoretically better algorithm.
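
    A minimal sketch of this tiling in Python (the tile size is an arbitrary stand-in for the cache size):

    def maxarg_tiled(arr, tile=4096):
        best_index, best_value = 0, arr[0]
        for start in range(0, len(arr), tile):
            segment = arr[start:start + tile]
            m = max(segment)  # first pass over the tile
            if m > best_value:
                best_value = m
                # second pass over the tile, which is still in cache
                best_index = start + segment.index(m)
        return best_index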

  24. Rafael says:

    @Daniel here at work my times were a little different from those at home. However, the difference was never as big as yours. Perhaps doing a shuffle and creating a string like you did makes the difference; I don’t know. That wouldn’t make much sense.

    $ python --version
    Python 2.6.5
    $ python test.py
    1.22238516808
    3877433
    1.50368499756
    3877433
    $ python test.py
    1.1917090416
    3333202
    1.50043797493
    3333202
    $ python test.py
    1.34272694588
    5674904
    1.49921607971
    5674904
    $ python test.py
    1.08860707283
    1109969
    1.53280496597
    1109969

    Code is at: http://rafael.kontesti.me/test.py

  25. Rafael says:

    Actually, I forgot I had set .py extension to be executed in that directory. Please refer to it at http://rafael.kontesti.me/test.txt

  26. Itman says:

    @All who started posting code and comparing times,

    In my opinion, you are missing the main point. In this case or another, your mileage may vary. However, a much simpler algorithm that seems to be inefficient may easily beat a more sophisticated one due to hardware and software specifics.

    Examples are numerous! To be more conclusive, IMHO, Daniel should have posted an example of a hash versus an unrolled search loop. In C++!!! It is very instructive to see that loops, especially unrolled ones, do beat hashes on small data sets. Same thing for binary searching.

    High-level scripting languages are often even worse in this respect. However, the general intuition is that built-in functions often work faster than explicit loops. Therefore, for mission-critical pieces of a program, one should measure performance to select the best variant.

  27. Itman says:

    @Daniel,
    It is not just hash tables. A lot of fancy tree structures (R-trees, KD-trees, BTKs, even regular binary trees) perform best when elements close to the leaves are stored in buckets and searched sequentially.

  28. Will Fitzgerald says:

    So, running Rafael’s benchmark, but setting the last value to the max:

    arr = range(10000000)
    random.shuffle(arr)
    arr[10000000 - 1] = 10000000 + 1

    argmax consistently beats out index(max)

    I also see it consistently winning on his benchmark. BUT NOT IF THE ARRAY IS UNSHUFFLED.

    This implies it’s not (just) the builtins, but other, perhaps CPU-related, things going on.

    One of the disadvantages here is *not* being able to peek at the assembly.

  29. With pypy, which packs a JIT, your version is way faster than array.index(max(array)) (on my machine, by a factor of 4).

    It’s quite well known among Python programmers that CPython has much faster “implicit” loops (e.g., max(l), l.index()) than “explicit” ones (for x in l), because implicit ones avoid much of the interpreter overhead. All these things change, though, when you have a JIT.

  30. Nice point.
    It’d be interesting to compare which is faster, “native” C or a C++ STL implementation like this:

    std::max_element(vec.begin(), vec.end()) - vec.begin();

  31. Also, how does it scale with changing array size?
    PS. Your CAPTCHA is really annoying. 🙂

  32. Randy A MacDonald says:

    This APL programmer suggests that using a higher-level language brings one _closer_ to the idealized machine.