This effect is due to possible aliasing. The compiler can’t be sure the array and the passed in value don’t overlap (i.e., that the referenced value isn’t one of the array elements). In the case of aliasing, the semantics of the two loops are not the same, since the value i will be modified by the assignment (once) in addition to the increment (maybe it’s even UB due to double-modifcation w/o sequence point). So the compiler emits this conservative code which reloads the value each time. In addition to slower code due to the in-memory increment, optimizations like vectorization are inhibited.
Travis Downssays:
Possible fixes:
1) Copy i into a local before the loop. This way the compiler knows aliasing isn’t possible since i is an unescaped local. You’ll have to copy i back at the end to maintain the semantics of incrementing k times.
2) Put the __restrict qualifier on eitheri or x, or both. Roughly speaking, this tells the compiler the values don’t overlap. It isn’t standard in C++, but is relatively widely supported, perhaps because it is standard in C.
3) Use different types. If the type pointed to byx is different than the type of i, most compilers use type based aliasing analysis (TBAA) to prove aliasing doesn’t occur.
The compiler could also save you, e.g., by checking whether aliasing occurs at runtime, eg checking whether the address of i falls in the range x – x + k. I’m fact, both clang and gcc do this for slightly simpler cases, where i is not incremented: if the runtime check passes, the fast loop is used, otherwise it falls back to the conservative one.
In my opinion, this is trickier than it sounds. Naively, I would think that ‘counter’ in the example below is of a type distinct from ‘uint64_t’… but I see the same bad optimization effect, possibly because the standard does not allow the compiler to forbid aliasing of this sort. And this says nothing about real C++ code where it is not always immediately obvious what the actual type is (due to the tendency of C++ programmers to abstract away types).
void incrementj(counter & i, uint64_t * x){
for(int k = 0; k < 1000000; k++) {
x[k] += i.val++;
}
}
mesays:
Does using a local copy of i, and copying the final value back to the reference make a difference? So uint64_t j = i; for (; j < size; j++) ... i = j;?
You mention that Swift is likely affected by this, but it’s not. Swift enforces at a language level that inout parameters don’t alias. Usually at compile time, but with runtime precondition checks when it can’t prove it statically.
Swift code currently has other overhead that makes this function bigger than it’s C equivalent but those are separate issues. Largely due to the fact that an array in Swift is not just a pointer to data, but is in fact a copy-on-write ref-counted type that knows its own size and capacity and can change its size to fit its elements.
I make no claim regarding aliasing. I make a claim regarding performance.
My claim is that this is going to be slower…
func fx(_ allints: inout [Int], _ j : inout Int) {
for k in allints.indices {
allints[k] = j
j &+= 1
}
}
than the following…
func f(_ allints: inout [Int], _ i : inout Int) {
var j = i
for k in allints.indices {
allints[k] = j
j &+= 1
}
i = j
}
Now, admittedly, there is no fundamental reason for it when the array is large. Even in C++, the compiler could add a check for overlap at the start (knowing that the loop is quite long) and optimize it away. My point in this post is that current compilers (at least some popular ones) will let you down.
$ swift -O reference.swift
ref 1.892044 ns
hard 0.554002 ns
Basically, Swift exhibits exactly the same performance characteristics as C++ with GNU GCC 8. In fact, if you convert the nanoseconds into CPU cycles, you get very nearly the same performance down to a fraction of a cycle.
I do not doubt that Swift “doesn’t alias”, as you explain… but it does not follow that the two functions have the same performance once compiled (with optimizations).
Michaelsays:
That’s quite bizarre, as inout variables are defined to be semantically exactly the same as copying the value to a local variable and copying the new value back to the original variable at the end of the function definition, so according to the rules of the language these functions should generate the exact same code.
Do you mind if I include your benchmark in a bug report on the Swift compiler?
This effect is due to possible aliasing. The compiler can’t be sure the array and the passed in value don’t overlap (i.e., that the referenced value isn’t one of the array elements). In the case of aliasing, the semantics of the two loops are not the same, since the value i will be modified by the assignment (once) in addition to the increment (maybe it’s even UB due to double-modifcation w/o sequence point). So the compiler emits this conservative code which reloads the value each time. In addition to slower code due to the in-memory increment, optimizations like vectorization are inhibited.
Possible fixes:
1) Copy
i
into a local before the loop. This way the compiler knows aliasing isn’t possible sincei
is an unescaped local. You’ll have to copyi
back at the end to maintain the semantics of incrementingk
times.2) Put the __restrict qualifier on either
i
orx
, or both. Roughly speaking, this tells the compiler the values don’t overlap. It isn’t standard in C++, but is relatively widely supported, perhaps because it is standard in C.3) Use different types. If the type pointed to by
x
is different than the type ofi
, most compilers use type based aliasing analysis (TBAA) to prove aliasing doesn’t occur.The compiler could also save you, e.g., by checking whether aliasing occurs at runtime, eg checking whether the address of
i
falls in the rangex
–x + k
. I’m fact, both clang and gcc do this for slightly simpler cases, wherei
is not incremented: if the runtime check passes, the fast loop is used, otherwise it falls back to the conservative one.In my opinion, this is trickier than it sounds. Naively, I would think that ‘counter’ in the example below is of a type distinct from ‘uint64_t’… but I see the same bad optimization effect, possibly because the standard does not allow the compiler to forbid aliasing of this sort. And this says nothing about real C++ code where it is not always immediately obvious what the actual type is (due to the tendency of C++ programmers to abstract away types).
Does using a local copy of
i
, and copying the final value back to the reference make a difference? Souint64_t j = i; for (; j < size; j++) ... i = j;
?Right: creating a local copy helps the compiler.
You mention that Swift is likely affected by this, but it’s not. Swift enforces at a language level that inout parameters don’t alias. Usually at compile time, but with runtime precondition checks when it can’t prove it statically.
Swift code currently has other overhead that makes this function bigger than it’s C equivalent but those are separate issues. Largely due to the fact that an array in Swift is not just a pointer to data, but is in fact a copy-on-write ref-counted type that knows its own size and capacity and can change its size to fit its elements.
I make no claim regarding aliasing. I make a claim regarding performance.
My claim is that this is going to be slower…
than the following…
Now, admittedly, there is no fundamental reason for it when the array is large. Even in C++, the compiler could add a check for overlap at the start (knowing that the loop is quite long) and optimize it away. My point in this post is that current compilers (at least some popular ones) will let you down.
I took a few seconds to write a Swift version of the benchmark…
Basically, Swift exhibits exactly the same performance characteristics as C++ with GNU GCC 8. In fact, if you convert the nanoseconds into CPU cycles, you get very nearly the same performance down to a fraction of a cycle.
I do not doubt that Swift “doesn’t alias”, as you explain… but it does not follow that the two functions have the same performance once compiled (with optimizations).
That’s quite bizarre, as inout variables are defined to be semantically exactly the same as copying the value to a local variable and copying the new value back to the original variable at the end of the function definition, so according to the rules of the language these functions should generate the exact same code.
Do you mind if I include your benchmark in a bug report on the Swift compiler?
Please do. Feel free to refer to my blog post as well. 🙂