Sounds like it’s an architecture identification bug. If you can replicate it with gcc-8.1 (or even better, the Git HEAD), report it on GCC’s bug tracker: https://gcc.gnu.org/bugzilla/
Travis Downs says:
It’s not a bug per se, because it happens when GCC is too old to know about the new arch. So it doesn’t happen (for Skylake) on newer GCC, but it would presumably still happen with a newer CPU uarch.
GeorgeL says:
Maybe it depends on your operating system and GCC version. On CentOS 7.5 with native GCC 4.8.5, and even with GCC 8.2 RC, setting march=native also means mtune=native is set.
You need to run the test with a compiler that doesn’t know about your arch to make this interesting. In particular, for gcc 8 your results are as expected: Haswell is known by gcc and you are running on Haswell, so you get march and mtune set to Haswell.
For the gcc 4.8.5 test, it isn’t clear what it means: core-avx2 is no longer a supported option for gcc (at least according to the manual); it reminds me of the icc options. It doesn’t make sense to tune for “core-avx2” since that is not a microarchitecture, so it’s hard to say what gcc is doing internally. Perhaps this behavior changed in later versions of gcc.
GeorgeL says:
For the gcc 4.8.5 test, it isn’t clear what it means: core-avx2 is no
longer a supported option for gcc (at least according to the manual):
it reminds me of the icc options? It doesn’t make sense to tune for
“core-avx2” since that is not an micro-architecture, so it’s hard to
say what gcc is doing internally. Perhaps this behavior changed in
later versions of gcc.
Ah, didn’t realise core-avx2 was no longer supported. That probably explains why I had issues compiling PHP 7.3 alphas: on a Skylake CPU it failed to compile with Zend Opcache on GCC 4.8.5 but compiled fine on GCC 7.3.1 🙂
A note about the gcc documentation you mentioned:
It could be clearer: what it should say is that “Specifying -march=cpu-type implies -mtune=cpu-type if not otherwise explicitly specified.” I had always interpreted it that way, but probably because before reading it I had seen lots of examples where both are specified (indeed, the documentation hints at that usage).
That is, it has always been the case that passing both -march and -mtune to the same compilation makes sense: you often want to target some fairly broad range of chips (say, since Sandy Bridge) but optimize for the chip you know will be the most common in your case in the immediate future (say Skylake).
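A minimal sketch of that "broad baseline, specific tune" pattern. The helper name arch_tune_flags is hypothetical (not part of gcc), and the cpu-type names in the usage comment are assumed to be known to the gcc version in use:

```shell
# Hypothetical helper: build a flag pair from an ISA baseline and a tuning target.
# $1 = oldest CPU family the binary must run on (controls usable instructions)
# $2 = CPU family to schedule/tune for; falls back to the baseline when omitted
arch_tune_flags() {
    baseline="$1"
    tune="${2:-$1}"
    printf '%s\n' "-march=${baseline} -mtune=${tune}"
}

# Usage (assumes a gcc that knows both names; foo.c is a placeholder):
#   gcc -O2 $(arch_tune_flags sandybridge skylake) -c foo.c
```

The point is only that the two flags are independent knobs: the first bounds the instruction set, the second picks the cost model.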
You can see some method to gcc’s madness here. You specify that gcc should use instructions and tuning for your arch, but you run into a problem when the arch is newer than gcc knows. In that case, what gcc does is different for the “march” side of things versus the “mtune” side.
For the march, you are just talking about available instructions and instruction sets. Any version of GCC knows about some set of instruction sets, usually corresponding to the newest arch it knows about. It can also query the instruction sets supported by the current CPU. If the CPU is an unknown type, gcc could match it against the arches it knows about, and if there is an exact match or a “superset match” it could just use that. And so it does: it selects Broadwell, since from an ISA point of view Skylake is Broadwell (Skylake may support a few extra instructions such as MPX, but since gcc doesn’t know about them, it wouldn’t query for them, and so this logic probably gets the same result whether it is using exact match or superset match).
Another way of looking at it is that -march=broadwell is just a shortcut for specifying a long list of -m options like -mavx, -mavx2, -mpclmul, etc, and the same list can be generated for -march=native by querying the processor’s capabilities, which may then be compressed to something like -march=broadwell if it matches the list implied by Broadwell.
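To make the “superset match” idea concrete, here is a rough sketch. The feature table is a small illustrative assumption, not gcc’s real internal list, and pick_arch is a hypothetical name; gcc’s actual detection logic is more involved:

```shell
# Print the newest known arch whose required features are all present in the
# space-separated CPU feature list given as $1. Falls back to plain x86-64.
# The table below (newest first) is illustrative only, NOT gcc's real data.
pick_arch() {
    cpu_flags=" $1 "
    while read -r arch needed; do
        ok=1
        for f in $needed; do
            case "$cpu_flags" in
                *" $f "*) ;;          # feature present, keep checking
                *) ok=0; break ;;     # feature missing, this arch is out
            esac
        done
        if [ "$ok" -eq 1 ]; then
            printf '%s\n' "$arch"
            return 0
        fi
    done <<'EOF'
broadwell sse4_2 avx avx2 adx
haswell sse4_2 avx avx2
sandybridge sse4_2 avx
nehalem sse4_2
EOF
    printf '%s\n' "x86-64"
}
```

A CPU that reports extra features the table has never heard of (say, mpx) still lands on the newest entry it fully covers, which is the behaviour described above for Skylake on an older gcc.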
All this is good because it prevents a huge regression when using -march=native: if it didn’t do this, then when you upgraded your CPU you’d suddenly lose access to AVX2, AVX, any version of SSE greater than 2 and so on, since gcc would just be like “Oh, I don’t know about this CPU so I’ll use the base x86-64 profile”. So I think we can say gcc is doing a reasonable thing on the -march side of things.
That leaves -mtune. The main problem as you put is that -march=native implies (for example) -mtune=broadwell on Skylake chips when gcc doesn’t know about Skylake, but it does not imply -mtune=broadwell. In fact, in this particular case, -mtune=broadwell would be the best option: -mtune=generic is worse.
We know that, however, only with the benefit of hindsight: Skylake performs very much like Broadwell (which performs essentially identically to Haswell before it), so Broadwell is a good tune for Skylake. That certainly hasn’t always been the case though: when the switch to the P4 uarch was made, the tune for the “previous” arch would have been a bad match for P4, and the same was true when P4 was in turn dropped in favor of a return to the PPro/PentiumM architecture.
So the rule of “use the latest arch (from same manufacturer?)” would have worked well recently but not in the past. It would also have trouble when some manufacturer doesn’t have a linear list of architectures, but rather also has various secondary architectures, like Intel with Atom and the Phi/Knights* stuff.
The rule of “use generic tune” seems like a reasonable compromise, and also has the advantage of being easier to implement: no need to implement an ordering of architectures or deal with the various families etc. So even though I originally thought this was really dumb, I can see the logic.
Last note. You write:
By default, when unspecified, “-mtune=generic” applies which means…
I think you know this, but one should be clear that this only applies if you don’t also specify -march. Usually you want to specify -march since the difference there is huge (newer instruction sets), and -mtune comes along for the ride.
Travis Downs says:
I hate no editing capabilities, and this typo is too important: it should read:
The main problem as you put is that -march=native implies (for
example) -march=broadwell on Skylake chips when gcc doesn’t know about
Skylake, but it does not imply -mtune=broadwell
Thanks. This is an appropriate and timely bit of information, given my upcoming exercise. 🙂
I can somewhat understand the choice of compiler-default behaviors, but also expect it might wander a bit between versions. This should not matter for most folk, for most problems, but if you are working a problem targeted for a specific processor, this stuff matters.
For the longest time, a codebase I worked on had -march=native -mtune=native. It was just easier to let GCC figure things out instead of specifying the actual values, and it worked, so why bother?
But it does. And this article is a great link to share with people who don’t know that.
The reason I had to change the code base was virtual machines. Some of the build was being done in a QEMU VM, so the CPU returned from procinfo was a QEMU. This broke the build entirely, since GCC couldn’t figure out what the CPU architecture was. But if it hadn’t been for that, I would not have been aware of the issues with -march=native -mtune=native. So thank you for writing the article to bring this to more people’s attention.
If the compiler does not know the actual architecture – you mentioned that broadwell is not correct, just close enough – how is it going to know that tuning for broadwell is more appropriate than tuning generic? Because apparently it is not a broadwell.
It seems consistent to me to apply generic tuning for a CPU about which the compiler does not (yet) have enough details. It cannot just assume that broadwell tuning is the best choice for all future broadwell successor CPUs.
It seems consistent to me to apply generic tuning for a CPU about which the compiler does not (yet) have enough details.
It is not wrong, but I would argue that it is not possible to infer this behaviour from the documentation. So the net result is a surprise, and surprises are not good.
Quentin N says:
One of the longest-running threads in compiler development. This is a great post with the key question asked, some valuable introspection tools, and the general state of things explained.
The two key discussions are 1) march is generally incrementally inclusive across processor models/capabilities, and 2) the tools themselves adapt over time to the available models.
Worth noting that the underlying tools (assembler, linker) can be sensitive to these variables.
I wish gcc and clang would both auto-generate docs to show the tune/arch/HW (and if dependent on the OS) decision tree. Maybe I need to pony up some open source development effort…
Aaron Max Fein says:
Great thread indeed, very cool to get a better grip on this… was making the same assumptions and occasionally wondered about it… 🙂
Martin Guttman says:
I find it to be more of a documentation broad wording issue and not a bug per se. Where it says :
It means exactly cpu-type, not attribute-option. Since native is not a cpu-type but rather a compiler instruction to try to match the current architecture, it does not cascade to the -mtune option, and that is well within the wording. Confusing wording, but correct.
I am not sure I ever believed it was a bug. It is just complicated.
Mingye Wang says:
For what it’s worth, on godbolt’s x86-64 gcc 13.2, “-march=native --help=target -Q” now gives whatever CPU the server happens to be using in “-mtune”. Using the available versions I found that GCC 7.2 gives generic mtune, but GCC 7.3 does native. I am a bit too lazy to find the commit for now.
Maybe it depends on your operating system and GCC version. On CentOS 7.5 with native GCC 4.8.5 and even with GCC 8.2 RC setting march=native also means mtune=native is set
On Core i7 4790K cpu
with GCC 4.8.5 native
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)
you get for march and mtune
gcc -march=native -Q --help=target | egrep -- '-march=|-mtune' | cut -f3
core-avx2
core-avx2
with GCC 8.2 RC snapshot reported as 8.1.1 right now
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/gcc-8.2.0-RC-20180719/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/opt/gcc-8.2.0-RC-20180719 --disable-multilib --enable-bootstrap --enable-plugin --with-gcc-major-version-only --enable-shared --disable-nls --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-install-libiberty --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++ --enable-initfini-array --disable-libgcj --enable-gnu-indirect-function --with-tune=generic --build=x86_64-redhat-linux --enable-lto --enable-gold
Thread model: posix
gcc version 8.1.1 20180719 (GCC)
you get for march and mtune
gcc -march=native -Q --help=target | egrep -- '-march=|-mtune' | cut -f3
haswell
haswell
and specifically for haswell target you get for march and mtune
gcc -march=haswell -Q --help=target | egrep -- '-march=|-mtune' | cut -f3
haswell
haswell
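For comparing compilers, a small filter can put both values on one line. This is a sketch that assumes the “-Q --help=target” output keeps the two-column “option, value” layout shown above; show_arch_tune is a hypothetical name. Unlike a plain grep for -mtune, it ignores the -mtune-ctrl= line that newer gcc versions also print:

```shell
# Read `gcc -Q --help=target` output on stdin and print the resolved
# -march and -mtune values on a single line.
show_arch_tune() {
    awk '
        $1 ~ /^-march=/ { a = $2 }   # first column is the option, second its value
        $1 ~ /^-mtune=/ { t = $2 }   # does not match -mtune-ctrl=
        END { print "march=" a, "mtune=" t }
    '
}

# Usage (requires gcc on the PATH):
#   gcc -march=native -Q --help=target | show_arch_tune
```

If the two values differ under -march=native, that is the situation discussed in this thread: the instruction set was matched to a known arch, but the tuning was not.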
gcc-8.2 fixes the Skylake identification bug: https://www.phoronix.com/scan.php?page=news_item&px=GCC-8.2-Relased