Daniel Lemire's blog


AMD Zen 2 and branch mispredictions

11 thoughts on “AMD Zen 2 and branch mispredictions”

  1. Travis Downs says:

    Well the rng is in the timed loop as well, so I am not sure you can be certain the difference is solely in the branch prediction part.

    The timing looks more or less in line with what I’d expect: each branch can’t resolve until the rng value is calculated, and there are 3 shifts, 3 xors, and 2 muls in the rng, for a total of 12 cycles of latency, plus the & 1 and & 2 ops for 2 more cycles, so 14 cycles. The “standard” BP latency for Intel is usually quoted as 16 cycles, so 14 + 16 = 30, almost exactly in line with your results.

    A more precise test would move the rng outside of the timed loop, although the most obvious way to do that (reading from an array) means you now have to be careful of caching effects. Another way would be to run the loop w/o mispredicts to see if the (latency-limited) baseline is the same.
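
    Concretely, a splitmix64-style finalizer (an assumption: the post’s exact rng may differ) has that shape, with 3 shifts, 3 xors, and 2 multiplies on a single dependency chain:

    ```c
    #include <stdint.h>
    #include <stdio.h>

    // Hypothetical splitmix64-style mixer: 3 shifts, 3 xors, 2 multiplies,
    // all serially dependent. At ~1 cycle per shift/xor and ~3 cycles per
    // multiply, that is roughly 3 + 3 + 2*3 = 12 cycles of latency.
    static uint64_t mix(uint64_t z) {
      z ^= z >> 30; z *= 0xbf58476d1ce4e5b9ULL;
      z ^= z >> 27; z *= 0x94d049bb133111ebULL;
      z ^= z >> 31;
      return z;
    }

    int main(void) {
      printf("%llu\n", (unsigned long long)mix(1));
      return 0;
    }
    ```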

    1. I agree that my benchmark is not sufficient to support my conclusion, but the opposite might be true: the misprediction cost could be worse.

      1. Travis Downs says:

        Indeed, my comment cuts both ways.

        Another way to check would be to vary the mispredict rate from 0% to 100% (100% being what you have now): if AMD and Intel times overlap at 0% and then diverge by 2 cycles at 100%, you have strong proof.
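
        A sketch of such a sweep (not the post’s benchmark code; the xorshift rng and threshold trick here are stand-ins): make the branch taken with probability p, moving p from 0 (fully predictable) toward 0.5 (essentially random):

        ```c
        #include <stdint.h>
        #include <stdio.h>

        static uint64_t state = 0x853c49e6748fea9bULL;

        // Stand-in rng (xorshift64); the post's generator may differ.
        static uint64_t next(void) {
          uint64_t x = state;
          x ^= x << 13;
          x ^= x >> 7;
          x ^= x << 17;
          return state = x;
        }

        int main(void) {
          // Sweep the taken probability p from 0 (fully predictable) toward
          // 0.5 (essentially random): mispredictions should rise accordingly.
          for (int k = 0; k <= 5; k++) {
            double p = k / 10.0;
            uint64_t threshold = (uint64_t)(p * 18446744073709551615.0);
            uint64_t taken = 0;
            for (int i = 0; i < 1000000; i++)
              if (next() < threshold) taken++; // predictability controlled by p
            printf("p = %.1f, taken fraction = %.3f\n", p, (double)taken / 1e6);
          }
          return 0;
        }
        ```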

  2. -.- says:

    According to https://www.7-cpu.com/, the Skylake mispredict penalty is 16.5 cycles and Zen 1’s is 19 cycles. Zen 2 results aren’t there, but it may not have changed from Zen 1.

    Zen does seem to have a curiously long pipeline – about as long as the Bulldozer family. I’ve heard speculation that the intent was to allow the chip to clock fairly high (but the process didn’t allow it).
    It is interesting to note that Icelake increases the mispredict penalty by 1 cycle, so there may also be a trend towards longer pipelines.

    1. Travis Downs says:

      The penalty (apparently) also varies depending on whether the code after the mispredict hits in the uop cache or not. Intuitively this makes sense: if you need to decode the new target, you add all the decode stages to the penalty. I recall, however, that Agner said he couldn’t measure a difference.

      Those numbers seem fairly consistent with the gap Daniel found.

      1. My best guess after hacking a bit more (and adding a few more details) is that Zen 2 has an extra cycle of penalty per mispredicted branch compared with Skylake.

  3. Ivan says:

    While correct, your post is misleading…
    Real performance depends on both the branch misprediction cost and the number of mispredicted branches.

    In your example they are the same for Intel and AMD (since the code is totally unpredictable), but in a more realistic scenario it is possible that AMD has better branch prediction.

    You could benchmark that, but it is very tricky to say what counts as “realistic branchy” code.

    1. Right. So by design, I made the branches unpredictable.

      I expect that Zen 2 has better branch prediction.

  4. Chou Keihou says:

    Well, some similar analysis has been done on the branch misprediction penalty of the Zen uarch against BDW and SKL: https://www.evolife.cn/computer/54351.html/6

    “Zen’s branch prediction penalty is around 17 to 21 cycles, Kaby Lake is 16 to 20 cycles and Broadwell is 15 to 21 cycles. In general, Broadwell’s branch penalty is generally lower, about 15 cycles, KabyLake is slightly higher, about 17 cycles, and Zen’s predicted penalty value is generally about 19 cycles.
    Through testing, we infer that Zen’s pipeline depth should be around 19 stages, but due to factors such as the µOp cache (micro-operation cache), the penalty can be as low as about 17 cycles.”

    This is not a new thing, since Zen (and Zen 2) has a deeper pipeline, and it is NOT necessarily related to IPC (BDW’s penalty is lower than Skylake’s, and if you are using an 8086 processor the penalty is only 4 cycles).

    If you want to take the branch misprediction penalty’s effect into consideration, it would be better to combine it with the branch misprediction rate on some real-world workload, instead of a random integer where you will get a 50% branch misprediction rate.

    Also, by saying “my good old Intel Skylake (2015) processor”, I hope you didn’t forget that Intel failed to ship a new architecture for the server and desktop platforms (where high performance is really needed) in 4 years, and the NEW architecture, Sunny Cove (Cannon Lake is dead; Intel even wants everyone to forget about it together with the first-gen 10 nm process), is limited to ultrabooks. That’s why Zen 2 ends up competing with Intel’s SKL uarch. And for 2020 Intel will ship Comet Lake (Skylake refresh refresh refresh refresh) to survive the year.

    1. Thanks for the detailed comment.

      I have alluded to Zen 2’s better branch predictor in the past on this blog and I will come back to it.

  5. Burak says:

    I am having trouble understanding why, when I use specific cpu/pid pinning in linux-perf-events.h, I get one cycle less on Zen+:

    #include <sys/types.h>
    ....
    pid_t pid2 = getpid();
    const int cpu = 1; // 0 indexed, second cpu
    ....
    fd = syscall(__NR_perf_event_open, &attribs, pid2, cpu, group, flags);

    then running via (taskset on debian):

    make clean && make && taskset 0x2 ./condrng

    gives me:

    cond 4.56 cycles/value 15.00 instructions/value branch misses/value 0.00
    cond 32.59 cycles/value 19.00 instructions/value branch misses/value 1.00

    But without this pinning I get:

    cond 4.51 cycles/value 15.00 instructions/value branch misses/value 0.00
    cond 33.90 cycles/value 19.00 instructions/value branch misses/value 1.00

    The man page (http://man7.org/linux/man-pages/man2/perf_event_open.2.html) suggests that the current version in the repo is valid; all I can suspect is that the cpu lookup is adding overhead.
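
    For context, the pid/cpu pair passed to perf_event_open selects what gets measured: pid = 0, cpu = -1 follows the calling thread on any CPU, while a specific cpu restricts counting to the time the process runs on that CPU. A minimal counting sketch (not the repo’s linux-perf-events.h; error handling abbreviated):

    ```c
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    // Minimal sketch: count branch misses for the calling thread on any CPU
    // (pid = 0, cpu = -1). Passing a specific cpu instead only restricts
    // *when* counting happens; it should not change the cost of the
    // measured code itself.
    int main(void) {
      struct perf_event_attr attr;
      memset(&attr, 0, sizeof(attr));
      attr.size = sizeof(attr);
      attr.type = PERF_TYPE_HARDWARE;
      attr.config = PERF_COUNT_HW_BRANCH_MISSES;
      attr.disabled = 1;
      attr.exclude_kernel = 1;

      int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
      if (fd < 0) { perror("perf_event_open"); return 1; }

      ioctl(fd, PERF_EVENT_IOC_RESET, 0);
      ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
      // ... code under test ...
      ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

      long long misses = 0;
      read(fd, &misses, sizeof(misses));
      printf("branch misses: %lld\n", misses);
      close(fd);
      return 0;
    }
    ```

    Given that, a ~1 cycle/value difference with and without pinning may simply reflect scheduling or frequency effects rather than counter overhead, though that is speculation.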