So, the branch predictor incrementally learns a 2000-iteration branch sequence to a pretty high level of accuracy? I must say I’m curious about the internal state and evolution of this machinery. I suppose Intel doesn’t really provide much public detail on it; could it be a Hopfield network or something similar, maybe mixing in some stochastic evolution?
I have updated my post with numbers from AMD Rome.
I’m ordering an AMD Rome server this week.
foobar says:
I suppose AMD has advertised their branch predictor as using a “neural network” in its operation – which on this kind of silicon would probably be more of a perceptron than the tensor engines behind today’s hyped-up NNs.
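For context, here is a toy software model of the kind of perceptron predictor described in the academic literature (the Jiménez/Lin design). It is only a sketch of the general idea – AMD’s actual hardware is undocumented, and the history length, table size and threshold below are made-up values.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

// Toy perceptron predictor (textbook Jimenez/Lin style), purely
// illustrative: one small weight vector per branch (selected by hashing
// the PC), dotted with the recent global history (+1 = taken, -1 = not).
constexpr int HIST  = 16;    // history length (made up)
constexpr int TABLE = 1024;  // number of perceptrons (made up)
constexpr int THRESHOLD = static_cast<int>(1.93 * HIST + 14);  // classic training threshold

struct Perceptron { std::array<int, HIST + 1> w{}; };

std::array<Perceptron, TABLE> table;
std::array<int, HIST> history{};  // entries are +1 or -1

// Predict: sign of (bias + weights . history). Also return the sum so
// training can check whether the prediction was confident enough.
bool predict(uint64_t pc, int& sum_out) {
    const Perceptron& p = table[pc % TABLE];
    int sum = p.w[0];
    for (int i = 0; i < HIST; ++i) sum += p.w[i + 1] * history[i];
    sum_out = sum;
    return sum >= 0;
}

// Train on the actual outcome, then shift it into the global history.
void update(uint64_t pc, int sum, bool taken) {
    Perceptron& p = table[pc % TABLE];
    const bool predicted_taken = sum >= 0;
    if (predicted_taken != taken || std::abs(sum) <= THRESHOLD) {
        const int t = taken ? 1 : -1;
        p.w[0] += t;
        for (int i = 0; i < HIST; ++i) p.w[i + 1] += t * history[i];
    }
    for (int i = HIST - 1; i > 0; --i) history[i] = history[i - 1];
    history[0] = taken ? 1 : -1;
}
```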
Travis Downs says:
Oddly, they were using that even before the current ML hype. On their recent chips they are also using TAGE, but they still have the perceptron too – it isn’t clear to me which one is responsible for what, or whether they are arranged in a hierarchy.
Look up TAGE. It does well on this kind of test because it uses a limited history of, say, 20-30 branches as a lookup into a history table. Here, each 20-30 branch sequence identifies, with very high probability, the exact position within the random sequence. I.e., the “tag” carries a high amount of information.
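A quick way to convince yourself of that information argument (this is a toy table lookup of my own, not a model of the real TAGE tables): key a map on the last 24 outcomes of a fixed pseudo-random sequence and it learns the whole thing within a couple of passes, because each 24-bit window essentially occurs at only one position.

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <unordered_map>
#include <vector>

int main() {
    // Fixed seed: the "random" 2000-branch pattern is identical on every
    // pass, just like the pseudo-random data in the benchmark.
    std::mt19937 gen(1234);
    std::vector<int> outcomes(2000);
    for (auto& o : outcomes) o = gen() & 1;

    // Toy predictor: map the last 24 outcomes (packed into an integer)
    // to whatever outcome followed that window last time.
    std::unordered_map<uint32_t, int> table;
    uint32_t hist = 0;
    const uint32_t mask = (1u << 24) - 1;

    for (int pass = 0; pass < 5; ++pass) {
        int mispredicts = 0;
        for (int o : outcomes) {
            auto it = table.find(hist);
            int guess = (it != table.end()) ? it->second : 1;  // default: taken
            if (guess != o) ++mispredicts;
            table[hist] = o;                                   // learn this window
            hist = ((hist << 1) | static_cast<uint32_t>(o)) & mask;
        }
        std::printf("pass %d: %d / 2000 mispredicted\n", pass, mispredicts);
    }
}
```

Real TAGE hashes the history into finite tables, so it can alias and has to compete for entries, but the basic point about how much information a long random history carries is the same.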
On the other hand, TAGE does poorly for apparently much simpler cases. Consider a nested loop with a fixed number of iterations. Here, Intel chips fail to predict the inner loop exit with as few as 30ish iterations (the transition is sharp, from 100% to 0% basically when you cross the threshold).
That’s because the tag is now very sparse in entropy/information. Assuming a taken branch is 1, the history looks like 1111111011111110… for a 7-iteration loop. I.e., mostly ones. Even though a 26-bit tag (say) could distinguish almost 70 million patterns (and in Daniel’s random test you could get close, if you didn’t hit some other resource limit), in this case it’s basically all 1s. The only thing you can key on is that occasional 0, and as soon as you get to 27 iterations, the 0 won’t appear in a 26-bit key, so all the predictor sees is a long string of 1s and it always fails.
This is why you sometimes see recommendations to keep nested loops to fewer than 16 iterations.
Intel chips used to have a specific loop predictor which could remember iteration counts for loops like this (indeed, a much easier problem in principle than remembering a series of 2,000 random branches) but it was removed sometime around the start of the Core series, IIRC.
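To make the nested-loop case concrete, here is the shape of a test you could run (my sketch, not from the post): compare something like INNER = 20 against INNER = 40 under `perf stat -e branches,branch-misses ./a.out`, and on the Intel chips discussed above you would expect the exit branch to go from essentially free to roughly one miss per outer iteration.

```cpp
#include <cstdio>

int main() {
    constexpr int OUTER = 1000000;
    constexpr int INNER = 40;     // try ~20 vs. ~40 and compare branch-misses
    volatile long sink = 0;       // volatile keeps the loops from being optimized away
    for (int i = 0; i < OUTER; ++i) {
        // The inner loop's backward branch is taken INNER-1 times and then
        // falls through once. With a small INNER the previous exit is still
        // inside the global history, so the predictor can anticipate the next
        // one; past roughly 30 iterations (per the discussion above) the exit
        // is mispredicted on every outer iteration.
        for (int j = 0; j < INNER; ++j) {
            sink += j;
        }
    }
    return static_cast<int>(sink & 1);
}
```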
foobar says:
I wonder what these graphs would look like if one could completely flush (or reset to a known state, say, “all branches not taken”) the predictor state before the run. Maybe what I took for incremental “learning” was really incremental “unlearning” of branch histories which were essentially random in the context of this benchmark.
Which actually makes me wonder how much information one might be able to fish out of the branch predictor about its past by careful timing or performance-counter measurements.
Travis Downs says:
Yeah, the same things that make TAGE good at predicting long series of correlated branch directions make it hard to flush: it’s not just a matter of executing a large number of branches at different addresses, since those will never access the longer history tables (or maybe will only access the longer tables).
You basically need a TAGE-aware flusher that uses multiple sequences, each of which targets a specific TAGE history length (and you might need multiple of those, depending on whether there are independent TAGE states, e.g. hashed by address). Furthermore, there are probably fast-but-dumb non-TAGE predictors to catch the easy cases that you’ll have to flush and defeat as well.
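I don’t know the real table geometry, so purely as a sketch of that “one sequence per history length” idea (all lengths, trip counts and constants below are guesses): fixed-count loops force the predictor to reach for progressively longer histories to predict the exit, while random-direction branches pollute the short-history and base predictors. Whether this actually evicts much depends entirely on the table sizes and hash functions, which is exactly the “TAGE-aware” part that is missing here.

```cpp
#include <cstdint>
#include <random>

// Sketch only: sweep inner-loop trip counts so the loop-exit branch can
// only be correlated with history at least that long, which should (in
// principle) make the predictor allocate and overwrite entries in tables
// of many different history lengths. All the constants are guesses.
long loop_of_length(int k, int reps) {
    volatile long sink = 0;
    for (int r = 0; r < reps; ++r)
        for (int j = 0; j < k; ++j)   // exit needs roughly >= k bits of history
            sink += j;
    return sink;
}

long scramble_predictor() {
    long acc = 0;

    // Random-direction branches to pollute the short-history/base predictors.
    std::mt19937_64 rng(std::random_device{}());
    for (int i = 0; i < 100000; ++i) {
        if (rng() & 1) acc += i; else acc -= i;
    }

    // Loops targeting progressively longer histories.
    for (int k = 4; k <= 256; k *= 2)
        acc += loop_of_length(k, 10000);

    return acc;  // use the result so nothing is optimized away
}
```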
I think you can say that most of the behavior is incremental learning, not unlearning. At least the long tail where the predictions become very good should be mostly “learning”, since unlearning really just means you can go from 0% to 50% (from totally wrong to a random guess, i.e., right half the time). To get above 50% and towards 99% must be learning.
That said, there is probably some unlearning right at the start: you’ll get a lot of false predictions (still 50%, though) for short TAGE lengths, since you’ll be matching against earlier hits that don’t have any relationship. I don’t think that’s a major effect.
Interesting challenge: design a series of branches which gets a 100% misprediction rate. You could do it if you knew the internal behavior of the predictor, but even if not you could maybe do it incrementally.
foobar says:
Interesting points, thanks for the insight!
Zelakt says:
I’m confused. If the branch is completely random, what is there to predict? How is branch prediction improving?
Importantly, from run to run, we generate the same random integers. We use pseudo-random numbers.
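In other words (a minimal reconstruction of the kind of benchmark being discussed, not the post’s exact code): the values look random, but the seed is fixed, so every trial replays exactly the same taken/not-taken pattern, and that pattern is what the predictor gradually memorizes.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    // Fixed seed: "random" data, but the same 2000 values on every run
    // and every trial, so the branch below follows the same taken/not-taken
    // pattern each time through.
    std::mt19937 gen(42);
    std::vector<uint32_t> data(2000);
    for (auto& v : data) v = gen();

    std::vector<uint32_t> out(data.size());
    for (int trial = 0; trial < 10; ++trial) {
        auto start = std::chrono::steady_clock::now();
        size_t n = 0;
        for (uint32_t v : data) {
            // Data-dependent branch (check the assembly: some compilers may
            // turn this into a branch-free form at high optimization levels).
            if (v & 1) out[n++] = v;
        }
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("trial %2d: kept %zu, %lld ns\n", trial, n, (long long)ns);
    }
}
```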