I hope that one day one of my Intel CPU cores and I will be real buddies, hanging around, talking smack, and I won’t choke him and his 127 twins to 99% anymore 🙂
Michael Hay says:
Intel also has the Altera acquisition up its sleeve. With flexible hardware, they could more easily build learning assists, AI systems, and so on. The challenge with FPGAs today is that the toolchain is at least 10 years behind standard C, Java, etc. compilers, so there is a lot of work to do. We kind of need something like the defunct IBM project called LIME, or Maxeler’s OpenSPL efforts.
To me, the magic of the human mind is that the software and the wetware evolve together. With silicon hardware and software, there is a very slow mutual evolution.
Srigi says:
How exactly would this kind of instruction be fed from RAM?
Because if you get 64 B per tick (simplified), I suppose the CPU runs out of data pretty quickly. RAM/cache bandwidth is way lower than this processing speed.
We would need to see what the specifics are… My experience has been that the L1 cache is fast enough that, even with AVX-512, cache speed is not a bottleneck. However, if you cannot get the data into cache fast enough, then RAM access speed is already a major bottleneck, even without any fancy instructions.
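To make the bandwidth concern concrete, here is a back-of-the-envelope sketch in C; the 2.5 GHz clock and the ~60 GB/s DRAM figure are illustrative assumptions, not measurements of any particular system.

#include <stdio.h>

int main(void) {
    const double bytes_per_cycle = 64.0;   /* one 512-bit register per cycle */
    const double clock_hz        = 2.5e9;  /* assumed 2.5 GHz core clock     */
    const double dram_bandwidth  = 60e9;   /* assumed ~60 GB/s DRAM          */
    const double needed = bytes_per_cycle * clock_hz; /* bytes/s, per core   */
    printf("streaming a fresh 64 B every cycle needs %.0f GB/s per core,\n",
           needed / 1e9);
    printf("about %.1f times an assumed %.0f GB/s DRAM interface.\n",
           needed / dram_bandwidth, dram_bandwidth / 1e9);
    /* Conclusion: the operands must mostly be reused out of L1/L2 cache;
       otherwise RAM bandwidth is the limit long before the vector units are. */
    return 0;
}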
marc says:
If this is unspecialized vector computation, does that mean deep learning is a hype label for what we have been enjoying with matlab/R/numpy vector computation for ages? OK, deep learning has high impact, but it is much more specific than “vector computation”.
Do we have a rough idea of the performance gains that will be brought by those new Intel instructions? In other words, how will it compare to GPUs, performance-wise?
In other words, how will it compare to GPUs, performance-wise?
Given that we do not even know exactly what these instructions are, it is hard to say.
However, we can tell a few things from basic knowledge of Intel technology. Intel processors cannot compete on raw processing speed with GPUs. It could be that I am wrong, but I don’t think that these instructions will change this picture.
GPUs are powerful, true… but they are also specialized. This makes them a poor fit for many common problems… whereas Intel’s CPUs are much more broadly applicable.
So, what if you have problems where deep learning is only part of what your system has to do? Then maybe the GPU is no longer the best solution. Maybe some Intel processor with both general purpose and deep learning capabilities becomes the best bet.
We should keep in mind that Intel’s money comes (largely) from cloud infrastructures. Intel has to convince people like Amazon and Google to buy its processors. These people do a lot of work besides deep learning.
I was disappointed to not actually see the details of the instructions, but I guess we can always speculate.
I suspect that these will be tailored towards int8/int16/fp16/fp32 512-bit-wide dot-product-and-accumulate (useful for inference), and possibly instructions to accelerate FFTs and convolutions (similar to the SSE4.2 string instructions, but for convolutions instead of string matching).
With a 64-wide int8 dot product at one-per-cycle throughput (128 ops), which is more and more feasible for inference (but not training) of neural networks, a 32-core system could perform 128×32 = 4096 int8 ops per cycle, or around 8 TOPS on a 2 GHz system. This is less than the ~40 TOPS the dp4a instruction can reach on a Titan X, but it’s at least in the same ballpark. It probably wouldn’t burn too much area either.
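For concreteness, here is a scalar sketch of what such a 64-wide int8 dot-product-and-accumulate might compute. The lane layout (four int8 products folded into each of sixteen int32 accumulators, matching a 512-bit register) is my guess, since the actual instructions have not been published.

#include <stdint.h>

/* Hypothetical reference model of one 512-bit int8 dot-product-and-accumulate:
   64 int8 values per operand, with 4 products summed into each of 16 int32 lanes.
   That is 64 multiplies + 64 additions = 128 integer ops per instruction; at one
   instruction per cycle per core, 32 cores at 2 GHz give 128 x 32 x 2e9 ≈ 8.2 TOPS,
   the figure quoted above. */
void dot_accumulate_512(const int8_t a[64], const int8_t b[64], int32_t acc[16]) {
    for (int lane = 0; lane < 16; lane++) {
        int32_t sum = acc[lane];
        for (int k = 0; k < 4; k++) {
            sum += (int32_t)a[4 * lane + k] * (int32_t)b[4 * lane + k];
        }
        acc[lane] = sum;
    }
}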
A 32-wide fp16 dot-product operator (64 flops/cycle) would come in at about 4 TFLOPS, which compares favorably with the roughly 10 TFLOPS available on a Titan X, but it would take significantly more area and probably need a deeper pipeline.
Direct convolution instructions would play to the strengths of the SIMD model (e.g., the ability to do parallel shifts, as in the string instructions), and might provide impressive performance, especially in int8 mode.
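As a point of reference, here is the scalar loop such an instruction would be collapsing: a plain 1D int8 direct convolution, where each output is the dot product of the filter with a shifted window of the input. The function name and signature are only for illustration.

#include <stdint.h>
#include <stddef.h>

/* Scalar reference for a 1D int8 direct convolution: out[i] is the dot product of
   the filter with the input window starting at offset i. A SIMD convolution
   instruction would evaluate many shifted windows at once, much as the SSE4.2
   string instructions compare many shifted substrings in parallel. */
void conv1d_int8(const int8_t *in, size_t n,
                 const int8_t *filter, size_t k,
                 int32_t *out) {
    for (size_t i = 0; i + k <= n; i++) {
        int32_t acc = 0;
        for (size_t j = 0; j < k; j++) {
            acc += (int32_t)in[i + j] * (int32_t)filter[j];
        }
        out[i] = acc;
    }
}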
Head. S says:
What do you think about AMD “Bristol Ridge”? Could we use those GPU cores for this purpose?
I have not been following AMD at all, but most of our main CPUs have integrated GPUs which are often idle. This seems wasteful, but I have not heard much about it.
Vector instructions at the hardware level are related to, but distinct from, vector-oriented programming.
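To illustrate the distinction, here is a small sketch in C (plain AVX intrinsics from immintrin.h, chosen only as an example): the first loop is ordinary scalar code that a compiler may or may not map onto vector instructions, while the second spells out 256-bit hardware instructions explicitly. Vector-oriented programming in the matlab/R/numpy sense sits another level up: a library convention rather than a hardware feature.

#include <immintrin.h>

/* Plain C: becomes "vectorized" only if the compiler decides to auto-vectorize it. */
void scale_add_scalar(float *y, const float *x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Explicit hardware vector instructions: eight floats per operation. */
void scale_add_avx(float *y, const float *x, float a, int n) {
    __m256 va = _mm256_set1_ps(a);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_add_ps(vy, _mm256_mul_ps(va, vx));
        _mm256_storeu_ps(y + i, vy);
    }
    for (; i < n; i++)   /* scalar tail */
        y[i] += a * x[i];
}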
For a moment I thought Intel would add deep learning for just-in-time optimization of code.