I use SIMD a lot in assembly language. If you are doing arithmetic you see a 4 by or 8 by speedup depending on the SIMD slot size. If you have do something no covered by the SIMD instruction set and can’t find a bit hack then you are back to 1 by. Also L1, L2 caching issues can limit SIMD when processing large arrays.
SIMD is very easy to program though compared to GPU multicore.
I think genuine CPU multicore with say 1000 RISC cores on one chip with good caching and memory behavior would be both relatively easy to program and a lot faster for say AI applications. There were only 56,000 transistors on a Z80 CPU in the 1970’s. Actually Intel are killing themselves with super high complexity CPUs that are now full of bugs: https://hothardware.com/news/intel-cpu-bug-kernel-memory-isolation-linux-windows-macos
Memory access problems can make all forms of CPU optimizations irrelevant. You have to first keep your processors fed with data before anything else comes into play.
The more cores you add, the more likely that memory access will become an issue.
jldsays:
Why are your software samples so Yuuuge for such tiny programs?
(also wanted to see your Roaring bit maps but deleted it without even unzipping when I saw the size)
Why are you software samples so Yuuuge for such tiny programs?
What is a software sample? If you are alluding to the source code, I wrote very little of it and just modified existing programs. The single-threaded C program is quite short. All these programs are very short thought the multithreaded code takes up more lines.
Myself, I wrote maybe 20 lines of code in preparation for this blog post.
Regarding the size of the Roaring Bitmap code releases… The Java source code for the Java Roaring Bitmap library is 150 kB. That’s a long running piece of software written in a verbose language with lots and lots of features and two dozen contributors from diverse organizations. This source code litterally fits in the cache of one of the cores on your CPU. The average web page is 2.3 MB (https://www.wired.com/2016/04/average-webpage-now-size-original-doom/) so ten times larger.
In a lot of settings (corporate) it is more preferable to have readable code over pure performance, that’s why we see the trend of threads and goroutines.
I think what would be more appealing is if we could somehow make SIMD stupid simple.
When we run benchmarks, we generally assume that the system is otherwise idle. This may be a reasonable assumption in mobile/client code, but it tends to underestimate the benefits of multithreaded server code.
Most HPC/cloud systems have little free capacity. If you’re not using it, you don’t want to pay for it. If your code isn’t using all CPU cores, some other code probably is. The CPU runs slower than in single-threaded benchmarks, and even single-threaded code is competing with other processes for memory, caches, and memory bandwidth. If 32-threaded code gives you an 18x speedup over single-threaded code in benchmarks, you probably won’t get any better performance with 32 independent single-threaded jobs.
Running independent jobs is the easiest way to take advantage of many CPU cores. Unfortunately, at least in bioinformatics, you often run out of memory before you can utilize all CPU cores. Multithreading is the second-best solution. If many jobs share the same read-only data structures, you can often combine them easily into a single job. With well-designed code, it might only require a few OMP statements or similar constructs. On the other hand, SIMD instructions and other low-level improvements can be expensive, as you need to understand the code in order to use them.
The AVX2 unit uses a different amount of energy to the ALU, so it’s not strictly better either
This has been well documented:
It is a well known that SIMD computing is energy efficient. Our results show that under the ideal conditions vector instruction can increase performance with only a nominal increase in dynamic energy consumption. (…) Doubling the SIMD width with AVX instructions doubles performance and only increases total power by 4.3%, which results in a 1.9x improvement in energy efficiency. Similarly, the use of the FMA instructions can double the performance and only increase power by 5.0%, which also nearly doubles energy efficiency. By fully exploiting both the FMA instructions and the wide SIMD width of the AVX instructions, HSW can achieve up to 6.3 GFlops/watt, a 7x improvement over the 0.9 GFlops/watt on PNY. (Czechowski et al.)
For a more extreme example, the new AVX512 instructions can draw so much power that the CPU has to throttle back its frequency to fit within the TDP.
Are you aware of a formal study that reviewed the energy usage of AVX-512 code?
Sallessays:
Another reason parallel programs may not scale so well is because of techniques such as Intel Turbo Boost: the clock of your computer is typically higher if you are using fewer cores.
I use SIMD a lot in assembly language. If you are doing arithmetic you see a 4 by or 8 by speedup depending on the SIMD slot size. If you have do something no covered by the SIMD instruction set and can’t find a bit hack then you are back to 1 by. Also L1, L2 caching issues can limit SIMD when processing large arrays.
SIMD is very easy to program though compared to GPU multicore.
I think genuine CPU multicore with say 1000 RISC cores on one chip with good caching and memory behavior would be both relatively easy to program and a lot faster for say AI applications. There were only 56,000 transistors on a Z80 CPU in the 1970’s. Actually Intel are killing themselves with super high complexity CPUs that are now full of bugs:
https://hothardware.com/news/intel-cpu-bug-kernel-memory-isolation-linux-windows-macos
Memory access problems can make all forms of CPU optimizations irrelevant. You have to first keep your processors fed with data before anything else comes into play.
The more cores you add, the more likely that memory access will become an issue.
Why are your software samples so Yuuuge for such tiny programs?
(also wanted to see your Roaring bit maps but deleted it without even unzipping when I saw the size)
Why are you software samples so Yuuuge for such tiny programs?
What is a software sample? If you are alluding to the source code, I wrote very little of it and just modified existing programs. The single-threaded C program is quite short. All these programs are very short thought the multithreaded code takes up more lines.
Myself, I wrote maybe 20 lines of code in preparation for this blog post.
Regarding the size of the Roaring Bitmap code releases… The Java source code for the Java Roaring Bitmap library is 150 kB. That’s a long running piece of software written in a verbose language with lots and lots of features and two dozen contributors from diverse organizations. This source code litterally fits in the cache of one of the cores on your CPU. The average web page is 2.3 MB (https://www.wired.com/2016/04/average-webpage-now-size-original-doom/) so ten times larger.
You can grab the released jars on maven:
http://central.maven.org/maven2/org/roaringbitmap/RoaringBitmap/0.6.66/
What about readability?
In a lot of settings (corporate) it is more preferable to have readable code over pure performance, that’s why we see the trend of threads and goroutines.
I think what would be more appealing is if we could somehow make SIMD stupid simple.
I agree but I don’t think multithreaded programming is stupid simple. It has been made simpler but that’s the result of a lot of effort.
When we run benchmarks, we generally assume that the system is otherwise idle. This may be a reasonable assumption in mobile/client code, but it tends to underestimate the benefits of multithreaded server code.
Most HPC/cloud systems have little free capacity. If you’re not using it, you don’t want to pay for it. If your code isn’t using all CPU cores, some other code probably is. The CPU runs slower than in single-threaded benchmarks, and even single-threaded code is competing with other processes for memory, caches, and memory bandwidth. If 32-threaded code gives you an 18x speedup over single-threaded code in benchmarks, you probably won’t get any better performance with 32 independent single-threaded jobs.
Running independent jobs is the easiest way to take advantage of many CPU cores. Unfortunately, at least in bioinformatics, you often run out of memory before you can utilize all CPU cores. Multithreading is the second-best solution. If many jobs share the same read-only data structures, you can often combine them easily into a single job. With well-designed code, it might only require a few OMP statements or similar constructs. On the other hand, SIMD instructions and other low-level improvements can be expensive, as you need to understand the code in order to use them.
On the other hand, SIMD instructions and other low-level improvements can be expensive, as you need to understand the code in order to use them.
Designing algorithms for SIMD is hard. I agree. I am not sure it needs to hard, however.
>Thus you may be able to complete the processing faster, but it is unclear if, all things said, you are going to use less energy
The AVX2 unit uses a different amount of energy to the ALU, so it’s not strictly better either 🙂
For a more extreme example, the new AVX512 instructions can draw so much power that the CPU has to throttle back its frequency to fit within the TDP.
The AVX2 unit uses a different amount of energy to the ALU, so it’s not strictly better either
This has been well documented:
For a more extreme example, the new AVX512 instructions can draw so much power that the CPU has to throttle back its frequency to fit within the TDP.
Are you aware of a formal study that reviewed the energy usage of AVX-512 code?
Another reason parallel programs may not scale so well is because of techniques such as Intel Turbo Boost: the clock of your computer is typically higher if you are using fewer cores.