I was looking at sizing a data warehouse on AWS the other day, and it really seems to come down to 10 Gb/s networking, shared for everything: loading, inter-cluster, storage, and query (S3 or EBS). Each hop basically redistributes the data by a different scheme (e.g., the load stream has to be parsed to get the target node for each row), so network is pretty much guaranteed to be the bottleneck.
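To make the routing step concrete: each row of the load stream is parsed just far enough to extract its distribution key, and the key is hashed to pick the target node, so every row crosses the network at least once. A minimal sketch in C++; the delimiter, key column, and node count are made-up illustration values:

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string_view>

constexpr std::size_t kNumNodes  = 8;  // hypothetical cluster size
constexpr std::size_t kKeyColumn = 2;  // hypothetical distribution-key column

// Parse a CSV row up to the key column, hash the key, and map it to a node.
// Every row is then shipped over the network to that node.
std::size_t target_node(std::string_view row) {
    std::size_t col = 0, start = 0;
    std::string_view key;
    for (std::size_t i = 0; i <= row.size(); ++i) {
        if (i == row.size() || row[i] == ',') {
            if (col == kKeyColumn) { key = row.substr(start, i - start); break; }
            ++col;
            start = i + 1;
        }
    }
    return std::hash<std::string_view>{}(key) % kNumNodes;
}

int main() {
    // Route one (made-up) line of the load stream.
    std::printf("node %zu\n", target_node("order_1017,2017-06-01,customer_42,19.99"));
}
```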
Assume that this is true. Then what is the consequence? Do you go for cheap nodes, given that, in any case, they will always be starved… Or do you go for powerful nodes to minimize network usage?
KWillets says:
The query end depends heavily on SSD cache; basically terabytes of hot data. This stuff is often joined and aggregated locally, so network is less of a factor there (part of my job is to ensure this tendency).
SSD is only available on larger SKUs, so using many small nodes would likely decrease query performance while increasing load or insert throughput. However, there is also a split strategy of having storageless “compute” nodes which handle loading while the main cluster handles storage and querying.
Nathan Kurz says:
I feel like an even simpler thing we are lacking is the ability to easily coordinate all the cores in a single machine to work on the same data. The cores tend to share a common L3, and it seems agreed that getting data from RAM to L3 is one of the bottlenecks, but there don’t seem to be any efficient low-level ways to help the cores share their common resources rather than compete for them.
This may seem like a silly complaint, since OSes have smoothly managed threads and processes across cores for decades, but for a lot of problems this is far too high-level, and the coordination is far too expensive. For example, what I’d like is to be able to design “workflows” where each core can concentrate on one stage of a multistage process, and where data can be passed between cores using the existing cache-line coherence mechanisms.
Is there anything out there that does this? Or otherwise allows inter-core coordination at speeds comparable to L3 access?
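One user-space approximation of that stage-per-core handoff is a lock-free single-producer/single-consumer ring buffer between two pinned cores, where data moves purely through the cache-coherence protocol (no locks, no syscalls). A minimal sketch; the names, capacity, and the omission of actual core pinning (which is OS-specific) are all illustrative simplifications:

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

// Single-producer/single-consumer ring: one core runs one pipeline stage and
// hands items to the next core through shared cache lines only.
template <typename T, std::size_t N>   // N must be a power of two
struct SpscRing {
    std::array<T, N> buf;
    alignas(64) std::atomic<std::size_t> head{0};  // written only by the producer
    alignas(64) std::atomic<std::size_t> tail{0};  // written only by the consumer

    bool push(const T& v) {
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == N) return false;  // full
        buf[h & (N - 1)] = v;
        head.store(h + 1, std::memory_order_release);  // publish to the other core
        return true;
    }
    bool pop(T& out) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (head.load(std::memory_order_acquire) == t) return false;      // empty
        out = buf[t & (N - 1)];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};

int main() {
    SpscRing<std::uint64_t, 1024> ring;
    std::thread producer([&] {                       // "stage 1" core
        for (std::uint64_t i = 0; i < 1000000; ++i)
            while (!ring.push(i)) { /* spin */ }
    });
    std::thread consumer([&] {                       // "stage 2" core
        std::uint64_t v, sum = 0;
        for (std::uint64_t i = 0; i < 1000000; ++i) {
            while (!ring.pop(v)) { /* spin */ }
            sum += v;
        }
        std::printf("sum = %llu\n", (unsigned long long)sum);
    });
    producer.join();
    consumer.join();
}
```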
It seems that our Knights Landing box has some fancy core-to-core communication. You have to have that when the number of cores goes way up.
And what about AMD’s Infinity Fabric and Intel’s mesh topology for Skylake?
Nathan Kurz says:
I’d guess that the hardware support is already present. What’s lacking is a reasonable interface that allows access from userspace.
Consider the MONITOR/MWAIT pair: http://blog.andy.glew.ca/2010/11/httpsemipublic.html. It’s always seemed like there should be a way to use these to do high performance coordination between cores, and apparently they are finally available outside the kernel on KNL, but I’ve never seen a project that makes use of them.
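For what it could look like from user space, here is a rough sketch of a MONITOR/MWAIT handoff between two cores, assuming hardware that exposes the pair outside the kernel (as the comment above says KNL does). On most x86 parts these instructions are privileged, so treat this as an illustration rather than portable code; the flag layout and memory orderings are assumptions:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <pmmintrin.h>  // _mm_monitor / _mm_mwait intrinsics (compile with -msse3)

alignas(64) std::atomic<std::uint32_t> flag{0};  // one cache line shared by two cores

// Core A: arm a watch on the flag's cache line and sleep until core B writes it.
void wait_for_handoff() {
    while (flag.load(std::memory_order_acquire) == 0) {
        _mm_monitor(&flag, 0, 0);                              // watch this cache line
        if (flag.load(std::memory_order_acquire) != 0) break;  // re-check to avoid a lost wakeup
        _mm_mwait(0, 0);                                       // sleep until the line is written
    }
    // ... consume whatever the other core handed over ...
}

// Core B: a plain store to the watched line is enough to wake core A.
void hand_off() {
    flag.store(1, std::memory_order_release);
}

int main() {
    std::thread a(wait_for_handoff);  // will fault on hardware that keeps MWAIT privileged
    std::thread b(hand_off);
    a.join();
    b.join();
}
```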
Alternatively, consider the number of papers you read that provide good whole-processor numbers versus those that provide only single-threaded numbers. Almost everything that shows multicore results does it by running multiple parallel instantiations of the same code. Are there really no further optimizations possible?
Yiannis Papadopoulos says:
Most of the time you have to move data from main memory to accelerator memory, but a lot of new platforms allow for cache-coherent unified memory that blurs the line between CPU core and accelerator; the accelerator memory, if any, is more of a cache than anything else.
The fundamental problem is not how to move lots of data, but how to move enough to keep the GPU/FPGA busy.
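To make the “accelerator memory as a cache” point concrete, here is a minimal CUDA sketch using managed (unified) memory: the same pointer is touched by the CPU and the GPU, and the runtime migrates the pages on demand. The kernel and sizes are placeholders:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));     // one pointer, visible to CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;      // pages live on the host after this
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);  // pages migrate to the GPU on access
    cudaDeviceSynchronize();
    std::printf("x[0] = %.1f\n", x[0]);           // touched on the CPU again: migrates back
    cudaFree(x);
    return 0;
}
```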
Sorry, I don’t get this exponentiality argument. GPU memory is much faster and much more expensive than commodity RAM. It is just not feasible to use GPU memory to store things: you can do it much cheaper without a GPU.
I think the model is to view the GPU as a smart cache. Whether it can deliver value outside of specific applications… I do not know.
But it sits on a weak PCIe link, unless you have a different interconnect such as InfiniBand.
Yiannis Papadopoulos says:
There is also NVLink (NVIDIA Volta, IBM POWER9) and CPU+GPU products (e.g., AMD Kaveri, nearly all smartphone processors).
The discussion should not be about interconnects or whether the memory is unified, but about how high the latency to access non-local memory is and how you can hide it.
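On the “how you can hide it” side, the usual trick on a discrete GPU is to chunk the data and double-buffer, so the PCIe transfer of one chunk overlaps with the kernel working on the previous one. A rough CUDA sketch; the kernel, chunk size, and chunk count are arbitrary placeholders:

```cpp
#include <cuda_runtime.h>

__global__ void process(float *chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] = chunk[i] * 2.0f + 1.0f;  // stand-in for real work
}

int main() {
    const int chunk = 1 << 20, nchunks = 64;
    float *host = nullptr;
    cudaMallocHost((void **)&host, (size_t)chunk * nchunks * sizeof(float));  // pinned, so copies can be async
    float *dev[2];
    cudaMalloc((void **)&dev[0], chunk * sizeof(float));
    cudaMalloc((void **)&dev[1], chunk * sizeof(float));
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < nchunks; ++c) {
        int b = c & 1;  // alternate buffers and streams
        cudaMemcpyAsync(dev[b], host + (size_t)c * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(chunk + 255) / 256, 256, 0, s[b]>>>(dev[b], chunk);
        // While this kernel runs, the copy for the next chunk proceeds on the other stream.
    }
    cudaDeviceSynchronize();

    cudaFree(dev[0]);
    cudaFree(dev[1]);
    cudaFreeHost(host);
    return 0;
}
```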
The last time I checked on hybrid GPU/CPU solutions, I found them severely underpowered compared to the lineup of NVIDIA products.
I don’t get your comment about latency and interconnects. The quality of the interconnect directly affects latency. Also having a fancy interconnect adds complexity.
Neeraj Badlani says:
Seems like an interesting seminar. Would you happen to know if they record this?
Thanks
They most certainly do not record it (they are not presentations but discussions), but there will be a written (free) report. Moreover, I will write more about some of these topics in the coming weeks and months.
alecco says:
Agree wholeheartedly with the issue of moving things into the GPU from RAM. But… it is now possible to do direct SSD-to-GPU transfers without going through the CPU or main memory. The GPU maps a portion of its memory into the PCIe bus. See pg-strom (though it doesn’t seem to show much activity after 2016).
A problem is that GPUs with lots of memory are quite expensive. An Nvidia Tesla with 24 GB costs $500-$3000. And the way to use GPUs efficiently is with proprietary interfaces.