Daniel Lemire's blog


Iterating in batches over data structures can be much faster…

13 thoughts on “Iterating in batches over data structures can be much faster…”

  1. Stuart says:

    I cannot agree with this more strongly. I’ve made improvements to InfluxDB cursors to move batches of time series data between stages, with measurable performance improvements 👍🏻

  2. Simon says:

    Try doing batch operations with gcc and its __builtin_prefetch() builtin [1]. Do a batch of n prefetches (e.g. 8 or so; experiment with different batch sizes for your particular CPU) for n different memory addresses, immediately followed by a batch of n actual operations on the memory just fetched (sometimes you can do other work in between, because prefetching memory can be SLOW…). I’ve seen massive latency savings in the past doing this; a minimal sketch of the pattern follows below. Good luck!

    [1] https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
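    An illustrative sketch of the batched-prefetch pattern described above (the array, the scattered index computation, and the sizes are made up for the example, not taken from anyone’s code):

    /* gcc -O2 prefetch_batch.c -o prefetch_batch */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BATCH 8              /* experiment with different batch sizes for your CPU */
    #define N (1u << 26)         /* 64M entries (512 MB of data), far larger than the cache */

    int main(void) {
        uint64_t *data = malloc(N * sizeof(uint64_t));
        uint32_t *idx  = malloc(N * sizeof(uint32_t));
        if (!data || !idx) return 1;
        /* scattered access pattern so the hardware prefetcher cannot predict it */
        for (uint32_t i = 0; i < N; i++) { data[i] = i; idx[i] = (i * 2654435761u) % N; }

        uint64_t sum = 0;
        for (uint32_t i = 0; i + BATCH <= N; i += BATCH) {
            /* pass 1: issue a prefetch for each address in the batch */
            for (int j = 0; j < BATCH; j++)
                __builtin_prefetch(&data[idx[i + j]], 0 /* read */, 0 /* low locality */);
            /* (other useful work could go here while the cache lines are in flight) */
            /* pass 2: the actual operations on the, hopefully, now-cached lines */
            for (int j = 0; j < BATCH; j++)
                sum += data[idx[i + j]];
        }
        printf("%llu\n", (unsigned long long)sum);
        free(data); free(idx);
        return 0;
    }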

    1. My experience with __builtin_prefetch() for performance has not been fruitful. I use it as part of benchmarks to help ensure that the cache is properly populated, but otherwise I rely on the processor to prefetch.

      1. Simon says:

        It is difficult to get results using __builtin_prefetch(), but don’t give up! I’ve found it easiest to get results using it in a batch. There are special circumstances where it’s going to help: (a) Don’t make your batch size too big or too small; the CPU can only prefetch so many cache lines at a time. (b) Only use __builtin_prefetch() if you’re expecting two or more cache-line fetches not to be in the cache, otherwise it’ll just get in the way. Generally speaking, the data structures you’re working on need to be much bigger than the CPU cache and not already cached. I wonder if you have not had success because you tend to experiment with smaller ‘demo sized’ data structures rather than ‘production sized’ data structures? An example of a situation where __builtin_prefetch() would be useful is looking up a batch of keys in a larger hash table. HTH and don’t give up on __builtin_prefetch()! 🙂
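        To make the hash-table case concrete, here is a compilable sketch of just the lookup path: for a batch of keys, compute every bucket and prefetch it before probing any of them. The table layout, hash function, and names are illustrative, not from any particular implementation, and the probe loop assumes the table is never completely full.

        #include <stdint.h>
        #include <stddef.h>

        #define TABLE_SIZE (1u << 24)   /* number of slots; much larger than the CPU cache */
        #define BATCH 8

        typedef struct { uint64_t key; uint64_t value; } slot_t;  /* key 0 marks an empty slot */

        static inline size_t bucket_of(uint64_t key) {
            return (key * 0x9E3779B97F4A7C15ull) & (TABLE_SIZE - 1);
        }

        /* Looks up BATCH keys at once; returns how many were found and writes their values to out. */
        size_t lookup_batch(const slot_t *table, const uint64_t *keys, uint64_t *out) {
            size_t buckets[BATCH];
            /* pass 1: compute each key's bucket and prefetch the corresponding cache line */
            for (int i = 0; i < BATCH; i++) {
                buckets[i] = bucket_of(keys[i]);
                __builtin_prefetch(&table[buckets[i]], 0, 0);
            }
            /* pass 2: probe; the first slot of each probe sequence should now be in cache */
            size_t found = 0;
            for (int i = 0; i < BATCH; i++) {
                for (size_t b = buckets[i]; table[b].key != 0; b = (b + 1) & (TABLE_SIZE - 1)) {
                    if (table[b].key == keys[i]) { out[found++] = table[b].value; break; }
                }
            }
            return found;
        }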

  3. Can you give me a short C program (say 30 lines of code) where __builtin_prefetch() helps tremendously, and such that the code without __builtin_prefetch() is not simply poorly designed so as to make __builtin_prefetch() look good.

    That is, give me an example where __builtin_prefetch() helps a lot and where I cannot, using straight portable C, get the same kind of performance.

    Feel free to use lots of RAM.

    1. And to set the rules, let us agree to use a recent Intel processor with a recent C compiler on Linux.

  4. Simon says:

    Have a look at [1], which uses a 1 GB block of memory and then runs a series of experiments on it by looping over it with differing batch sizes (1, 2, 4, 8, 16, 32, 64, 128, and 256), differing increments between reads (1, 65, 513, and 524,225 bytes), and with and without __builtin_prefetch(). Each experiment reads from the block of memory the same number of times: 500 million.

    It should be possible for you to easily run this on your Linux box with little or no changes. The results will likely be sensitive to the exact CPU type, and the speed of the RAM. Please let me know how you get on. Also, I created this example especially for you, so it’s possible there are flaws / bugs. If you find any then please let me know! 🙂

    The results show that the biggest gains from __builtin_prefetch() come where automatic prefetching has little effect and where the cache also has little effect. Let’s look at the inc 524,225 results, which represent the slowest, non-cached memory reads. The fastest batch size on my CPU without prefetch is batch size 2, with 84.5 million reads per second. With prefetch this jumps up to 94.3 million reads per second, or 11.5% faster. However, with a batch size of 64, using prefetch is 40% faster.

    I think the experiment is also interesting because it shows how effective the CPU cache is. For example, the best result without prefetching is 970 million reads per second with batch size 16 and inc 1, and the worst is 70.4 million per second with batch size 16 and inc 524,225. So the fastest, cache-accelerated case (without prefetching) is 13.8 times faster… over an order of magnitude. This also shows that between the prefetch loop and the batch loop, it would be possible to do quite a lot of CPU work on cached memory while waiting for the prefetched RAM to arrive in cache. Perhaps a further experiment could prove this?

    Also interesting: without prefetch, reads per second get consistently slower as the batch size gets bigger, except for batch size 1 or inc 1. With prefetch that’s not the case; as the batch size gets bigger, the reads per second do not necessarily get slower.

    Looking forward to any comments.

    [1] https://gist.github.com/simonhf/caaa33ccb87c0bf0775a863c0d6843c2

  5. Simon says:

    It’s interesting when you try this code on an Amazon box… which I did (but not for the results above). I found an unused Amazon c3.8xl box. Of course, it’s unused by me… but there may be other (‘noisy’?) neighbours on the box doing stuff. Although the virtualization layer gives each running OS a fair slice of CPU time… AFAIK there is no way to virtualize the CPU cache… which means that what you get from your fair slice of CPU time really depends upon how much of your memory is cached. We saw above that having memory cached can make an order of magnitude of difference.

    So I ran the tests seven times on the Amazon box, and here is one set of results, which varied enormously from run to run:

    500000000 incs in 8.485089 seconds or 58926901 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 23.625080 seconds or 21163949 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 8.736761 seconds or 57229446 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 35.973597 seconds or 13899083 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 9.903223 seconds or 50488613 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 8.794001 seconds or 56856941 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 10.606903 seconds or 47139113 incs per second without prefetch using batch_size 8 and inc 524225

    On most test runs, this particular test took between 8.5 and 10.6 seconds. However, on two of the runs it took 23.6 and 36 seconds! Surely this must be because of ‘noisy CPU cache neighbours’?

    1. Thanks. I am not ignoring your code… there might just be a delay before I get to it because I have many urgent tasks to complete in a timely manner.

      1. Simon says:

        No worries 🙂

  6. Simon says:

    I’ve been looking into this a bit more and discovered that when you run the same memory bandwidth benchmark on the same hardware but with different OSs, the results can differ in several ways:

    The benchmark can simply be generally faster for some OSs.
    The benchmark can be slower for hyperthreads for some OSs.
    The benchmark percentiles at 90% and above can be abnormally slow.

    For example, I have included raw results below (because I cannot figure out how to attach a graph) from 6 sessions: 3 different types of ‘bare metal’ hardware from packet.net [1] (ARM c2, Intel Xeon m1, and a different Intel Xeon m2), each running the same benchmark under 2 different OSs. In each session the benchmark is run hundreds of times and the percentile results are on the x axis.

    read 720 lines: max-packet-net-centos7/c2.medium/cache-line-example-2-b.log ;[1.03 p0][1.04 p5][1.04 p10][1.04 p15][1.04 p20][1.04 p25][1.04 p30][1.05 p35][1.05 p40][1.05 p45][1.06 p50][1.06 p55][1.06 p60][1.06 p65][1.06 p70][1.06 p75][1.06 p80][1.06 p85][1.06 p90][1.07 p95][1.07 p100]
    read 720 lines: max-packet-net-centos7/m1.xlarge/cache-line-example-2-b.log ;[1.02 p0][1.03 p5][1.05 p10][1.07 p15][1.08 p20][1.08 p25][1.09 p30][1.09 p35][1.09 p40][1.09 p45][1.09 p50][1.09 p55][1.09 p60][1.10 p65][1.10 p70][1.10 p75][1.10 p80][1.10 p85][1.11 p90][1.13 p95][1.90 p100]
    read 840 lines: max-packet-net-centos7/m2.xlarge/cache-line-example-2-b.log ;[1.21 p0][1.22 p5][1.22 p10][1.22 p15][1.22 p20][1.23 p25][1.23 p30][1.24 p35][1.24 p40][1.25 p45][1.26 p50][1.26 p55][1.26 p60][1.26 p65][1.27 p70][1.27 p75][1.27 p80][1.27 p85][1.28 p90][1.29 p95][1.99 p100]
    read 720 lines: max-packet-net-ubuntu1604/c2.medium/cache-line-example-2-b.log;[0.99 p0][0.99 p5][0.99 p10][0.99 p15][0.99 p20][0.99 p25][0.99 p30][1.00 p35][1.00 p40][1.00 p45][1.00 p50][1.00 p55][1.00 p60][1.00 p65][1.00 p70][1.01 p75][1.01 p80][1.01 p85][1.01 p90][1.02 p95][1.03 p100]
    read 720 lines: max-packet-net-ubuntu1604/m1.xlarge/cache-line-example-2-b.log;[1.01 p0][1.02 p5][1.04 p10][1.07 p15][1.08 p20][1.08 p25][1.08 p30][1.08 p35][1.09 p40][1.09 p45][1.20 p50][1.23 p55][1.24 p60][1.25 p65][1.25 p70][1.25 p75][1.25 p80][1.25 p85][1.25 p90][1.26 p95][1.57 p100]
    read 840 lines: max-packet-net-ubuntu1604/m2.xlarge/cache-line-example-2-b.log;[1.21 p0][1.21 p5][1.21 p10][1.22 p15][1.22 p20][1.22 p25][1.23 p30][1.23 p35][1.23 p40][1.23 p45][1.24 p50][1.31 p55][1.31 p60][1.31 p65][1.32 p70][1.32 p75][1.35 p80][1.35 p85][1.35 p90][1.36 p95][1.36 p100]

    Notes:

    c2 and m1 have 48 CPUs each (including hyperthreads), while m2 has 56 CPUs.
    Each benchmark was run 15 times on each CPU.

    I have also got similar results on Amazon testing different OSs. For some reason, Amazon Linux 2 performs consistently much worse than either CentOS 7 or Ubuntu 16.04.

    [1] https://www.packet.net/

  7. Gerg says:

    I don’t understand the comment that you’re adding a “memory copy over a small buffer”. Surely the net effect is to copy the same total amount of data regardless of the size of the buffer.

    The metric that will matter here is the ratio between the time needed to copy all the data and the amount of work being done with all that data. If the data is large and there’s very little work being done on it then copying it all is going to add a lot of extra time. If the data is small and there’s lots of work being done on each datum then the time to copy it all will be insignificant and the cache efficiency will dominate.

    This is especially true if the data structure itself requires lots of effort to look up individual data. If you’re descending a btree for each datum, then even if you don’t optimize the descent for batch processing, descending it again for the next datum will be much faster if all those nodes are still in cache.
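    For reference, a minimal sketch of the batch-iteration pattern being discussed: the data structure copies elements into a small fixed-size buffer that fits in L1 cache, and the caller processes one buffer-full at a time. The container, callback, and buffer size here are made up for illustration.

    #include <stddef.h>
    #include <stdint.h>

    #define BUF_SIZE 256   /* small enough that the buffer stays in L1 cache */

    typedef struct { const uint64_t *data; size_t size; } container_t;

    /* Visits every element, handing the callback up to BUF_SIZE values at a time. */
    void for_each_batch(const container_t *c,
                        void (*callback)(const uint64_t *batch, size_t n, void *ctx),
                        void *ctx) {
        uint64_t buffer[BUF_SIZE];
        size_t n = 0;
        for (size_t i = 0; i < c->size; i++) {
            buffer[n++] = c->data[i];          /* the “memory copy over a small buffer” */
            if (n == BUF_SIZE) { callback(buffer, n, ctx); n = 0; }
        }
        if (n > 0) callback(buffer, n, ctx);   /* flush the final partial batch */
    }

    Whether the extra copy pays for itself depends on the ratio described above: when little work is done per element, the copy dominates; when lots of work is done per element, the copy is negligible and the cache-friendliness of the batch wins.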

    1. “I don’t understand the comment that you’re adding a ‘memory copy over a small buffer’. Surely the net effect is to copy the same total amount of data regardless of the size of the buffer.”

      I am not sure why we don’t understand each other. Yes. I agree that the size of the buffer, as long as it fits in L1 cache, is probably not super important. Did I write something different?