Daniel Lemire's blog


The big-load anti-pattern

16 thoughts on “The big-load anti-pattern”

  1. --- says:

    “roughly the size of your CPU cache”

    That’s often too small, although if you’re talking about the L3 cache on many present-day CPUs, it’s probably okay. The cost of context switching and I/O is significantly worse than RAM speed on most setups, so you often want to tune for I/O rather than for the CPU cache, such as by making large read requests to disk (particularly if it’s a hard disk).

    What’s probably more important is doing async I/O, so your processing can occur whilst the disk is pulling in data for you. Also, avoiding random I/O, as sequential I/O is generally fastest (i.e. be careful with reading too many things at the same time).

    Since I/O costs typically dwarf RAM speeds, large files are better than smaller ones (fewer random accesses and fewer penalties from opening files) – it’s why many games go to the effort of packing all their assets into archives: it’s faster to deal with one big file than with many smaller ones.
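
    For illustration, here is a minimal sketch (assuming a POSIX system) of tuning reads for the disk rather than for the CPU cache; the 4 MB chunk size is an arbitrary illustrative choice, and the byte count merely stands in for real processing:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (4 * 1024 * 1024) /* sized for the disk, not for the CPU cache */

    int main(int argc, char **argv) {
        if (argc < 2) { return EXIT_FAILURE; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }
    #ifdef POSIX_FADV_SEQUENTIAL
        /* tell the kernel the access is sequential so it can read ahead */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    #endif
        char *buffer = malloc(CHUNK);
        unsigned long long total = 0;
        ssize_t n;
        /* issue large sequential read requests; process each chunk as it arrives */
        while ((n = read(fd, buffer, CHUNK)) > 0) {
            total += (unsigned long long)n;
        }
        printf("read %llu bytes\n", total);
        free(buffer);
        close(fd);
        return EXIT_SUCCESS;
    }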

    1. Yes, using many tiny files would be a performance anti-pattern of its own (which I refer to as “small-load”).

      Note that in my context, I assume a high-performance SSD. If you are using a spinning drive, then you may be limited to a few megabytes per second (if that) and my whole post is irrelevant.

      1. --- says:

        It’s not just many tiny files: any splitting can slow you down, because each file you have incurs overhead when opening/closing it (and can reduce sequential accesses, because filesystems are less likely to place these specific files sequentially on disk). Granted, if the files are large enough, the cost is pretty small, but that doesn’t mean you should arbitrarily introduce splitting on the assumption that it improves performance.

        Most SSDs are limited by SATA bandwidth, so around 500 MB/s max. If you’re using a cloud hosting platform, you’ll also be limited by network bandwidth, so possibly even less.
        Even if you’re using a high-performance modern NVMe drive, you’re still stuck with needing to do kernel traps when issuing reads.

        As such, the point stands that you shouldn’t optimise based on CPU cache, but on whatever is most efficient for I/O. 1GB is likely too big, but 1MB is likely too small.
        SSDs obviously have different performance characteristics to HDDs, but fortunately, for sequential access, the guidelines for the two are the same.

        1. I take your point that many people’s I/O is limited to megabytes per second. Certainly, my home Internet peaks at about 50 MB/s (download).

          On the latest servers I purchased, I am measuring bandwidths for sequential reads that are > 3 GB/s (using standard methodologies). AWS provides gigabytes per second of network bandwidth (I measured over 4 GB/s for large files in node-to-node transfers).

          1. --- says:

            AWS scales available network bandwidth based on your instance size. All but the few largest instance sizes will not achieve GB/s network bandwidth.

            Also note that you’re measuring sequential I/O. This is usually achievable if everything is in one file, however, if it’s spread out across multiple files, you’re less likely to get fully sequential I/O.

            1. “AWS scales available network bandwidth based on your instance size. All but the few largest instance sizes will not achieve GB/s network bandwidth.”

              That’s true. Small instances also have relatively little RAM (e.g., 4 GB per core).

              “however, if it’s spread out across multiple files, you’re less likely to get fully sequential I/O.”

              This is absolutely true and worth noting.

  2. Jens Alfke says:

    Memory-mapping the file (e.g. with mmap) can give you the best of both worlds. You can code as though the file were all in memory, which is simpler, and you can even backtrack or peek forward if necessary. But the computer doesn’t have to copy the whole file into RAM.

    (Depending on the OS, you may lose some performance because the file is read in smaller chunks, possibly as small as 4KB. But in practice the kernel will usually see your access pattern and read ahead in bigger chunks, which btw gives you async I/O for free.)
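
    For illustration, a minimal sketch of this approach on a POSIX system might look as follows; the newline count is just a stand-in for real processing, and madvise() is the read-ahead hint mentioned above:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) { return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) != 0 || st.st_size == 0) { return 1; }
        /* map the whole file: no buffer is allocated and nothing is copied yet */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }
        /* ask the kernel to read ahead in larger chunks */
        madvise(data, st.st_size, MADV_SEQUENTIAL);
        size_t lines = 0;
        for (off_t i = 0; i < st.st_size; i++) { /* pages fault in on demand */
            if (data[i] == '\n') { lines++; }
        }
        printf("%zu lines\n", lines);
        munmap(data, st.st_size);
        close(fd);
        return 0;
    }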

    1. Memory mapping is a great approach, but underneath it is limited by the same fundamental principles. It does not bypass the CPU cache or RAM.

      (I am aware that you know this, but I am clarifying it.)

      1. Travis Downs says:

        I would argue that memory mapping is fundamentally different than the big allocate-read-process approach. Effectively it is like your suggested streaming approach: the memory only takes one trip to the inner circle.

        Consider the case where the mapped pages aren’t in the page cache. You map the pages, and only a small structure is created in the kernel recording the mapping (a so-called VMA in the case of Linux). Then you start processing the region: each time you get to a new 4k page, the kernel will map in that page, including reading it from storage. Then, in userspace, the page will be hot in cache.

        Some of the details may vary (e.g., the kernel may “read ahead” more than a single 4k page to make the IO faster), but these generally make things better, not worse.

        You can do the same analysis for the case where the pages are already hot in the page cache, and again this method is competitive with streaming.

        Or, as you usually prefer, you can test this. I have found the mmap approach generally somewhat faster than the read()-based streaming approach, but this varies a bit by OS, kernel version, etc. Here’s a bit more on mmap vs read() — more from the side of why mmap isn’t just way better than read(), but rather competitive, with the winner decided by smaller factors.

        1. @Travis

          Suppose that I have a CSV file. It is huge. I memory map it. Then I load the CSV file into a tabular data structure, with parsed entries (think Excel spreadsheet).

          How does it help that the file was memory mapped?

          1. But I agree, of course, that if you are reading the file, say line by line, a memory map may be effectively equivalent to code that looks like line-by-line reading. (I would expect the memory map to be less efficient in practice though it does not have to be.)

            What I am warning people against is the belief that memory mapping is somehow intrinsically different from loading the data into memory.

          2. Travis Downs says:

            Sure, let’s be explicit about it using your example.

            Let’s read a spreadsheet of size N on a system with a single cache level of size C. As I understand it, your point is that “loading” (reading from disk and/or page cache) the whole file, prior to processing the entire file is slow because most bytes in the file must be brought into RAM at least twice, once when the file is loaded and once when it is processed.

            To be even more explicit, in pseudo-code, the “big load” scenario looks something like this, I think:

            // allocate a buffer of size N (size of the file)
            // we assume here this doesn't do much
            // work, it doesn't touch N bytes!
            byte buffer[] = allocate(file.size);

            // read the entire file into the buffer
            // this necessarily brings all N bytes of the
            // file into RAM, but only at most C bytes
            // can remain in cache, since that's the size
            // of the cache
            read(file.path, buffer);

            // now, parse the array of bytes into the
            // in-memory DOM or whatever you want
            // to call it. This necessarily touches all N
            // bytes again, so brings at least N - C bytes
            // into cache. We assume N >> C in the "big"
            // scenario, so N - C ~= N.
            spreadsheet_DOM = parse(buffer);

            Does that look right?

            As an alternative, you suggest that you process the file in chunks, which might look like:

            byte buffer[C / 2]; // a buffer half the size of the cache

            // start with an empty DOM
            spreadsheet_DOM = empty();

            for (c = 0; c < file.size; c += C / 2) {
                // read the next C / 2 bytes of the file
                read(file.path, buffer);
                // parse the new bytes in the buffer into cells
                // and add them to the spreadsheet
                spreadsheet_DOM.add_cells(buffer);
            }

            Here, we only bring bytes into memory once, in the read call; they stay in cache for the subsequent add_cells call. Performance will be much better (perhaps, surprisingly, even more than 2x better, as your test shows).

            So we are calling the second case “fundamentally different” from the first, right? You say that using mmap will be like the first in the fundamental sense of memory movement, and I say it will be like the second, right?

            We might as well agree on that before looking at the mmap-the-entire-file scenario.
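
            For concreteness, a runnable version of that chunked loop could look roughly like this; the byte sum is a stand-in for add_cells(), and the 256 KB chunk size is an arbitrary stand-in for “half the cache”:

            #include <fcntl.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <unistd.h>

            #define CHUNK (256 * 1024) /* stand-in for "half the cache" */

            int main(int argc, char **argv) {
                if (argc < 2) { return 1; }
                int fd = open(argv[1], O_RDONLY);
                if (fd < 0) { perror("open"); return 1; }
                unsigned char *buffer = malloc(CHUNK);
                unsigned long long sum = 0;
                ssize_t n;
                /* each chunk is processed while it is still hot in cache */
                while ((n = read(fd, buffer, CHUNK)) > 0) {
                    for (ssize_t i = 0; i < n; i++) {
                        sum += buffer[i]; /* stand-in for add_cells(buffer) */
                    }
                }
                printf("sum = %llu\n", sum);
                free(buffer);
                close(fd);
                return 0;
            }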

            1. Travis Downs says:

              Sorry, when I said “must be brought into RAM at least twice” I meant “brought from RAM to the core at least twice”.

            2. No, I do not agree with the position you attribute to me. I agree with the position you attribute to yourself.

              You are missing the step here where you actually use the data. You are only deserializing. So try adding a step where you sum up the columns for example.

              1. Travis Downs says:

                I see. Yes, then I agree: if you have more than 1 processing operation (including here de-serialization in that category) that wants to iterate over the in-memory data (including output from a previous step), mmap won’t help you avoid bringing the entire data in at least once. It only helps for the phase directly following the mmap’ing (which I think is all you can ask from it).

                In that sense it is different: if you can organize your processing, including deserialization, into a single pass, mmap’ing the entire file will allow the whole thing to be streaming, while read()ing into a large buffer won’t. So my point stands if you are able to write a deserialize_and_sum() operation that works on a large buffer in a single pass.
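
                A minimal sketch of such a single-pass deserialize_and_sum() over a memory-mapped file, assuming a toy format of one integer per line:

                #include <fcntl.h>
                #include <stdio.h>
                #include <sys/mman.h>
                #include <sys/stat.h>
                #include <unistd.h>

                int main(int argc, char **argv) {
                    if (argc < 2) { return 1; }
                    int fd = open(argv[1], O_RDONLY);
                    if (fd < 0) { perror("open"); return 1; }
                    struct stat st;
                    if (fstat(fd, &st) != 0 || st.st_size == 0) { return 1; }
                    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
                    if (data == MAP_FAILED) { perror("mmap"); return 1; }
                    long long sum = 0, value = 0;
                    /* deserialize and sum in the same pass: each page is touched
                       once, right after the kernel faults it in */
                    for (off_t i = 0; i < st.st_size; i++) {
                        char c = data[i];
                        if (c >= '0' && c <= '9') {
                            value = value * 10 + (c - '0');
                        } else if (c == '\n') {
                            sum += value;
                            value = 0;
                        }
                    }
                    printf("sum = %lld\n", sum);
                    munmap((void *)data, st.st_size);
                    close(fd);
                    return 0;
                }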

  3. Indeed, so much so that even for data that is processed by looking at other data in the same file (in your spreadsheet example, another column), we work out how far ahead/behind we need to look, read a window of data only that wide, and then read single columns and process what is in memory.

    We typically perform 2-5 different mathematical data processes on each column of data, and each is performed in its own thread, so we observe some multi-processing taking place even against a slow disk read and write at each end.

    Not only does this approach allow us to deal with data sets that are as large as can fit on disk, it also allows us to accurately predict memory requirements up front (and keeps them very small).
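
    A minimal sketch of that windowed idea (leaving out the per-column worker threads for brevity); the window size and the one-value-per-line input are illustrative assumptions:

    #include <stdio.h>

    #define WINDOW 8 /* how far back the computation needs to look */

    /* Reads one value per line from stdin and prints a trailing moving sum,
       holding only WINDOW values in memory at any time, so memory use is
       fixed and known up front regardless of the size of the input. */
    int main(void) {
        double ring[WINDOW] = {0};
        unsigned long long count = 0;
        double value;
        while (scanf("%lf", &value) == 1) {
            ring[count % WINDOW] = value;
            count++;
            if (count >= WINDOW) {
                double sum = 0;
                for (int i = 0; i < WINDOW; i++) { sum += ring[i]; }
                printf("%f\n", sum);
            }
        }
        return 0;
    }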