Daniel Lemire's blog


Parsing CSV files is CPU bound: a C++ test case (Update 1)

6 thoughts on “Parsing CSV files is CPU bound: a C++ test case (Update 1)”

  1. You’re still doing two heap allocations (one for the string object, one for the actual char buffer)

    No. I do not. Here is the code:


    c.resize(fieldlength);
    c.assign(str, lastPos, fieldlength);
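
    To make the point concrete, here is a minimal sketch of the reuse pattern (the loop and the parseLine name are illustrative, not my exact code): once the string has grown to the largest field seen so far, assign copies into the existing buffer and no further heap allocation happens.

    // Illustrative sketch: split one comma-separated line into preallocated
    // string slots, reusing each string's buffer across calls.
    #include <cstddef>
    #include <string>
    #include <vector>

    void parseLine(const std::string& str, std::vector<std::string>& fields) {
        // fields is assumed to be presized to the number of columns (4 in my case)
        std::string::size_type lastPos = 0;
        std::size_t k = 0;
        while (k < fields.size()) {
            std::string::size_type pos = str.find(',', lastPos);
            std::string::size_type fieldlength =
                (pos == std::string::npos ? str.size() : pos) - lastPos;
            std::string& c = fields[k++];
            c.resize(fieldlength);               // no reallocation once capacity suffices
            c.assign(str, lastPos, fieldlength); // copy into the existing buffer
            if (pos == std::string::npos) break;
            lastPos = pos + 1;
        }
    }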

    Also vector resizing is slow (O(n)).

    No, absolutely not. Vector resize is amortized O(1).

    Let me throw around a few numbers: lzop compression is 150 MB/s

    I think not. More like 40 MB/s.

    $ time lzop -p ./netflix.csv

    real 0m50.057s
    user 0m22.245s
    sys 0m4.338s

    Try harder.

    Care to provide actual code? Or provide a patch for my code?

  2. vicaya says:

    But “despite these changes” is only one minor change to reduce one string alloc/copy per field. You’re still doing two heap allocations (one for the string object, one for the actual char buffer) and one copy *per field*, which is a far cry from the amortized 1 alloc + copy *per file* goal. Every heap allocation involves a mutex lock/unlock, which is expensive even without contention. Also, vector resizing is slow (O(n)) and needs to reconstruct all the string elements. So, depending on the number of strings in the vector, you’re doing 2 heap allocations and a copy for every element, multiple times during a run.
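
    A rough sketch of what I mean by one allocation per file (illustrative only, not a patch for your code): read the whole file into a single buffer and keep each field as an offset/length pair into that buffer, so parsing does no per-field allocation or copy.

    // Illustrative sketch: one big buffer, fields stored as indices into it.
    #include <cstddef>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Field { std::size_t offset, length; };

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        std::FILE* f = std::fopen(argv[1], "rb");
        if (!f) return 1;
        std::fseek(f, 0, SEEK_END);
        std::string buffer(static_cast<std::size_t>(std::ftell(f)), '\0');
        std::fseek(f, 0, SEEK_SET);
        std::fread(&buffer[0], 1, buffer.size(), f); // the one big copy
        std::fclose(f);

        std::vector<Field> fields; // offsets into buffer, not string copies
        std::size_t start = 0;
        for (std::size_t i = 0; i < buffer.size(); ++i) {
            if (buffer[i] == ',' || buffer[i] == '\n') {
                fields.push_back(Field{start, i - start});
                start = i + 1;
            }
        }
        if (start < buffer.size()) // last field if the file lacks a trailing newline
            fields.push_back(Field{start, buffer.size() - start});
        std::printf("%zu fields\n", fields.size());
        return 0;
    }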

    A 2 GB file in 26.55 s is 75 MB/s, which is about right for the top speed of a sequential read on a single modern hard drive not doing anything else (otherwise it will slow down dramatically). 2 GB in 96 s, which is about 21 MB/s, is very slow for parsing. Let me throw around a few numbers: lzop compression is 150 MB/s, and decompression can exceed 500 MB/s on a 2.6 GHz Core 2 Duo processor (see benchmarks on http://www.quicklz.com/).

    I’ve written an indexer (including HTML parsing, word-breaking, and constructing small in-memory inverted indices; the in-memory indices do need to be small enough to fit into the L2 cache, otherwise speed drops dramatically) that has a throughput near 100 MB/s.

    Try harder.

  3. Try to compress something that fits completely into memory multiple times and make sure it doesn’t write to disk (pipe to /dev/null)

    I gained two seconds:


    $ time lzop -c ./netflix.csv > /dev/null

    real 0m47.757s
    user 0m22.230s
    sys 0m2.577s

    Vector resize is definitely not amortized O(1), more like O(log n), because every time the storage is not enough it is doubled. (What I meant by O(n) is that when the storage is not enough, the resize operation is O(n): you need to allocate new storage, copy-construct all the elements into the new storage, and delete the old storage. You have to do that O(log n) times.)

    What is large is the number of rows, not the field size. But let us say that n is the size of a field. Ok, but n is no larger than 16, so log n = 4 (base 2). Four (4) is a fixed number. I have millions of rows. If you amortize 4 over millions of rows, it no longer matters.
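
    If you want to check the amortization yourself, here is a small sketch (illustrative only) that counts reallocations and element copies while a vector grows: with geometric growth the total number of copies stays proportional to n, that is, O(1) amortized per push_back, even though each individual reallocation is linear in the current size.

    // Illustrative sketch: count reallocations and element copies while
    // growing a vector; the exact growth factor is implementation-defined.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> v;
        std::size_t reallocations = 0, copies = 0, cap = v.capacity();
        for (std::size_t i = 0; i < 1000000; ++i) {
            v.push_back(static_cast<int>(i));
            if (v.capacity() != cap) {  // a reallocation happened
                ++reallocations;
                copies += v.size() - 1; // old elements moved to the new storage
                cap = v.capacity();
            }
        }
        std::printf("reallocations: %zu, total copies: %zu, n: %zu\n",
                    reallocations, copies, v.size());
        return 0;
    }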

    Changing your vector to a deque should make your program a lot faster even without a custom allocator, which is required for the next speed-up.

    I do not see why this would be true. The vector contains 4 strings. It is hard to beat a simple vector when you are storing only 4 strings.

    Make sure you compile your code with -O3 if you’re using g++

    -O3 makes things slower.

  4. vicaya says:

    What’s the size of the physical RAM on your box? Your lzop number is certainly bound by your I/O speed. Try to compress something that fits completely into memory multiple times and make sure it doesn’t write to disk (pipe to /dev/null).

    Vector resize is definitely not amortized O(1), more like O(log n), because every time the storage is not enough it is doubled. (What I meant by O(n) is that when the storage is not enough, the resize operation is O(n): you need to allocate new storage, copy-construct all the elements into the new storage, and delete the old storage. You have to do that O(log n) times.)

    Changing your vector to a deque should make your program a lot faster even without a custom allocator, which is required for the next speed-up.

    Make sure you compile your code with -O3 if you’re using g++

  5. vicaya says:

    Sorry, I misread your code. I thought you put all the fields in the vector instead of just doing a simple split 🙂 Your code should be a lot faster than 20 MB/s on modern hardware.

    Can you post your platform specs? uname -a, gcc -v? CPU?

    Your numbers just don’t jibe with my experience. Your own code runs faster with -O3 on a two-year-old MacBook Pro:

    $ ./po2 largeuni.csv
    without parsing: 0.107499
    with parsing: 0.256668

    $ ./po3 largeuni.csv
    without parsing: 0.103201
    with parsing: 0.245105

    If you want to get to the bottom of it, can you generate a small set of CSV data (say 10 MB, enough for testing parsing speed) that everyone can verify? I’ll take a stab at the code.

  6. Can you post your platform specs? uname -a, gcc -v? CPU?

    I use a standard Mac Pro desktop. The specs are described in this recent paper:

    * Owen Kaser, Daniel Lemire, Kamel Aouiche, Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes, DOLAP 2008, 2008.
    http://arxiv.org/abs/0808.2083

    If you want to get to the bottom of it, can you generate a small set of CSV data (say 10 MB, enough for testing parsing speed) that everyone can verify? I’ll take a stab at the code.

    Just generate any large CSV file. If you use something “typical” (small fields, about 4 of them, and lots of rows), it should not matter which file you use exactly.
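
    For what it is worth, here is a minimal sketch of a generator for such a “typical” file (the field values are made up; only the shape of the data matters): four small fields per row, as many rows as you like.

    // Illustrative sketch: write a CSV file with four small fields per row.
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
        const long rows = (argc > 1) ? std::atol(argv[1]) : 10000000L;
        std::FILE* out = std::fopen("test.csv", "w");
        if (!out) return 1;
        for (long i = 0; i < rows; ++i) {
            // four short fields: two ids, a small rating, and a date-like string
            std::fprintf(out, "%ld,%ld,%ld,2008-%02ld-%02ld\n",
                         i % 100000, i % 20000, i % 5 + 1, i % 12 + 1, i % 28 + 1);
        }
        std::fclose(out);
        return 0;
    }

    With ten million rows this produces a few hundred megabytes, which should be more than enough to measure parsing speed.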