Daniel Lemire's blog


Parsing CSV files is CPU bound: a C++ test case (Update 1)

6 thoughts on “Parsing CSV files is CPU bound: a C++ test case (Update 1)”

  1. You’re still doing two heap allocations (one for the string object, one for the actual char buffer)

    No. I do not. Here is the code:


    c.resize(fieldlength);
    c.assign(str, lastPos, fieldlength);
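
    To make the point concrete, here is a minimal sketch of the reuse pattern (the loop and the parseLine name are illustrative, not my exact code): once the string has grown to the largest field seen so far, assign copies into the existing buffer and no further heap allocation happens.

    // Illustrative sketch: split one comma-separated line into preallocated
    // string slots, reusing each string's buffer across calls.
    #include <cstddef>
    #include <string>
    #include <vector>

    void parseLine(const std::string& str, std::vector<std::string>& fields) {
        // fields is assumed to be presized to the number of columns (4 in my case)
        std::string::size_type lastPos = 0;
        std::size_t k = 0;
        while (k < fields.size()) {
            std::string::size_type pos = str.find(',', lastPos);
            std::string::size_type fieldlength =
                (pos == std::string::npos ? str.size() : pos) - lastPos;
            std::string& c = fields[k++];
            c.resize(fieldlength);               // no reallocation once capacity suffices
            c.assign(str, lastPos, fieldlength); // copy into the existing buffer
            if (pos == std::string::npos) break;
            lastPos = pos + 1;
        }
    }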

    Also vector resizing is slow (O(n)).

    No, absolutely not. Vector resize is amortized O(1).

    Let me throw around a few numbers: lzop compression is 150 MB/s

    I think not. More like 40 MB/s.

    $ time lzop -p ./netflix.csv

    real 0m50.057s
    user 0m22.245s
    sys 0m4.338s

    Try harder.

    Care to provide actual code? Or provide a patch for my code?

  2. vicaya says:

    But “despite these changes” is only one minor change to reduce one string alloc/copy per field. You’re still doing two heap allocations (one for the string object, one for the actual char buffer) and one copy *per field*, which is a far cry from the amortized 1 alloc + copy *per file* goal. Every heap allocation involves a mutex lock/unlock, which is expensive even without contention. Also, vector resizing is slow (O(n)) and needs to reconstruct all the string elements. So, depending on the number of strings in the vector, you’re doing 2 heap allocations and a copy for every element, multiple times during a run.
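
    A rough sketch of what I mean by one allocation per file (illustrative only, not a patch for your code): read the whole file into a single buffer and keep each field as an offset/length pair into that buffer, so parsing does no per-field allocation or copy.

    // Illustrative sketch: one big buffer, fields stored as indices into it.
    #include <cstddef>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Field { std::size_t offset, length; };

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        std::FILE* f = std::fopen(argv[1], "rb");
        if (!f) return 1;
        std::fseek(f, 0, SEEK_END);
        std::string buffer(static_cast<std::size_t>(std::ftell(f)), '\0');
        std::fseek(f, 0, SEEK_SET);
        std::fread(&buffer[0], 1, buffer.size(), f); // the one big copy
        std::fclose(f);

        std::vector<Field> fields; // offsets into buffer, not string copies
        std::size_t start = 0;
        for (std::size_t i = 0; i < buffer.size(); ++i) {
            if (buffer[i] == ',' || buffer[i] == '\n') {
                fields.push_back(Field{start, i - start});
                start = i + 1;
            }
        }
        if (start < buffer.size()) // last field if the file lacks a trailing newline
            fields.push_back(Field{start, buffer.size() - start});
        std::printf("%zu fields\n", fields.size());
        return 0;
    }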

    A 2 GB file in 26.55 s is 75 MB/s, which is about right for the top speed of a sequential read on a single modern hard drive not doing anything else (otherwise it will slow down dramatically). 2 GB in 96 s, which is about 21 MB/s, is very slow for parsing. Let me throw around a few numbers: lzop compression is 150 MB/s, and decompression can exceed 500 MB/s on a 2.6 GHz Core 2 Duo processor (see benchmarks on http://www.quicklz.com/).

    I’ve written an indexer (including HTML parsing, word-breaking, and constructing small in-memory inverted indices; the in-memory indices do need to be small enough to fit into the L2 cache, otherwise speed drops dramatically) that has a throughput near 100 MB/s.

    Try harder.

  3. Try to compress something that fits completely into memory multiple times and make sure it doesn’t write to disk (pipe to /dev/null)

    I gained two seconds:


    $ time lzop -c ./netflix.csv > /dev/null

    real 0m47.757s
    user 0m22.230s
    sys 0m2.577s

    Vector resize is definitely not amortized O(1), more like O(log n), because every time the storage is not enough it is doubled. (What I meant by O(n) is that when the storage is not enough, the resize operation is O(n): you need to allocate new storage, copy-construct all the elements into the new storage, and delete the old storage. You have to do that O(log n) times.)

    What is large is the number of rows, not the field size. But let us say that n is the size of a field. Ok, but n is no larger than 16, so log n = 4 (base 2). Four (4) is a fixed number. I have millions of rows. If you amortize 4 over millions of rows, it no longer matters.
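
    If you want to check the amortization yourself, here is a small sketch (illustrative only) that counts reallocations and element copies while a vector grows: with geometric growth the total number of copies stays proportional to n, that is, O(1) amortized per push_back, even though each individual reallocation is linear in the current size.

    // Illustrative sketch: count reallocations and element copies while
    // growing a vector; the exact growth factor is implementation-defined.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> v;
        std::size_t reallocations = 0, copies = 0, cap = v.capacity();
        for (std::size_t i = 0; i < 1000000; ++i) {
            v.push_back(static_cast<int>(i));
            if (v.capacity() != cap) {  // a reallocation happened
                ++reallocations;
                copies += v.size() - 1; // old elements moved to the new storage
                cap = v.capacity();
            }
        }
        std::printf("reallocations: %zu, total copies: %zu, n: %zu\n",
                    reallocations, copies, v.size());
        return 0;
    }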

    Changing your vector to a deque should make your program a lot faster even without a custom allocator, which is required for the next speed-up.

    I do not see why this would be true. The vector contains 4 strings. It is hard to beat a simple vector when you are storing only 4 strings.

    Make sure you compile your code with -O3 if you’re using g++

    -O3 makes things slower.

  4. vicaya says:

    What’s the size of the physical RAM on your box? Your lzop number is certainly bound by your I/O speed. Try to compress something that fits completely into memory multiple times and make sure it doesn’t write to disk (pipe to /dev/null).

    Vector resize is definitely not amortized O(1), more like O(log n), because every time the storage is not enough it is doubled. (What I meant by O(n) is that when the storage is not enough, the resize operation is O(n): you need to allocate new storage, copy-construct all the elements into the new storage, and delete the old storage. You have to do that O(log n) times.)

    Changing your vector to a deque should make your program a lot faster even without a custom allocator, which is required for the next speed-up.

    Make sure you compile your code with -O3 if you’re using g++

  5. vicaya says:

    Sorry, I misread your code. I thought you put all the fields in the vector instead of just doing a simple split 🙂 Your code should be a lot faster than 20 MB/s on modern hardware.

    Can you post your platform specs? uname -a, gcc -v? CPU?

    Your numbers just don’t jibe with my experience. Your own code runs faster with -O3 on a two-year-old MacBook Pro:

    $ ./po2 largeuni.csv
    without parsing: 0.107499
    with parsing: 0.256668

    $ ./po3 largeuni.csv
    without parsing: 0.103201
    with parsing: 0.245105

    If you want to get to the bottom of it, can you generate a small set of CSV data (say 10 MB, enough for testing parsing speed) that everyone can verify? I’ll take a stab at the code.

  6. Can you post your platform specs? uname -a, gcc -v? CPU?

    I use a standard Mac Pro desktop. The specs are described in this recent paper:

    * Owen Kaser, Daniel Lemire, Kamel Aouiche, Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes, DOLAP 2008, 2008.
    http://arxiv.org/abs/0808.2083

    If you want to get to the bottom of it, can you generate a small set of CSV data (say 10 MB, enough for testing parsing speed) that everyone can verify? I’ll take a stab at the code.

    Just generate any large CSV file. If you use something “typical” (small fields, about 4 of them, and lots of rows), it should not matter which file you use exactly.
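
    For what it is worth, here is a minimal sketch of a generator for such a “typical” file (the field values are made up; only the shape of the data matters): four small fields per row, as many rows as you like.

    // Illustrative sketch: write a CSV file with four small fields per row.
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
        const long rows = (argc > 1) ? std::atol(argv[1]) : 10000000L;
        std::FILE* out = std::fopen("test.csv", "w");
        if (!out) return 1;
        for (long i = 0; i < rows; ++i) {
            // four short fields: two ids, a small rating, and a date-like string
            std::fprintf(out, "%ld,%ld,%ld,2008-%02ld-%02ld\n",
                         i % 100000, i % 20000, i % 5 + 1, i % 12 + 1, i % 28 + 1);
        }
        std::fclose(out);
        return 0;
    }

    With ten million rows this produces a few hundred megabytes, which should be more than enough to measure parsing speed.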