Daniel Lemire's blog


Parsing CSV files is CPU bound: a C++ test case (Update 1)

(See update 2.)

In a recent blog post, I said that parsing simple CSV files could be CPU bound. By parsing, I mean reading the data from disk and copying it into an array. I also strip spurious white space from the field values.
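
For concreteness, here is a minimal sketch of the kind of in-place trimming involved. The helper below is my own illustration, not the code from the benchmark:

    #include <iostream>
    #include <string>

    // Illustrative helper (not from the benchmark): strip leading and
    // trailing white space from a field value, in place.
    static void trim(std::string & s) {
        const char * ws = " \t\r\n";
        size_t first = s.find_first_not_of(ws);
        if (first == std::string::npos) { s.clear(); return; } // all white space
        size_t last = s.find_last_not_of(ws);
        s.erase(last + 1);   // drop trailing white space
        s.erase(0, first);   // drop leading white space
    }

    int main() {
        std::string field = "  42 ";
        trim(field);
        std::cout << "[" << field << "]" << std::endl; // prints [42]
    }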

You can find my C++ code on my server.

A reader criticized my implementation as follows:

  • I use the C++ getline function to read the lines. The reader commented that “getline does one heap allocation and copy for every line.” I doubt that getline triggers a heap allocation on each call: I reuse the same string object for every call.
  • For each field value, I previously did two heap allocations and two copies. I now reuse the same string objects for the fields, thus limiting the number of heap allocations (see the sketch after this list).
  • The reader commented that I should use a custom allocator to avoid heap allocations. Currently, if the CSV file has x fields, I use x+1 string objects (a tiny number) and a small constant number of heap allocations.
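
Here is a minimal sketch of the buffer-reuse pattern from the first two points. It is illustrative, not the benchmark code: the sample rows, the comma delimiter, and the fixed three-field layout are my assumptions.

    #include <iostream>
    #include <string>
    #include <vector>

    // Sketch of the buffer-reuse pattern (illustrative, not the benchmark):
    // the line buffer and the field buffers are created once and reused
    // for every row, so heap allocations do not grow with the row count.
    int main() {
        std::vector<std::string> rows = {"a, b,c", "d,e , f"}; // stand-in for file lines
        std::string line;                     // reused line buffer
        std::vector<std::string> fields(3);   // reused field buffers (3 fields assumed)
        for (size_t r = 0; r < rows.size(); ++r) {
            line = rows[r]; // in the real code this would be std::getline(in, line)
            size_t begin = 0;
            for (size_t f = 0; f < fields.size(); ++f) {
                size_t comma = line.find(',', begin);
                size_t end = (comma == std::string::npos) ? line.size() : comma;
                fields[f].assign(line, begin, end - begin); // copies into existing buffer
                begin = (comma == std::string::npos) ? line.size() : comma + 1;
            }
            // white-space trimming omitted here to keep the sketch short
            std::cout << "row " << r << ": [" << fields[0] << "][" << fields[1]
                      << "][" << fields[2] << "]" << std::endl;
        }
    }

The key point is that fields[f].assign copies characters into an existing buffer, so after the first few rows no further heap allocations occur as long as field lengths do not grow.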

Despite these changes, I still get that parsing CSV files is strongly CPU bound:

$ ./parsecsv ./netflix.csv
without parsing: 26.55
with parsing: 95.99
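
The two numbers are running times for two passes over the file: one that only reads lines and one that also splits them into fields. Here is a rough sketch of how such a comparison could be produced; it is my reconstruction under that assumption, not the actual parsecsv program:

    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>

    // Time one pass over the file, optionally splitting each line into
    // fields. A sketch of the methodology, not the actual parsecsv code.
    static double timed_pass(const char * filename, bool parse) {
        std::ifstream in(filename);
        std::string line, field;
        auto start = std::chrono::steady_clock::now();
        while (std::getline(in, line)) {
            if (!parse) continue;
            size_t begin = 0;
            while (begin <= line.size()) {
                size_t comma = line.find(',', begin);
                size_t end = (comma == std::string::npos) ? line.size() : comma;
                field.assign(line, begin, end - begin); // reused field buffer
                if (comma == std::string::npos) break;
                begin = comma + 1;
            }
        }
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        return elapsed.count(); // wall-clock seconds
    }

    int main(int argc, char ** argv) {
        if (argc < 2) return 1;
        std::cout << "without parsing: " << timed_pass(argv[1], false) << "\n";
        std::cout << "with parsing: " << timed_pass(argv[1], true) << "\n";
    }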

However, doing away with the heap allocations on every line did reduce the parsing running time by a factor of two. It is not difficult to believe that the gap could be narrowed further. But I still see no evidence that parsing CSV files is strongly I/O bound, as some of my readers have stated. Consider that in real applications, I would need to convert field values to dates or to numerical values. I might also need to filter values, or to support fancier CSV formats.
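
To give an idea of that extra per-field work, here is a small sketch of numeric and date conversion. The sample values and the fixed YYYY-MM-DD layout are illustrative assumptions, not part of the benchmark:

    #include <cstdio>
    #include <cstdlib>
    #include <string>

    // Sketch of the per-field conversions a real application would add on
    // top of the raw parsing. The field values here are illustrative.
    int main() {
        std::string rating_field = "4";
        std::string date_field = "2005-09-06";
        // numeric conversion
        long rating = std::strtol(rating_field.c_str(), nullptr, 10);
        // date conversion, assuming a fixed YYYY-MM-DD layout
        int year = 0, month = 0, day = 0;
        if (std::sscanf(date_field.c_str(), "%d-%d-%d", &year, &month, &day) != 3) {
            return 1; // malformed date
        }
        std::printf("rating=%ld date=%04d-%02d-%02d\n", rating, year, month, day);
    }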

My experiments are motivated by a post by Matt Casters. Some said that Java was to blame. I use C++, and so far at least, I get a similar result. Can you tell me where I went wrong?

Note: Yet again, I do not claim that my code is nearly optimal. My exact claim is that parsing CSV files may be CPU bound using reasonable code. I find this very surprising.