16th December 2008, 1 min read

Parsing CSV files is CPU bound: a C++ test case

In "Parsing text files is CPU bound", I claimed that I had a C++ test case proving that parsing CSV files could be CPU bound. By CPU bound, I mean that the overhead of taking each line, finding out where the commas are, and storing copies of the fields in an array dominates the running time.

How do I test this theory? I read the file twice. The first time, I just read each line and report the time elapsed. The second time, I read each line, process it, and report the time elapsed. If the two times are similar, the process is I/O bound; if the second time is much larger, the process is CPU bound.

I get this result on a 2 GB file (numbers updated on Dec. 19, 2008):

$ ./parsecsv ./netflix.csv
without parsing: 26.55
with parsing: 95.99

Hence, parsing dominates the running time. At least in this case. At least with my C++ code.

Before you start arguing with me, please go download my reproducible test case. All you need is the GCC compiler. I tested it on two machines, with two different versions of GCC.

Note: I do not claim that this is professional benchmarking.

Reference: This quest started with a post by Matt Casters, in which he reported that you could parse a CSV file faster using two CPU cores instead of just one.