Daniel Lemire's blog


Parsing text files is CPU bound

9 thoughts on “Parsing text files is CPU bound”

  1. @KWillets

    “Compression isn’t just for I/O bandwidth.”

    I agree.

    “Getting optimally-sized operands is also a benefit, although I haven’t seen it researched much.”

    Compression can be used to reduce the number of CPU cycles used. Definitely.

    “If you change the question from ‘what is the minimum number of bits needed to store this data’ to ‘what is the minimum number of bits needed to construct the output’, e.g., a join or intersection, etc., it becomes more interesting.”

    The minimum size of the output is used as a lower bound on the complexity of the problem in algorithm design. So it can be used to show that you have an optimal algorithm.
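
    A minimal sketch of the operand-sizing point, assuming values that happen to fit in a byte; the functions are placeholders, not code from the post. The narrower representation moves one eighth of the memory and lets vectorized code cover more values per instruction.

        #include <cstdint>
        #include <numeric>
        #include <vector>

        // The same logical data stored in wide vs. narrow operands. If the values
        // fit in a byte, the narrow version touches one eighth of the memory and
        // lets vectorized code cover more values per instruction.
        std::uint64_t sum_wide(const std::vector<std::uint64_t>& v) {
            return std::accumulate(v.begin(), v.end(), std::uint64_t{0});
        }

        std::uint64_t sum_narrow(const std::vector<std::uint8_t>& v) {
            return std::accumulate(v.begin(), v.end(), std::uint64_t{0});
        }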

  2. @Bannister

    “How big was your input file?”

    Several GiBs. I work with large files.

    “You should be running any benchmark more than once to get consistent times. The input file must be bigger than memory so that it is not cached in memory by the operating system.”

    This is from running a “top -o cpu” command in another process while the program is running.

    My observations are not meant to be scientific, but I can tell you that optimizing my CSV parsing code helped speed it up quite a bit. Had the process been “very I/O bound”, optimizing the code would not have mattered. My code is somewhere on Google Code (it is open source).

    “Unless you have an insanely fast disk subsystem (not something you find in usual desktops), reading anything as simple as CSV files should be very I/O-bound, and not CPU-bound.”

    It should. Shouldn’t it?

    I made a blog post out of it because I find all of this puzzling. I’d be glad to see someone do a follow-up analysis. Maybe someone could prove me wrong? I’d like that.
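
    One way to check where the time goes is to time a pass that merely streams the file against a pass that streams and splits it. A minimal sketch along these lines, where “data.csv” is a placeholder and the comma split stands in for a real CSV parser:

        #include <chrono>
        #include <cstddef>
        #include <fstream>
        #include <iostream>
        #include <sstream>
        #include <string>

        int main() {
            const char* path = "data.csv";  // placeholder file name

            // Pass 1: stream the file without parsing it.
            const auto t0 = std::chrono::steady_clock::now();
            std::ifstream raw(path);
            std::string line;
            std::size_t bytes = 0;
            while (std::getline(raw, line)) bytes += line.size();
            const auto t1 = std::chrono::steady_clock::now();

            // Pass 2: stream the file and split every line on commas.
            // As noted above, the file should be larger than RAM (or the page
            // cache dropped between passes) so the timings reflect the disk.
            std::ifstream parsed(path);
            std::size_t fields = 0;
            while (std::getline(parsed, line)) {
                std::stringstream ss(line);
                std::string field;
                while (std::getline(ss, field, ',')) ++fields;
            }
            const auto t2 = std::chrono::steady_clock::now();

            std::cout << "read only:      " << std::chrono::duration<double>(t1 - t0).count() << " s\n";
            std::cout << "read and split: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
            std::cout << bytes << " bytes, " << fields << " fields\n";
            return 0;
        }

    If the second timing is several times the first, the parsing itself is the bottleneck.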

  3. KWillets says:

    Compression isn’t just for I/O bandwidth. Getting optimally-sized operands is also a benefit, although I haven’t seen it researched much.

    If you change the question from “what is the minimum number of bits needed to store this data” to “what is the minimum number of bits needed to construct the output”, e.g., a join or intersection, etc., it becomes more interesting.

  4. How big was your input file?

    You should be running any benchmark more than once to get consistent times. The input file must be bigger than memory so that it is not cached in memory by the operating system.

    Unless you have an insanely fast disk subsystem (not something you find in usual desktops), reading anything as simple as CSV files should be very I/O-bound, and not CPU-bound.

  5. Actually, the binary XML people stress both compactness and processing efficiency. Some of the binary XML use cases are CPU bound because they have fast networks that are not congested. Others are I/O bound because they have slow, wireless, or congested networks.

    The EXI evaluation document you referenced is an early draft that does not yet include processing efficiency measurements. The next draft will include these measurements. In the meantime, see http://www.w3.org/TR/exi-measurements/ for a complete collection of binary XML compactness and processing efficiency measurements, including some taken over high-speed and wireless networks.

    Also, see http://www.agiledelta.com/efx_perffeatures.html for compactness and processing speed measurements from a commercial EXI implementation.

    All the best,

    John

  6. KWillets says:

    I can’t find the code.

  7. Matt says:

    “Unless you have an insanely fast disk subsystem (not something you find in usual desktops), reading anything as simple as CSV files should be very I/O-bound, and not CPU-bound.”

    Please refrain from using words such as “insanely” 🙂 Let’s put a number on it, shall we? Suppose we have one of these new SSD drives doing 200 MB/s.
    On a 2 GHz CPU, that gives you 10 cycles to process a single byte of data (2 × 10⁹ cycles per second divided by 200 × 10⁶ bytes per second). That is not very much, considering that with those 10 cycles you need to handle String encoding (UTF-8, etc.), as well as Integer, Number, Date and Boolean encoding.

    Looking at these numbers, you see why reading CSV files is nearly always CPU-bound, even with high-end storage subsystems.
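
    To make the arithmetic explicit, a tiny sketch using the same assumed numbers:

        #include <cstdio>

        int main() {
            const double cpu_hz = 2.0e9;            // assumed 2 GHz core, as above
            const double disk_bytes_per_s = 200e6;  // assumed 200 MB/s drive, as above
            // Cycle budget per incoming byte: 2e9 / 200e6 = 10.
            std::printf("cycles per byte: %.1f\n", cpu_hz / disk_bytes_per_s);
            return 0;
        }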

  8. Matt, you will find I am very intentional in my use of language. 🙂

    What is the definition of sanity? Sanity is defined relative to what most people do most of the time. (Which might seem a bit odd – but that is another discussion.)

    When coming up with a solution, you have to make “sane” assumptions about the design space. You cannot assume unlikely resources (at least not usually). You have to assume what most of the target hardware will have, most of the time.

    If we are interested in the performance of parsing large files, that would be because we *have* large files. Most likely we have a series of very large files, generated as a part of some repeating process. (Otherwise the problem becomes uninteresting.)

    Combine the above two considerations with the cost of SSD storage, and you have an awkward fit. Also, after poking around a bit on the web, it looks as though the *sustained* read rates for SSDs are currently around 40-100 MB/s. (We are not interested in burst rates.)

    Yes, there have always been high-end (and expensive!) storage systems that deliver much higher than average performance. If you are designing a solution for a general population of machines, then it would be insane to assume performance present in only a very small minority.

    So the use of “insanely fast” can carry exactly the right freight when talking about a design. 🙂

  9. Steven says:

    If I had to wager a guess, I’d say that all the memory management for storing the results is to blame – malloc()/free() are notoriously slow, so calling them the hundreds of millions of times a multi-gigabyte file requires is very likely the problem. Try using a static structure for the output, or else just discard the output entirely (still computing it, of course), and see what happens.
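
    A minimal sketch of that suggestion, assuming a plain comma split with no quoting; the names are placeholders, not the parser from the post.

        #include <cstddef>
        #include <string>
        #include <utility>
        #include <vector>

        // Allocation-heavy version: one heap-allocated std::string per field, per row.
        std::vector<std::string> split_alloc(const std::string& line) {
            std::vector<std::string> out;
            std::size_t start = 0;
            for (std::size_t i = 0; i <= line.size(); ++i) {
                if (i == line.size() || line[i] == ',') {
                    out.emplace_back(line, start, i - start);  // allocates
                    start = i + 1;
                }
            }
            return out;
        }

        // Reuse version: record (begin, end) offsets into the existing line,
        // clearing (not freeing) the same buffer between rows. clear() keeps
        // the capacity, so after warm-up there is no per-row allocation.
        void split_spans(const std::string& line,
                         std::vector<std::pair<std::size_t, std::size_t>>& spans) {
            spans.clear();
            std::size_t start = 0;
            for (std::size_t i = 0; i <= line.size(); ++i) {
                if (i == line.size() || line[i] == ',') {
                    spans.emplace_back(start, i);
                    start = i + 1;
                }
            }
        }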