Parsing text files is CPU bound
Computer Science researchers often stress the importance of compression to get better performance. I believe this is a good illustration of an academic bias. Indeed, file size is easy to measure: it is oblivious to computer and CPU architectures. We even have a beautiful theory that tells you how far you stand from the optimal compression ratio. And better compression is important: YouTube would not be possible without aggressive video compression.
However, the following theorem is false:
Theorem 1. The time required to parse a file is proportional to its size.
Matt Casters showed that, using an open source data warehousing tool, parsing simple CSV files is CPU bound. That is, he gets (slightly) better parsing speed using two CPU cores to parse the file than using a single one. For a strongly I/O-bound process, using two threads to read a file would make things worse because it introduces random disk accesses.
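To make the two-core claim concrete, here is a toy sketch of my own (not Casters' code, and the helper count_fields is mine): the file is read into memory so the threads compete only for CPU, split at a newline near the midpoint, and each half is scanned by its own thread.

```cpp
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <thread>

// Count the comma-separated fields in buf[begin, end):
// each comma or newline terminates one field.
static size_t count_fields(const std::string &buf, size_t begin, size_t end) {
  size_t fields = 0;
  for (size_t i = begin; i < end; i++)
    if (buf[i] == ',' || buf[i] == '\n') fields++;
  return fields;
}

int main(int argc, char **argv) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " file.csv" << std::endl;
    return 1;
  }
  // Read the whole file into memory so that the two threads
  // only compete for CPU, not for the disk.
  std::ifstream in(argv[1], std::ios::binary);
  std::string buf((std::istreambuf_iterator<char>(in)),
                  std::istreambuf_iterator<char>());
  if (buf.empty()) return 0;
  // Split at the first newline past the midpoint so that neither
  // half starts in the middle of a record.
  size_t mid = buf.find('\n', buf.size() / 2);
  if (mid == std::string::npos) mid = buf.size() - 1;
  size_t f1 = 0, f2 = 0;
  std::thread t1([&] { f1 = count_fields(buf, 0, mid + 1); });
  std::thread t2([&] { f2 = count_fields(buf, mid + 1, buf.size()); });
  t1.join();
  t2.join();
  std::cout << (f1 + f2) << " fields" << std::endl;
  return 0;
}
```

Compile with g++ -O2 -pthread; if the work is CPU bound, the two-thread scan should finish faster than a one-thread scan of the same in-memory buffer.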
I decided to verify this claim. I have an optimized CSV parser written in C++. It may not be as fast as it could possibly be, but its 100 lines of C++ are reasonable. It is single-threaded. I launched the parser on a large CSV file and, sure enough, the command “top -o cpu” reported the process as using 100% of the CPU (on a dual-core system). So, yes, parsing CSV files is CPU bound!
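For readers who want to reproduce something like this, here is a minimal sketch in the same spirit (not my actual 100 lines): it streams the file line by line, splits each line on commas, and reports the elapsed time.

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char **argv) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " file.csv" << std::endl;
    return 1;
  }
  std::ifstream in(argv[1]);
  if (!in) {
    std::cerr << "cannot open " << argv[1] << std::endl;
    return 1;
  }
  auto start = std::chrono::steady_clock::now();
  std::string line;
  size_t lines = 0, fields = 0;
  // Stream the file line by line and split each line on commas.
  while (std::getline(in, line)) {
    ++lines;
    size_t begin = 0;
    for (;;) {
      size_t comma = line.find(',', begin);
      ++fields; // one field per comma-separated token
      if (comma == std::string::npos) break;
      begin = comma + 1;
    }
  }
  double secs = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - start).count();
  std::cout << lines << " lines, " << fields << " fields in "
            << secs << " s" << std::endl;
  return 0;
}
```

Run it on a large CSV file while watching top: the process should sit near 100% CPU, which is the symptom of a CPU-bound task.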
This has serious implications. For example, comparable tests will reveal that parsing large XML files is also CPU bound. Hence, it is a bit strange to see the binary XML people stress that they can compress XML by a factor of ten on average, while reporting little regarding CPU usage.
Note: Your mileage may vary. I do not claim that file parsing is always CPU bound. Also, compression is an important technique to accelerate databases.
Update. Here are the file characteristics: 4 columns, 2.6 GB in size.