There are a couple of things that might speed things up. One is to avoid conventional I/O and mmap the file. Then you need a small DFA to parse the CSV format in memory.
Eventually you should get an inner loop that looks something like:
while( state = dfa[state].edges[*p++] )
;
(State 0 would be the end/error state.)
There are a few tricks to doing this, like padding the end of the mmap with sentinel characters to drive the DFA into the end state and end the loop. Or you can add a counter to bounds-check; the cpu may be able to ILP the extra instructions.
I once did something similar, with a trie as the DFA, and it was quite fast. There are two main factors, IMHO, which are important: low instruction count, and low branch count. If your inner loop is constantly checking if it’s at the end of the input buffer, and checking other loop counters or termination conditions, the CPU’s branch prediction will degrade severely.
You can also get into SSE instructions, and there are a few things with cache management that would probably be relevant.
To back up my earlier assertion, I wrote and tested a simple CSV-parser. The results exactly match my assertion: an efficient CSV parser is very much I/O-bound (not CPU-bound).
@KWillets
Er, did you benchmark using memory-mapped files? Last time I did, the results I got (on both Linux and Windows) was SLOWER than simple sequential file access.
This makes sense.
Memory-mapped file access is optimized for random access. Sequential file access is optimized for sequential access. Operating systems do sneaky things under the covers to optimize sequential I/O (a VERY common case).
(Not the first time I’ve run across this myth! Clearly not enough folks write benchmarks and collect measurements.)
Your code is indeed grossly inefficient. Furthermore, use of memory mapped files can actually reduce most of the I/O bound issues as processing of such files is typically a sequential task if done properly/correctly.
To see how it should be done correctly feel free to check out the following article:
There are a couple of things that might speed things up. One is to avoid conventional I/O and mmap the file. Then you need a small DFA to parse the CSV format in memory.
Eventually you should get an inner loop that looks something like:
while( state = dfa[state].edges[*p++] )
;
(State 0 would be the end/error state.)
There are a few tricks to doing this, like padding the end of the mmap with sentinel characters to drive the DFA into the end state and end the loop. Or you can add a counter to bounds-check; the cpu may be able to ILP the extra instructions.
I once did something similar, with a trie as the DFA, and it was quite fast. There are two main factors, IMHO, which are important: low instruction count, and low branch count. If your inner loop is constantly checking if it’s at the end of the input buffer, and checking other loop counters or termination conditions, the CPU’s branch prediction will degrade severely.
You can also get into SSE instructions, and there are a few things with cache management that would probably be relevant.
To back up my earlier assertion, I wrote and tested a simple CSV-parser. The results exactly match my assertion: an efficient CSV parser is very much I/O-bound (not CPU-bound).
The results and reference to the sources are at:
http://bannister.us/weblog/2008/12/21/performance-parsing-csv-data/
@KWillets
Er, did you benchmark using memory-mapped files? Last time I did, the results I got (on both Linux and Windows) was SLOWER than simple sequential file access.
This makes sense.
Memory-mapped file access is optimized for random access. Sequential file access is optimized for sequential access. Operating systems do sneaky things under the covers to optimize sequential I/O (a VERY common case).
(Not the first time I’ve run across this myth! Clearly not enough folks write benchmarks and collect measurements.)
@Bannister
Thank you. Maybe I will test out memory-mapped file later. For fun.
I think that more open discussions about these issues is important.
By “open” I mean “with open code”.
Yes, the memory-mapped version was benchmarked against regular i/o and was faster at the time, 6-7 years ago. I don’t know why either.
Your code is indeed grossly inefficient. Furthermore, use of memory mapped files can actually reduce most of the I/O bound issues as processing of such files is typically a sequential task if done properly/correctly.
To see how it should be done correctly feel free to check out the following article:
http://www.codeproject.com/Articles/23198/C-String-Toolkit-StrTk-Tokenizer
@Arash
Thanks for the link to your article.