vicaya says:
Major performance problems of the code:
0. getline does one heap allocation and copy for every line.
1. tokenize does a heap allocation and copy *per value* via the token string.
2. vector of strings also does an *extra* heap allocation and copy to copy-construct *each element*.
3. Shorter, completely I/O-bound code (even for files larger than physical memory) can be written using a custom allocator (~10 lines of reusable code) that does only an amortized 1 heap allocation and memcpy *per file*. Remember that you’re using C++ not Java 🙂 The solution is left as an exercise for the readers 🙂
You are most certainly right if you think that memory allocation is guilty. However, I specifically defined “parsing” as “copying the fields into new arrays”.
There is no question that if I just read the bytes and do nothing with them, the program will not end up being CPU bound, but that is hardly representative of a real application, is it? Copying the fields and storing them into some array seems to me to be a basic operation.
1. tokenize does heap allocation and copy *per value* via token string.
Heap allocation is avoided in latest version.
2. vector of strings also does an *extra* heap allocation and copy to copy-construct *each element*.
Fixed this in latest version.
3. Shorter, completely I/O-bound code (even for files larger than physical memory) can be written using a custom allocator (~10 lines of reusable code) that does only an amortized 1 heap allocation and memcpy *per file*. Remember that you’re using C++ not Java 🙂 The solution is left as an exercise for the readers 🙂
I wrote in my blog post:
«I do not claim that writing software where CSV parsing is strongly I/O bound is not possible, or even easy.»