As someone working on algorithms in bioinformatics, I often get the feeling that no computer ever has a sufficient amount of memory. When datasets have billions of elements or more, pointer-based trees and hash tables can’t always be used, because the pointers would require too much memory. Data is often stored in static arrays with batch updates, because dynamic updates would require too much space overhead.
My typical “map-like” data structures are based on (plain or compressed) bitmaps with rank/select support. The data is stored in an array, sorted by integer keys. One bitmap records the key values that are present, while another marks the entries that have a different key than the previous entry. Searching involves one rank query, one select query, and a constant number of cache misses.
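Roughly, the layout looks like the sketch below (an uncompressed Java illustration; the class and method names are mine, and the rank/select scans are linear for brevity, whereas real implementations attach succinct rank/select indexes that answer both queries in constant time):

```java
// Sketch of a static "map" keyed by integers, as described above.
// presentBits: one bit per possible key; bit k is set if key k occurs.
// startBits:   one bit per stored entry; bit i is set if entry i has a
//              different key than entry i-1, i.e. it starts a new run.
// values:      the payloads, stored in key order.
final class BitmapKeyedArray {
    private final long[] presentBits;
    private final long[] startBits;
    private final int[] values;

    BitmapKeyedArray(long[] presentBits, long[] startBits, int[] values) {
        this.presentBits = presentBits;
        this.startBits = startBits;
        this.values = values;
    }

    private static boolean bit(long[] bits, int i) {
        return ((bits[i >>> 6] >>> (i & 63)) & 1L) != 0;
    }

    // rank1: number of set bits in positions 0..i (inclusive).
    // Linear scan for brevity; a rank index makes this O(1).
    private static int rank1(long[] bits, int i) {
        int count = 0;
        for (int w = 0; w < (i >>> 6); w++) count += Long.bitCount(bits[w]);
        return count + Long.bitCount(bits[i >>> 6] & (-1L >>> (63 - (i & 63))));
    }

    // select1: position of the r-th set bit, r being 1-based. Linear for brevity.
    private static int select1(long[] bits, int r) {
        for (int i = 0, seen = 0; i < (bits.length << 6); i++) {
            if (bit(bits, i) && ++seen == r) return i;
        }
        throw new IllegalArgumentException("fewer than " + r + " set bits");
    }

    // First stored value for key, or -1 if absent (assumes -1 is never stored).
    int firstValueOf(int key) {
        if (!bit(presentBits, key)) return -1;  // key does not occur
        int run = rank1(presentBits, key);      // which distinct key this is (1-based)
        int pos = select1(startBits, run);      // where that key's run starts
        return values[pos];                     // a constant number of cache misses
    }
}
```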
Simon says:
I totally agree, Journi.
For example, a tree structure holding all IPv4 ranges is very expensive in pointers and navigation. However, each range is just two 32-bit numbers. By capping every range at 256 addresses, and by using the first 24 bits of the start address as an index (think of it as a perfect hash), we end up with a fixed-size index of 2^24 * 32 bits, or 64 MB. A lookup is then one random read, whereas a tree would have taken up to 24 reads. Dealing with the last 8 bits of the IPv4 address is left as an exercise for the reader 🙂 The final data structure is 30 times smaller than the tree.
This is a good example of how using a built-in tree or hash map “off the shelf” can cause unnecessarily high memory overhead.
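Something like the following sketch, in Java for concreteness (what each 32-bit slot holds, and how the last 8 bits are resolved, are my assumptions, since the comment above leaves them open):

```java
// Flat index over the IPv4 space: one 32-bit slot per /24 prefix.
// 2^24 slots * 4 bytes = 64 MB, and a lookup is a single random read.
final class Slash24Index {
    private final int[] slots = new int[1 << 24]; // 64 MB, fixed size

    // The payload is application-defined; here it is assumed to be an opaque
    // 32-bit value (e.g. a range id, or an offset into a secondary structure
    // that resolves the last 8 bits of the address).
    void put(int startAddress, int payload) {
        slots[startAddress >>> 8] = payload; // first 24 bits as the index
    }

    int lookup(int address) {
        return slots[address >>> 8]; // one array read instead of up to 24 tree hops
    }
}
```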
“As someone working on algorithms in bioinformatics, I often get the feeling that no computer ever has a sufficient amount of memory.”
You and I both work on compressed data structures and the corresponding indexes… and this work certainly has its place.
However, most organizations can easily afford a machine with 128 GB of memory or more… and for the rest, you can use SSDs. Meanwhile, actual datasets are often not that huge. The human genome fits in 2 GB.
I am not saying that saving memory is pointless; if I thought so, I would not be benchmarking memory usage… but we should not exaggerate the importance of memory as a bottleneck. Having lots of data but processing it in a predictable way is fine…
Simon says:
I beg to disagree about how common huge datasets are. Many organizations these days do “big data” and have some sort of data workflow that took a team weeks or months to implement. However, it is often necessary to venture off the well-beaten workflow path to explore a portion of the data for some reason. Here the 128 GB RAM system suddenly becomes a limitation, because you want to process the sample data in minutes or hours rather than days.
This is where awareness of algorithms and memory usage is very important. For example, most languages support some kind of hash table. But if you want to serialize one to disk, that normally means iterating over every key in the table, so 100 million keys means 100 million iterations. Perl, however, has a module called Storable which can dump a data structure to disk without iterating over all of its nodes. This makes it faster than possibly any other comparable language for this particular task. Developers should always question the status quo!
An assembled human genome is small, but the actual datasets are much larger. A genome comes out of a sequencing machine as a billion sequences of roughly 100 bases each, accompanied by another 100 gigabytes of quality scores and metadata. Sequencing has become cheap enough that research projects sequencing hundreds or even thousands of individuals are everywhere. The amount of raw data in a single project is now often in the hundreds of terabytes.
Bioinformatics is one of the few fields where compressed data structures are used everywhere. Maybe it’s an attempt to compensate for the lack of suitable hardware with better software. After all, the world’s high-performance computing infrastructure appears to be optimized for processing small numerical/categorical objects, while we have large combinatorial objects such as sequences and graphs.
I did some research recently and found that Koloboke is one of the best libraries in terms of both memory usage and retrieval efficiency.
BTW, I have also rewritten a crucial piece of my Java pipeline in C++. It became 6 times faster, and memory usage was at least halved (more like a third of the original, I think).
One big problem in Java is that it’s hard to avoid memory allocations. In C++, certain things can be made virtually allocation free. Another problem is that all standard math is double precision. Why? You don’t always need doubles.
I published a fork at https://github.com/benalexau/HashVSTree which contains a few enhancements:
– Provide capacity information at initialization time (this seemed reasonable given the array case had that information)
– Add Fastutil’s Int2IntArrayMap, Int2IntRBTreeMap and Int2IntAVLTreeMap
– Add int[] (ie primitive array)
The results were that int[] and Fastutil’s Int2IntArrayMap consumed 8 bytes per entry, which is the optimal native (uncompressed) size given that an int is 32 bits in Java (i.e. 4 bytes per int key + 4 bytes per int value).
The advantage of Int2IntArrayMap is that it implements the Map interface while still providing dynamic resizing, which makes it simpler to use than a raw int[].
Failing to specify an initialization size at construction time reduces Int2IntArrayMap’s space efficiency, but never to worse than Int2IntOpenHashMap (which came in at 13.1 bytes/entry). While Int2IntArrayMap and int[] are the most space-efficient, this comes at the cost of linear-scan access times. As an aside, Fastutil also offers “big arrays”, which can be useful if your arrays are so large they would exceed an integer-based index.
The Fastutil Red-Black and AVL tree maps both came in at 32 bytes per entry. As such they’re still more efficient than any JDK-provided version.
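For concreteness, construction with capacity supplied up front looks roughly like this (a sketch with illustrative sizes, not the benchmark harness itself):

```java
import it.unimi.dsi.fastutil.ints.Int2IntArrayMap;
import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap;
import it.unimi.dsi.fastutil.ints.Int2IntRBTreeMap;

public final class FastutilSizingSketch {
    public static void main(String[] args) {
        final int n = 100_000; // illustrative entry count

        // Parallel primitive arrays: exactly 4 bytes per key + 4 bytes per value.
        int[] keys = new int[n];
        int[] vals = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = i;
            vals[i] = i * 2;
        }

        // Array-backed map: wraps the two arrays behind the Map abstraction,
        // staying at ~8 bytes per entry at the cost of linear-scan lookups.
        Int2IntArrayMap arrayMap = new Int2IntArrayMap(keys, vals);

        // Open-addressing hash map, sized up front to avoid rehashing.
        Int2IntOpenHashMap hashMap = new Int2IntOpenHashMap(n);
        // Red-black tree map: ordered keys, ~32 bytes per entry in the figures above.
        Int2IntRBTreeMap treeMap = new Int2IntRBTreeMap();
        for (int i = 0; i < n; i++) {
            hashMap.put(i, i * 2);
            treeMap.put(i, i * 2);
        }

        System.out.println(arrayMap.get(42) + " " + hashMap.get(42) + " " + treeMap.get(42));
    }
}
```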
Those interested in map performance benchmarks might like to visit https://github.com/mikvor/hashmapTest/issues/3 for the latest hash map performance tests. It compares many implementations, including Koloboke and Fastutil.
Similarly, if hashing performance is critical to your problem domain (e.g. Fastutil has various object maps that can accept a custom hasher), I’ve published graphs of 114 Java hash, CRC and checksum implementations at https://github.com/benalexau/hash-bench.
“Fastutil also offers “big arrays”, which can be useful if your arrays are so large they would exceed an integer-based index.”
For these times when you really need an array that exceeds 8 GB.
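A rough sketch of what that looks like with Fastutil’s big arrays (based on my reading of the IntBigArrays API; recent Fastutil releases have moved some of these static helpers into BigArrays, so check the version you depend on):

```java
import it.unimi.dsi.fastutil.ints.IntBigArrays;

public final class BigArraySketch {
    public static void main(String[] args) {
        // A big array is an int[][] addressed by a long index, so it can hold
        // more than the ~2^31 elements (8 GB of ints) a plain int[] allows.
        long size = 3L << 30; // ~3.2 billion ints, ~12.9 GB; needs a matching -Xmx
        int[][] big = IntBigArrays.newBigArray(size);

        IntBigArrays.set(big, size - 1, 42);                 // index past Integer.MAX_VALUE
        System.out.println(IntBigArrays.get(big, size - 1)); // prints 42
    }
}
```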