Daniel Lemire's blog

Run-length encoding (part 3)

13 thoughts on “Run-length encoding (part 3)”

  1. Kevembuangga says:

    But did you try Burrows–Wheeler within a single column, irrespective of how the other columns/rows are shuffled?

  2. @Kevembuangga
    Yes, I thought about it… But reordering the columns independently makes reconstructing single rows a messy adventure.

  3. Kevembuangga says:

    makes reconstructing single rows a messy adventure

    Besides having to “guess” the value of the given row items from the BW-reordered columns, how is that any messier than having to unfold the RLE-compressed columns up to the row rank you are interested in?

  4. @Kevembuangga Suppose you want to get all row IDs where the first column has value X and the second column has value Y. How do you extract these row IDs quickly, if the columns were reordered independently?

  5. Kevembuangga says:

    How do you extract these row IDs quickly

    You would have to run the inverse BWT on each whole column; this is just another stage of decompression, unless there is a very clever hack (on a level of this kind!) to peek at the ID values from the encoded BWT.
    Do you mean that with simple RLE you can pick the row IDs corresponding to the Xs and Ys somehow by “direct access” from the compressed columns, without actually running through the whole columns?
    And don’t you use boolean operations on bitmaps to compute such sets of IDs?

    Anyway, maybe you should forget about Burrows–Wheeler, which is overrated: it performs well on text but poorly on Markov sources, which is probably what column data most often are.
    See “Most Burrows-Wheeler based compressors are not optimal”, H. Kaplan and E. Verbin, CPM 2007.

  6. @Kevembuangga

    Thank you for the pointers.

    You would have to run the inverse BWT on each whole column, this is just another stage of decompression (…)

    Who said anything about decompressing the data?

    I can scan RLE-compressed data looking for the row IDs I need without ever decompressing them. So, the running time is a linear function of the compressed size of my input. (Times some factor which depends on the number of columns.)

    If I first decompress the data, then my processing time will be a function of the real (uncompressed) data size. I save on storage but not on processing time. In fact, the processing time will be longer than if I had worked with uncompressed data (not accounting for possible IO savings).

    And I am not even discussing the processing overhead of computing several BWTs.
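
    For concreteness, here is a minimal sketch of the kind of scan I have in mind; the (value, run length) layout and the matchingRows helper are only illustrative, not a reference implementation:

    #include <cstdint>
    #include <vector>

    // Hypothetical layout: each column is a sequence of (value, run length) pairs.
    struct Run { uint32_t value; uint32_t length; };

    // Collect the row IDs whose value equals 'target' by walking the runs.
    // The column is never materialized in uncompressed form: the scan does work
    // proportional to the number of runs plus the number of matches.
    std::vector<uint32_t> matchingRows(const std::vector<Run> &column, uint32_t target) {
        std::vector<uint32_t> rowids;
        uint32_t rowid = 0;
        for (const Run &run : column) {
            if (run.value == target) {
                for (uint32_t i = 0; i < run.length; ++i) rowids.push_back(rowid + i);
            }
            rowid += run.length;
        }
        return rowids;
    }

    Intersecting the row IDs returned for two columns answers the “first column is X and second column is Y” query without decompressing either column.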

  7. Kevembuangga says:

    I can scan RLE-compressed data looking for the row IDs I need without ever decompressing them. So, the running time is a linear function of the compressed size of my input.

    Mmmmmm… I see, but you still run through the whole column. If you were really clever, you could make the running time a linear function of ONLY the size of the compressed output (the number of rows matching the sought-after value(s)):
    Instead of storing some kind of map of the row values, store, for each possible row value (up to the cardinality), a compressed bitmap of the corresponding row IDs.
    So whenever you know which values you are looking for, you only pick and process the corresponding bitmaps.
    Compressed sparse bitmaps can be cheap…
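
    Roughly what I envision, as a sketch only: a plain std::vector<bool> stands in for a compressed bitmap (EWAH, WAH, …) just to show the access pattern, and the BitmapIndex type is made up for the example:

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Hypothetical bitmap index: one bitmap per distinct value, built as the
    // column is appended; std::vector<bool> stands in for a compressed bitmap.
    struct BitmapIndex {
        std::unordered_map<uint32_t, std::vector<bool>> bitmaps;
        std::size_t rows = 0;

        void append(uint32_t value) {
            std::vector<bool> &b = bitmaps[value];
            b.resize(rows + 1, false); // pad with zeros up to the current row
            b[rows] = true;
            ++rows;
        }

        // A lookup touches only the one bitmap attached to the sought-for value,
        // never the whole column.
        const std::vector<bool> *lookup(uint32_t value) const {
            auto it = bitmaps.find(value);
            return it == bitmaps.end() ? nullptr : &it->second;
        }
    };

    The row IDs for “first column is X and second column is Y” then come from ANDing two such bitmaps, the boolean operations I mentioned earlier.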

  8. Kevembuangga says:

    Typo: “each possible column value” instead of “each possible row value”

  9. @Kevembuangga

    if you were really clever you could have the running time a linear function of ONLY the size of the compressed output

    The curse of dimensionality makes this cleverness extremely difficult unless you know which specific queries to expect.

    Consider these queries for example:


    select X*Y*Z where (X*Y-Z<W);
    select * where (X*X+Y*Y<Z*Z+W*W+V*V);

    Certainly, you can find ways to index them. But if I am free to come up with others, you will eventually get into serious trouble.

    The lesson here is that there is not one indexing strategy that will always work.

    You may enjoy this earlier blog post:

    Understanding what makes database indexes work
    http://www.daniel-lemire.com/blog/archives/2008/11/07/understanding-what-makes-database-indexes-really-work/

  10. Kevembuangga says:

    The curse of dimensionality …

    Looks like we are not talking about the same topic, never mind.

  11. @Kevembuangga Yes, I think we are talking about the same thing. If you only focus on one column at a time, you can indeed index things very nicely using a (compressed) B-tree or hash table. But we are dealing with a multidimensional list… a table with several columns… that’s much harder to index.

  12. Kevembuangga says:

    I think we are talking about the same thing.

    Then you missed my point!
    I am not proposing a scheme that will resolve the multi-column sorting conundrum any better than your (or anyone else’s) RLE compression.
    What I am saying is that, when working within any one column during a complex search, it is possible to reduce the computation load to be proportional only to the compressed size of the intermediate result (the set of row IDs) for this column, NOT to the compressed size of the whole column.
    That is of practical import, though, only if there isn’t a constant multiplicative factor hidden somewhere in the algorithm I envision that would spoil the better asymptotic performance; many a “theoretically best” algorithm is crippled by this.
    All this, of course, assuming that we are still speaking about columns for which a bitmap is to be preferred.
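
    To make the cost model concrete, a rough sketch, with sorted row-ID lists standing in for the compressed intermediate results and an intersect helper that is only illustrative:

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <vector>

    // Combine two per-column intermediate results (sorted lists of row IDs).
    // The work is proportional to the sizes of the two intermediate results,
    // not to the (compressed or uncompressed) sizes of the whole columns.
    std::vector<uint32_t> intersect(const std::vector<uint32_t> &a,
                                    const std::vector<uint32_t> &b) {
        std::vector<uint32_t> out;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::back_inserter(out));
        return out;
    }

    If both per-column results are small, the combination step is small too, whatever the size of the table.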

  13. Willfred says:

    Interesting, simple and effective.