12th September 2012, 18 min read

Fast integer compression: decoding billions of integers per second

24 thoughts on “Fast integer compression: decoding billions of integers per second”

oscarbg says:

September 13, 2012 at 9:59 am

Hi,
pretty interesting.. altough I don’t know much about this topic I was thinking if this scheme is compatible with wider vector sets and perform well with new Intel AVX2 (256 bit wide integer instructions) and BMI2 instructions which would bring 2x wider vectorization.. and also what about Intel Xeon Phi with 512 bit wide vector instructions.. would new algorithms be thought?
Only desiring for you to share your thoughts on the effect of upcoming vector instructions for integer compression..
thanks..
Daniel Lemire says:

September 13, 2012 at 10:34 am

@oscarbg

Thanks for your excellent comment. We do discuss the issue of wider vectors in the paper, see section 7 (page 22).

Reference: http://arxiv.org/abs/1209.2137
Itman says:

September 13, 2012 at 9:35 pm

@oscarbg
Unlike varint it uses only simple SSE2 instructions: 4-int additions, ORs, and ANDs. I believe these instructions should be efficiently supported by all CPUs that have SIMD operations.
oscarbg says:

September 13, 2012 at 11:09 pm

thanks for responses..
omar says:

September 17, 2012 at 1:00 am

Hi Daniel, very interesting work here from what I can understand. Question for you, does this finding have an effect on how data could be transmitted through cell towers to smart phones? Essentially I want to know if this finding implies it would be twice as easy to transmit let’s say a youtube video to a mobile device? Where do you see the upper bounds of compression going in the near future compared to where it is today? Thanks.
Daniel Lemire says:

September 17, 2012 at 8:18 am

@omar

1) Our work is mostly applicable to in-memory databases and search engines. If you are sending data over a network, then pure CPU decoding speed is less important. The goal is our work is to decode the data very, very fast even if it means having a lesser compression. That’s because we assume that the data to be decoding is already in memory.

2) As for…

Where do you see the upper bounds of compression going in the near future compared to where it is today?

We can still improve matters quite a bit. Despite 40 years of research, there is still plenty of room to compress better and faster.
Martin says:

September 30, 2012 at 11:34 pm

Hi Daniel,

I have a question with regard to what you mention in your paper for the binary packing algorithm by Willhalm et al. You say, that they need at least 11 assembly instructions and then you don’t consider this at all anymore. However, my assumption would be that it’s not only the number of instructions but as well their latency etc.

If I look at the results in their paper their decoding speed is at least the same as yours.

For 7 bit/int you get 2800 mis and Willhalm et al get ~300ms for 1000 mis => ~3000 mis.

Is there a specific reason you left it out in your observations? Or did I misunderstand something completely and their approach is simply not viable in your benchmark.

Thanks,
Martin
Daniel Lemire says:

October 1, 2012 at 8:21 am

@Martin

If you want to deduce speed from their Figure 11, then you need to look at our Figure 8 for comparison. We reach 6000 mis on bit packing alone.

The numbers we report (outside our Fig. 8) *include* delta decoding which, as we stress in our paper, is a significant cost.
Itman says:

October 1, 2012 at 8:35 am

@Martin

Thank you for pointing it out. Yes, comparing just the number of assembly instructions is a bit naive. We might even need to implement this method in the future. Note, however, that

1) The latencies in Willhalm et al algorithms are same or higher. They are doing very similar things except that they have the shuffle operation and an additional memory access for the shuffle table. Essentially, their method is varint + a few extra masks and shifts. And we know that varint is slower.

2) As Daniel pointed out, the speeds are not directly comparable. I would elaborate on this. First, deltas can entail a significant cost. It is possible to incorporate them better, but this wasn’t done in the current work and is planned for the follow up article. Second, their bit case is not equivalent to ours. They only decode integers that fit into B bits exactly. In our case, it is B bits ON AVERAGE. They did not implement switching about different bitcases in a single integer array.
Daniel Lemire says:

October 1, 2012 at 8:47 am

@Martin

For bit widths less than 8, our unpacking speed is always higher than 6000 mis (see Fig. 8b) whereas they have half this speed (3300 mis) when the bit width is 6: indeed, they require 300 ms to unpack 1000 billion integers (see their Fig. 11). For a larger bit width (10), their speed falls to less than 2500 mis whereas our worse overall unpacking speed is 4000 mis.
Martin says:

October 1, 2012 at 8:52 am

@Daniel, thanks for the clarification, this explains a lot since you are focussing on vertical processing instead of horizontal.
Martin says:

October 1, 2012 at 9:01 am

@Itman, what I was missing from this picture was that I mis-read the part about vertical and horizontal bit packing. I’m intrigued to try using vertical bit packing in our column store prototype and compare it to horizontal bit packing to see what happens.

Since I’m a traditional In-Memory RDBMS guy, code length never change until we recompress the whole data partition.

It looks like there is some interesting meat left that one could look at 🙂

Thanks for the great read.
Itman says:

October 1, 2012 at 11:43 am

@Martin, no problem. Daniel made a good point: I suspect that Willhalm et al does not include storing and/or preprocessing costs into timings. Essentially, they should be decompressing into L1 and discard data at some point. We, in contrast, reported mostly very conservative estimates that included storing uncompressed integers.

Since memory is slow and you have to write 4-byte integers, this limits decoding speed. Data in Fig 8b represents a scenario that should be closer to that of Willhalm et al. We do read compressed integers from memory, but this is a small cost, because integers occupy one byte on average (you can read 20 billions per second on our machine). In this scenario, we are getting 5000-7000 mis.

Again, to remove all doubts we would need to implement this approach some day. Because, we are not doing it right away, we are curious to learn your results (should you decide and try vertical layout).
Todd Lipcon says:

October 4, 2012 at 7:31 pm

Hi Daniel. Quick question: the README for the github source says the license is “APL 2.0”. Does this mean the “Apache License 2.0”?

Thanks
-Todd
Daniel Lemire says:

October 4, 2012 at 8:06 pm

@Todd

Yes, that’s what it means.
JanPierres says:

October 6, 2012 at 3:24 am

“Update: D. Lemire believes that this scheme was patented by Rose, Stepanov et al. (patent 20120221539).

We wrote this code before the patent was published (August 2012)”

does it make their patent useless?
Itman says:

October 6, 2012 at 10:50 pm

Hi JanPierres,

We found several teams that discovered this or a similar algorithm about the same time. This all happened before the patent obtained. You can check details in the paper.

Unfortunately, it is not clear how a patent would work in this case. Hopefully, it was defensive and they will not sue anybody for using this code. Otherwise, a judge will decide whether the patent is useless or not :-)))
Daniel Lemire says:

October 6, 2012 at 10:54 pm

@JanPierres

Varint-G8IU is patented but not (as far as I can tell) the other schemes we use. Their paper appeared before the patent was granted. We implemented it without knowing about the patent for the purpose of comparing performance.

Regarding prior art, what matters is the filling date of the patent application. So anyone using varint-G8IU in a commercial application could be in for a lawsuit. That is why I included this warning.

With patents, you can play the following game: you apply for a patent, then entice people to use your work without telling them about the patent, and then once the patent is granted, you are free to require licensing and sue them.

Of course, I would not want anyone to get sued because I posted software online. But my warning should keep people safe.

I think that patents are evil. See

Do we need patents?
http://lemire.me/blog/archives/2012/01/06/do-we-need-patents/
dg says:

October 8, 2012 at 9:40 pm

How does this PFOR implementation compare with LZ4 in terms of compression ratio and performance?

http://code.google.com/p/lz4/
Daniel Lemire says:

October 8, 2012 at 10:45 pm

@dg

I believe you can’t really compare something generic like LZ4 with the integer compression schemes we review in this paper. However, because we expected this sort of question, we did benchmark Google Snappy and got that it was not usable for these purposes (see the paper for details). I expect the same conclusion to hold with LZ4.
Alecco says:

January 7, 2013 at 11:12 am

This is amazing! I wish more papers were like this.
Deep says:

September 1, 2016 at 10:17 am

Can this be used for other datatypes such as double ?
1. Daniel Lemire says:
  
  September 1, 2016 at 1:55 pm
  
  Yes, but I don’t have code for this.
Pascal S. de Kloe says:

July 5, 2017 at 5:18 pm

I just published a new algorithm including a reference implementation in C and Go.
https://github.com/pascaldekloe/flit
Feedback is welcome!
https://news.ycombinator.com/item?id=14704158