Daniel Lemire's blog


Making all your integers positive with zigzag encoding

26 thoughts on “Making all your integers positive with zigzag encoding”

  1. Beware that, in C, right shift with signed integers is undefined behavior.

    1. Up until C23 it was implementation-defined. I expect that C23 will fix this particular issue.

      In Go or Java, there is no issue.

    2. harold says:

      Right shift of signed integers was not UB, the spec text was “If E1 has a signed type and a negative value, the resulting value is implementation-defined”
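
      For what it’s worth, the decode step can be written so that the question never arises, using unsigned arithmetic only; a minimal sketch, not taken from the post:

      ```c
      #include <stdint.h>

      /* Zigzag decode using only unsigned operations, so no signed right
         shift (with its implementation-defined result) is involved. */
      static int32_t zigzag_decode(uint32_t z) {
          /* z >> 1 is a logical shift on an unsigned value;
             0 - (z & 1) is all ones exactly when the low bit is set.
             The final conversion to int32_t is implementation-defined in
             older C standards, but two's complement on mainstream compilers. */
          return (int32_t)((z >> 1) ^ (0 - (z & 1)));
      }
      ```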

  2. me says:

    Maybe point out that this works well in combination with varint encoding in protobuf.
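
    Indeed, protobuf’s sint32/sint64 fields work exactly this way: zigzag first, then varint. A rough sketch of the combination (a hypothetical helper, not protobuf’s actual code):

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Zigzag-encode a signed value, then emit it as a base-128 varint:
       7 payload bits per byte, high bit set on all but the last byte.
       Returns the byte count; out must have room for 5 bytes. */
    static size_t write_svarint32(int32_t v, uint8_t *out) {
        uint32_t u = (uint32_t)v;
        uint32_t z = (u << 1) ^ (0 - (u >> 31)); /* zigzag, unsigned-only */
        size_t n = 0;
        while (z >= 0x80) {
            out[n++] = (uint8_t)(z | 0x80); /* low 7 bits + continuation bit */
            z >>= 7;
        }
        out[n++] = (uint8_t)z;
        return n;
    }
    ```

    With the zigzag step, small negative values such as -1 take one byte instead of five.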

  3. Marcin Zukowski says:

    To add to other comments, I believe signed integer overflow is formally undefined, and you can get overflows here. But it probably works in all “sane” compilers.

  4. Tim says:

    Is “2*x” with int x safe to do in C? Seems like you’d hit UB.

    1. Karol says:

      It’s not, you have to cast to unsigned to make it safe.

      1. Can you safely cast all signed integers to unsigned integers (in current C standards)?

        1. Yes, C99 and C11 specify in Section 6.3.1.3 (Signed and unsigned integers), 2nd paragraph:

          Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or
          subtracting one more than the maximum value that can be represented in the new type
          until the value is in the range of the new type.

          So with that requirement, one gets bit-identical values if integers are in two’s complement – which is already assumed.
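
          Putting this sub-thread together, an encoder with no undefined behavior can do all of its arithmetic on the unsigned type; a minimal sketch:

          ```c
          #include <stdint.h>

          /* Zigzag encode with no signed overflow and no signed shift:
             convert to unsigned first (well defined, per 6.3.1.3),
             then shift and XOR. */
          static uint32_t zigzag_encode(int32_t v) {
              uint32_t u = (uint32_t)v;      /* defined for every value of v      */
              uint32_t sign = 0 - (u >> 31); /* all ones if v < 0, zero otherwise */
              return (u << 1) ^ sign;        /* unsigned wraparound is defined    */
          }
          ```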

  5. Andre Bogus says:

    Rust has the i*::rotate_left(_) and ::rotate_right(_) functions on all integer types. So the correct encode function would be n.rotate_left(1); and decode would be n.rotate_right(1).

    At least on x86_64, this maps to the respective rol/ror assembly instructions, and the compiler doesn’t need to optimize more complex code.

    1. Lockal says:

      And speaking of vectorization in AVX-512, there is even vprold to convert many integers in a single instruction.
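
      For reference, the XOR-based zigzag transform from the post vectorizes directly as well; a minimal AVX2 sketch, assuming the element count is a multiple of 8:

      ```c
      #include <immintrin.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Zigzag-encode `count` 32-bit integers in place with AVX2.
         Sketch only: count is assumed to be a multiple of 8. */
      static void zigzag_encode_avx2(uint32_t *data, size_t count) {
          for (size_t i = 0; i < count; i += 8) {
              __m256i v    = _mm256_loadu_si256((const __m256i *)(data + i));
              __m256i twox = _mm256_slli_epi32(v, 1);   /* v << 1           */
              __m256i sign = _mm256_srai_epi32(v, 31);  /* arithmetic shift */
              _mm256_storeu_si256((__m256i *)(data + i),
                                  _mm256_xor_si256(twox, sign));
          }
      }
      ```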

  6. moonchild says:

    An alternative, if you know the range ahead of time, is to represent numbers with offsets from the low end of the range. This way, if (e.g.) you don’t have a lot of negative numbers, you don’t unfairly punish the positive ones by stealing a bit from them all.
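
    A minimal sketch of that idea, often called frame-of-reference (FOR) coding, assuming the minimum of the range is known up front:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Frame-of-reference sketch: subtract the known range minimum so
       every value becomes a small non-negative offset. Unlike zigzag,
       this needs the base up front. */
    static void for_encode(const int32_t *in, uint32_t *out, size_t n,
                           int32_t base /* minimum of the range */) {
        for (size_t i = 0; i < n; i++) {
            /* unsigned subtraction: defined for the full int32_t range */
            out[i] = (uint32_t)in[i] - (uint32_t)base;
        }
    }
    ```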

  7. Dru Nelson says:

    There is another way to present this idea.

    You are making the lowest bit the ‘sign’ bit.
    It works like the sign bit in floating point (sign-magnitude), as opposed to the two’s-complement approach.

    Once you see that, it is much more intuitive.

    I don’t know why it isn’t presented like this more often.

    1. That’s not quite what it is though, is it?
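
      One way to make the comparison concrete: moving a sign-magnitude sign bit into the lowest position sends -n to 2n + 1 and never produces the code 1 (that would be “minus zero”), while zigzag sends -n to 2n - 1 and uses every code. A small sketch contrasting the two mappings (the helper names are mine, for illustration):

      ```c
      #include <stdint.h>
      #include <stdio.h>

      /* Standard zigzag: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ... */
      static uint32_t zigzag(int32_t v) {
          uint32_t u = (uint32_t)v;
          return (u << 1) ^ (0 - (u >> 31));
      }

      /* Hypothetical sign-magnitude variant: magnitude shifted left,
         sign in the low bit. Not a bijection: code 1 is never produced. */
      static uint32_t sign_in_lsb(int32_t v) {
          uint32_t mag = (v < 0) ? 0u - (uint32_t)v : (uint32_t)v; /* |v| */
          return (mag << 1) | (v < 0 ? 1u : 0u);
      }

      int main(void) {
          for (int32_t v = -3; v <= 3; v++)
              printf("%3d zigzag=%u sign_in_lsb=%u\n",
                     (int)v, zigzag(v), sign_in_lsb(v));
          return 0; /* e.g. -1 -> zigzag 1, sign_in_lsb 3 */
      }
      ```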

  8. David Plass says:

    I cannot fathom a practical requirement or use of this algorithm. What’s the motivation?

    1. It is commonly used as part of compression routines.

      1. David Plass says:

        You should start with that as a motivator, instead of burying the lede.

        1. The first paragraph is as follows…

          You sometimes feel the need to make all of your integers positive, without losing any information. That is, you want to map all of your integers from ‘signed’ integers (e.g., -1, 1, 3, -3) to ‘unsigned integers’ (e.g., 3,2,6,7). This could be useful if you have a fast function to compress integers that fails to work well for negative integers.

          I am sure it could be improved, but it is meant to provide motivation.

  9. Ethan says:

    What about the case of having unequal amounts of negative and positive integers and wanting to avoid gaps?

    I.e., let’s say our initial set is: -1, -2, -3, 1, 2, 3, 4, 5, 6, 7, 8

    With zigzag encoding applied, we are left with
    1, 3, 5, 2, 4, 6, 8, 10, 12, 14, 16.

    This leaves us with “gaps” (7, 9, 11, 13, 15), which make the positive integers in our initial set take up more space in their binary representation.

    What do compression ratios end up looking like in the varying scenarios of
    1. equal amounts of negative and positive integers, and
    2. unequal amounts of negative and positive integers relative to each other?

    1. Can you define what you mean by “compression ratios”? The blog post does not describe a compression routine.

      This being said, zigzag encoding tends to favour values that are close (in absolute value) to zero… in the sense that such values get mapped to ‘small positive integers’.

      1. Marcin Zukowski says:

        Indeed, zigzag doesn’t necessarily compress per se. At the same time, like @me (who’s that? :)) mentioned, it enables e.g. varint encoding (which, in my book, is also not compression, but hey 😛 )

        @moonchild also mentioned adjusting the base, a.k.a. FOR (frame-of-reference) encoding. If there’s a difference in positive/negative ranges, FOR indeed will create a better (smaller) range of integers. But you need to know that base upfront, which is a weakness.

        In general, if someone is interested in more efficient integer compression, Daniel’s PFOR library is not the worst place to start: https://github.com/lemire/FastPFor 🙂

      2. Ethan says:

        > This could be useful if you have a fast function to compress integers that fails to work well for negative integers.

        This is what motivated that question (from the top of the post). I’d also need a definition of what it means for a function that compresses integers to work or not work well for negative integers :).

        I guess a clearer question is: if we’re talking about zigzag encoding as a solution for “dealing with” negative integers so that they can be compressed “well” by some function that compresses integers, is zigzag encoding the best encoding method for “getting rid of” negative integers for whichever function that compresses integers well, but doesn’t compress negative integers well?

        The second part of your response I think partially answers my question. And while it does map those negative values close to 0 to small positive integers, it does also map existing positive integers to larger positive integers.

        1. My blog post absolutely does not answer this question:

          Is zigzag encoding the best encoding method for “getting rid of” negative integers for whichever function that compresses integers well, but doesn’t compress negative integers well.

          1. Ethan says:

            Of course it doesn’t. That’s why I’m here in the comments! However, the blog post does directly state that there exists, or at least probably exists, some function that compresses (positive) integers well.

            This is why I was quoting this piece:
            > This could be useful if you have a fast function to compress integers that fails to work well for negative integers.

            I’m wondering what that usefulness looks like in practice. If it were not an important detail, it seems like it wouldn’t be included in your post. Maybe it’s not. However, I don’t think it’s a trivial detail, which is why I’m asking questions about it.

            I’ll think about this some more on my own.

  10. Marcin Zukowski says:

    From what I saw, varint is probably the most obvious example of such a function, and that’s where I’ve seen zigzag often.

    1. It would probably enhance zstd compression in this example:

      https://lemire.me/blog/2022/11/28/generic-number-compression-zstd/

      It is also obviously applicable to other formats: https://github.com/lemire/streamvbyte