Beware that, in C, right shift with signed integers is undefined behavior.
Up until C23 it was implementation-defined; I expect that C23 will fix this particular issue. In Go or Java, there is no issue.
harold says:
Right shift of signed integers was not UB; the spec text was: “If E1 has a signed type and a negative value, the resulting value is implementation-defined.”
me says:
Maybe point out that this works well in combination with varint integers in protobuf.
Marcin Zukowski says:
To add to other comments, I believe signed int overflows are formally undefined, and you can get overflows here. But it probably works in all “sane” compilers.
Tim says:
Is “2*x” with int x safe to do in C? Seems like you’d hit UB.
Karol says:
It’s not; you have to cast to unsigned to make it safe.
Can you safely cast all signed integers to unsigned integers (in current C standards)?
Yes, C99 and C11 specify in Section 6.3.1.3 (Signed and unsigned integers), 2nd paragraph:
> Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
So with that requirement, one gets bit-identical values if integers are in two’s complement, which is already assumed.
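For concreteness, here is a minimal sketch of what this thread’s advice looks like in C: the doubling and the shifts are done on unsigned values, where wraparound is well defined. The function names are mine, not from the post, and the final cast back to int32_t still assumes two’s complement (guaranteed only since C23):

    #include <stdint.h>

    /* encode: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...; all arithmetic is
       unsigned, so the doubling wraps instead of overflowing */
    static uint32_t zigzag_encode(int32_t x) {
        return ((uint32_t)x << 1) ^ (x < 0 ? UINT32_MAX : 0);
    }

    static int32_t zigzag_decode(uint32_t z) {
        /* the cast back to int32_t is the identity on two's-complement
           targets; use memcpy for strict portability before C23 */
        return (int32_t)((z >> 1) ^ ((z & 1) ? UINT32_MAX : 0));
    }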
Rust has the i*::rotate_left(_) and ::rotate_right(_) functions on all integer types. So the correct encode function would be n.rotate_left(1), and decode would be n.rotate_right(1).
At least on x86_64, this maps to the respective rol/ror assembly instructions, and the compiler doesn’t need to optimize more complex code.
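For readers working in C rather than Rust, the rotate-by-1 idiom is easy to spell out, and mainstream compilers (GCC, Clang, MSVC) likewise recognize the pattern and emit a single rol/ror. A sketch, with names of my choosing:

    #include <stdint.h>

    /* rotate a 32-bit value left/right by one bit; the (v >> 31) and
       (v << 31) terms carry the dropped bit around to the other end */
    static uint32_t rotl1(uint32_t v) { return (v << 1) | (v >> 31); }
    static uint32_t rotr1(uint32_t v) { return (v >> 1) | (v << 31); }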
Lockal says:
And speaking of vectorization, in AVX-512 there is even vprold to convert many integers in a single instruction.
moonchild says:
An alternative, if you know the range ahead of time, is to represent numbers with offsets from the low end of the range. This way, if (e.g.) you don’t have a lot of negative numbers, you don’t unfairly punish the positive ones by stealing a bit from them all.
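A minimal sketch of that offset idea (often called frame-of-reference coding), assuming the lower bound lo of the range is known up front; the function name and signature are mine, for illustration only:

    #include <stddef.h>
    #include <stdint.h>

    /* store each value as its (non-negative) distance from the known
       lower bound; unsigned subtraction is modular, hence well defined */
    static void offset_encode(const int32_t *in, uint32_t *out, size_t n,
                              int32_t lo) {
        for (size_t i = 0; i < n; i++)
            out[i] = (uint32_t)in[i] - (uint32_t)lo;
    }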
Dru Nelson says:
There is another way to present this idea.
You are making the lowest bit the ‘sign’ bit.
This is a sign bit like the ‘sign’ bit in floating point, as opposed to the method in two’s complement.
Once you see that, it is much more intuitive.
I don’t know why it isn’t presented like this more often.
That’s not quite what it is though, is it?
I cannot fathom a practical requirement or use of this algorithm. What’s the motivation?
It is commonly used as part of compression routines.
You should start with that as a motivator, instead of burying the lede.
The first paragraph is as follows…
> You sometimes feel the need to make all of your integers positive, without losing any information. That is, you want to map all of your integers from ‘signed’ integers (e.g., -1, 1, 3, -3) to ‘unsigned integers’ (e.g., 3,2,6,7). This could be useful if you have a fast function to compress integers that fails to work well for negative integers.
I am sure it could be improved, but it is meant to provide motivation.
Ethan says:
What about the case of having unequal amounts of negative and positive integers and wanting to avoid gaps?
I.e., let’s say our initial set is: -1, -2, -3, 1, 2, 3, 4, 5, 6, 7, 8.
With zig-zag encoding applied we are left with
1, 3, 5, 2, 4, 6, 8, 10, 12, 14, 16.
Which leaves us with “gaps” (below). These gaps now make the positive integers in our initial set take up more space in their binary representation.
7, 9, 11, 13, 15.
What do compression ratios end up looking like in the varying scenarios of:
1. Equal amounts of negative and positive integers
2. More or less negative and positive integers relative to each other.
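For reference, the numbers in the comment above can be reproduced with the usual zigzag formula; a small self-contained check (my code, not from the post):

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t zigzag_encode(int32_t x) {
        return ((uint32_t)x << 1) ^ (x < 0 ? UINT32_MAX : 0);
    }

    int main(void) {
        const int32_t in[] = {-1, -2, -3, 1, 2, 3, 4, 5, 6, 7, 8};
        /* prints 1 3 5 2 4 6 8 10 12 14 16 */
        for (int i = 0; i < 11; i++)
            printf("%u ", zigzag_encode(in[i]));
        printf("\n");
        return 0;
    }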
Can you define what you mean by “compression ratios”? The blog post does not describe a compression routine.
This being said, zigzag encoding tends to favour values that are close (in absolute value) to zero… in the sense that such values get mapped to ‘small positive integers’.
Marcin Zukowski says:
Indeed, zigzag doesn’t necessarily compress per se. At the same time, like @me (who’s that? :)) mentioned, it enables e.g. varint encoding (which, in my book, is also not compression, but hey 😛).
@moonchild also mentioned adjusting the base, aka FOR encoding. If there’s a difference in positive/negative ranges, FOR indeed will create a better (smaller) range of integers. But you need to know that base upfront, which is a weakness.
In general, if someone is interested in more efficient integer compression, Daniel’s PFOR library is not the worst place to start: https://github.com/lemire/FastPFor 🙂
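For concreteness, a sketch of the varint step mentioned above, in the protobuf style (7 payload bits per byte, high bit set on every byte except the last); applied after zigzag, a value like -1 comes out as the single byte 0x01. The code is mine, not from any comment:

    #include <stddef.h>
    #include <stdint.h>

    /* writes v to out in little-endian base-128; returns the byte count
       (at most 5 bytes for a 32-bit value) */
    static size_t varint_encode(uint32_t v, uint8_t *out) {
        size_t n = 0;
        while (v >= 0x80) {
            out[n++] = (uint8_t)(v | 0x80);
            v >>= 7;
        }
        out[n++] = (uint8_t)v;
        return n;
    }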
Ethan says:
> This could be useful if you have a fast function to compress integers that fails to work well for negative integers.
This is what motivated that question (from the top of the post). I’d also need a definition of what it means for a function that compresses integers to work or not work well for negative integers :).
I guess a clearer question is: if we’re talking about zigzag encoding as a solution for “dealing with” negative integers so that they can be compressed “well” by some function that compresses integers, is zigzag encoding the best encoding method for “getting rid of” negative integers for whichever function compresses integers well but doesn’t compress negative integers well?
The second part of your response I think partially answers my question. And while it does map those negative values close to 0 to small positive integers, it does also map existing positive integers to larger positive integers.
My blog post absolutely does not answer this question:
Is zigzag encoding the best encoding method for “getting rid of” negative integers for whichever function compresses integers well but doesn’t compress negative integers well?
Ethan says:
Of course it doesn’t. That’s why I’m here in the comments! However the blog post does directly state that there exists, or at least probably exists, some function that compresses (positive) integers well.
This is why I was quoting this piece:
> This could be useful if you have a fast function to compress integers that fails to work well for negative integers.
I’m wondering what the usefulness you mention there is like in practice. If it’s not that important a detail, it seems like it wouldn’t be included in your post. Maybe it’s not. However, I don’t think it’s a trivial detail, which is why I’m asking questions about it.
I’ll think about this some more on my own.
Marcin Zukowski says:
From what I saw, varint is probably the most obvious example of such a function, and that’s where I’ve seen zigzag often.
It would probably enhance zstd compression in this example:
https://lemire.me/blog/2022/11/28/generic-number-compression-zstd/
It is also obviously applicable to other formats: https://github.com/lemire/streamvbyte