I made a start at SIMD-izing the encoding; so far my best estimate is at least 16 instructions per quadword. One hitch is that there is no SIMD unsigned gt/lt compare, so instead I’m doing shift-right and compare to zero for the three different widths. The output bitmasks are conveniently -1 or 0 however, so we can just sum them together with 3 to get the 2-bit code value for each lane.
The shuffle looks like it will need a table unless there’s some fancy way to build it — that’s still on my todo list.
It seems that the current encoder/decoder are little-endian whereas the paper describes a big-endian format. I have not had time to look into the matter, but for interoperability, this matters.
One hitch is that there is no SIMD unsigned gt/lt compare
I think that if you subtract 1<<(L-1) to L-bit integers prior to doing the signed comparison, you get an unsigned comparison.
Kendall Willetssays:
I did overlook the fact that the sub would only need to happen once for the three comparisons — I’ll look at this again tonight.
I made a start at SIMD-izing the encoding; so far my best estimate is at least 16 instructions per quadword. One hitch is that there is no SIMD unsigned gt/lt compare, so instead I’m doing shift-right and compare to zero for the three different widths. The output bitmasks are conveniently -1 or 0 however, so we can just sum them together with 3 to get the 2-bit code value for each lane.
The shuffle looks like it will need a table unless there’s some fancy way to build it — that’s still on my todo list.
Wow.
Note that there is an open issue regarding endianess…
https://github.com/lemire/streamvbyte/issues/4
It seems that the current encoder/decoder are little-endian whereas the paper describes a big-endian format. I have not had time to look into the matter, but for interoperability, this matters.
One hitch is that there is no SIMD unsigned gt/lt compare
I think that if you subtract 1<<(L-1) to L-bit integers prior to doing the signed comparison, you get an unsigned comparison.
I did overlook the fact that the sub would only need to happen once for the three comparisons — I’ll look at this again tonight.
I think you can almost always get unsigned comparisons very cheaply, which is probably why Intel did not bother adding support for it.