Daniel Lemire's blog

, 9 min read

Converting floating-point numbers to integers while preserving order

11 thoughts on “Converting floating-point numbers to integers while preserving order”

  1. George Spelvin says:

    It might be worth mentioning the similar code for signed integers, which seem a more natural target for signed floating-point values:

    int64_t sign_flip(int64_t x) {
    uint64_t mask = x >> 62; // Or 63, your choice
    return x ^ (mask >> 1);

    Here, you never flip the sign bit, so the mask needs to cover only the low 63 bits.

    On most modern machines, shift instructions are smaller and faster than large immediate constants, but there are some processors (particularly synthesizable cores for FPGA implementation) which have slow shifts. In such cases, an alternative is:

    uint64_t mask = x & ((uint64_t)1 << 63);
    return x ^ (mask - 1);

    1. Very nice indeed.

  2. aqrit says:

    Is it possible to compare signed integers with a floating-point compare, or do some bit-patterns not work?

    Context: SSE2 has a 64-bit floating-point compare _mm_cmpgt_pd() but has only 32-bit integer compares.

    Current work-around:

    __m128i pcmpgtq_sse2 (__m128i a, __m128i b) {
    __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a));
    r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b));
    return _mm_shuffle_epi32(r, _MM_SHUFFLE(3,3,1,1));

    1. Per Vognsen says:

      There’s a simple combinatorial reason you can’t express a 64-bit integer comparison, signed or unsigned, with a 64-bit floating-point comparison: some floating-point numbers “are the same” as far as float comparisons are concerned, notably all the NaN bit patterns, and hence by the pigeonhole principle any mapping from 64-bit ints to 64-bit floats must map some ints to values that won’t order correctly.

      But if you have 64-bit uints where the two most significant bits are 0 (i.e. values in the range 0 to 2^62), so the float sign is positive and the exponent can’t be the all 1s pattern which signifies a NaN, you should be able to load it as a 64-bit float and get isomorphic comparisons. If you want 2’s complement signed comparisons this way, I think you’d have to do a pre-transform.

      1. aqrit says:

        Thank You.

  3. Falk Hüffner says:

    This can be done with two instructions less:

    bool generic_comparator(double x1, double x2) {
    uint64_t i0 = to_uint64(x1);
    uint64_t i1 = to_uint64(x2);
    uint64_t mask = uint64_t(int64_t(i0) >> 63);
    mask |= 0x8000000000000000;
    return (i0 ^ mask) < (i1 ^ mask);

    1. Yes, that’s credible. Thanks.

  4. James Sadler says:

    Thanks for the great post – this is going to come in very handy!

    I assumed the sign_flip function would be reversible by changing the direction of the bit shift. This seems to work for some values, but not quite. Negative values seem to be incorrect by a small amount. It feels like the most insignificant bit is incorrect in the result. Can you see what I’m doing wrong here?

    uint64_t inverse_sign_flip(uint64_t x) {
    // credit http://stereopsis.com/radix.html
    // when the most significant bit is set, we need to
    // flip all bits
    uint64_t mask = uint64_t(int64_t(x) << 63);
    // in all case, we need to flip the most significant bit
    mask |= 0x8000000000000000;
    return x ^ mask;

    1. Can you elaborate on how you arrive at this inverse function? What is your expectation with respect to what is in the mask variable?

      1. James Sadler says:

        My goal is to implement a function that when passed the return value of sign_flip as its argument it returns the original value.

        The following must hold true: i.e. b = sign_flip(a) then a = inverse_sign_flip(b).

        My bit twiddling skills are not my string point, however the above code that I pasted seems to almost work – the result looks incorrect on the least significant bit.

        I wouldn’t be able to explain how it worked backwards, but I am surprised the result was so close to being correct for all values I’ve thrown at it (-ve, +ve, large and small numbers).

        1. I would recommend against doing it with trial and error…

          I have not tested it, but it should be something like that…

             uint64_t mask = uint64_t(int64_t(x ^ 0x8000000000000000) >> 63);
             mask |= 0x8000000000000000;
             return x ^ mask;

          Basically, you can recover the most significant bit by flipping it. It if is 1, you need to flip all bits… if it is 0, you need to flip just the most significant bit. I think that the code above does that… but I am not sure, you should check.