Daniel Lemire's blog


Probabilities and the C++ standard

7 thoughts on “Probabilities and the C++ standard”

  1. Max Lybbert says:

    Since the hash function isn’t specified, it seems to me that an implementation that picks a new hash function each time the program runs would be standard compliant. One way to “pick a new hash function on each run” would be to have the hash function rely on a random number that is allowed to differ between runs. However, I don’t know enough to say whether that actually would comply.

    But, even assuming that the hash function isn’t allowed to change between runs (i.e., it’s “set in stone”), I don’t think the standard prohibits a hash table from re-ordering elements in a bucket when they are accessed (move-to-front). Which would have the effect that iterating the hash table multiple times could return different orders even if there have been no insertions or deletions. But, again, I don’t know enough of the standard to say so definitively.
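    For what it’s worth, a minimal sketch of what such a per-run randomized hash could look like (the seed source and the xor mixing below are illustrative assumptions, nothing here is mandated by the standard):

    #include <cstddef>
    #include <functional>
    #include <random>
    #include <string>
    #include <unordered_map>

    // Hypothetical functor: mixes a seed chosen once per run into std::hash.
    struct randomized_hash {
        static const std::size_t seed; // differs from one run to the next
        std::size_t operator()(const std::string& k) const {
            return std::hash<std::string>{}(k) ^ seed;
        }
    };
    const std::size_t randomized_hash::seed = std::random_device{}();

    // The map works as usual, but bucket placement can vary between runs.
    std::unordered_map<std::string, int, randomized_hash> counts;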

  2. Geoff Hill says:

    The C++11 standard you link already gives you the capability to specify a custom hash function.

    Examine the template type parameters for the unordered_map class:

    class Key,
    class T,
    class Hash = std::hash<Key>,
    class Pred = std::equal_to<Key>,
    class Allocator = std::allocator<std::pair<const Key, T>>

    The third type parameter is Hash, and you can supply a custom type (a functor) to instantiate it.

    See this StackOverflow answer for a quick walkthrough of implementing custom hashing and key-equality functions:
    http://stackoverflow.com/a/15809592/1301580
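
    To make that concrete, here is a minimal sketch (the Point key type, the functor names, and the mixing constant are illustrative assumptions, not anything prescribed by the standard or the linked answer):

    #include <cstddef>
    #include <functional>
    #include <unordered_map>

    // Illustrative key type.
    struct Point { int x; int y; };

    // Custom hash functor: combine the hashes of the two coordinates.
    struct PointHash {
        std::size_t operator()(const Point& p) const {
            return std::hash<int>{}(p.x) * 31 + std::hash<int>{}(p.y);
        }
    };

    // Custom key-equality functor.
    struct PointEqual {
        bool operator()(const Point& a, const Point& b) const {
            return a.x == b.x && a.y == b.y;
        }
    };

    // Hash is the third template parameter, Pred the fourth.
    std::unordered_map<Point, int, PointHash, PointEqual> grid;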

  3. @Max

    have the hash function rely on a random number

    They go out of their way to specify that “h(k) shall depend only on the argument k”, so my interpretation is that it cannot depend on a random seed. (Update: Of course, what can they do to prevent you from doing exactly what you describe?)

    I don’t think the standard prohibits a hash table from re-ordering elements in a bucket when they are accessed

    Yes, the unordered map can certainly do something like this, but it would still mean that the same program, with the same inputs, would list the keys in the same order.

    Once you have random hashing, running the same program with the same inputs twice would give you different results.

    This might throw some people off.
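
    For illustration, a tiny program of the kind I have in mind (with the default hash, a given implementation prints the keys in the same order on every run; with a per-run randomized hash supplied as the Hash template parameter, the order could change from run to run):

    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> m{{"a", 1}, {"b", 2}, {"c", 3}};
        // Iteration order is whatever the bucket layout happens to be.
        for (const auto& kv : m)
            std::cout << kv.first << ' ';
        std::cout << '\n';
    }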

  4. Max Lybbert says:

    > Once you have random hashing, running the same program with the same inputs twice would give you different results.

    > This might throw some people off.

    There’s no question about that. The Perl guys originally turned random hashing on everywhere, but that broke several modules that relied on deterministic ordering of a hash’s contents, so they changed things to only use random hashing when they detected large numbers of collisions.

    But it turned out that wasn’t random enough, so they recently randomized it more ( http://blog.booking.com/hardening-perls-hash-function.html ). If there wasn’t a valid security rationale, I doubt the changes would have stayed. As it is, they upset several programmers by breaking expectations.

    I’m not on the committee, but in my opinion, it would be a mistake to add any requirements on the hashing function other than “it’s idempotent (per program run)” and “it minimizes collisions.”

  5. Max Lybbert says:

    Sorry, “idempotent” is the wrong term. I guess the best term is “it’s deterministic (per run of the program).” But, usually, you have to say “NOTE: the function *may* return different results for different runs of the program” to really get the point across.

  6. @Geoff

    Thanks. Yes, I was aware that, with some hard work, one could provide his own hash functions. And as I hint in my answer to Max, I believe that you can ignore the specification when building your own functions.

  7. Yves Orton says:

    In regard to “that might throw some people off”, I wanted to say that the Perl experience is that this is a *good* thing. It prevents a developer from relying on insertion order into the hash (many if not most hash implementations are sensitive to this). And it helps find edge-case scenarios most developers would never think to test.

    Most devs think that making things random makes it harder to reproduce bugs. The Perl experience is the opposite: the randomization effectively tries out every permutation of keys, and if the code is sensitive to this, it very quickly becomes apparent. Run it enough times and the failure will eventually occur.