Daniel Lemire's blog

, 7 min read

Do you realize that you are using random hashing?

10 thoughts on “Do you realize that you are using random hashing?”

  1. Alexey Diomin says:

    Hi

    Main problem for hash it’s collisions resolution. Random hash it’s attempt exclude algorithmic attacks on hash table, because mostly collision resolution based on linked-list. But in java HashMap from jdk8 exist 2 different strategies: for buckets with low collisions use LinkedList, for buckets with high collisions use Balanced Tree. Strategies can switch in runtime based on statistics for buckets.

    http://openjdk.java.net/jeps/180

    =) I think it’s solved issue with security and in mostly attacks on hashtable.

    What you think about this technique?

  2. @Alexey

    Yes, in Java, they have introduced specific strategies to alleviate problems resulting from collision attacks in the HashMap class. On the surface, it sounds reasonable and should help people who rely on HashMap. However, programmers use the hashCode method to create their own data structures and algorithms, and some use data structures from other libraries. By not adopting random hashing, Java puts such programmers at risk.

  3. Alexey Diomin says:

    hi

    But how handle correct matching for distributed cache if every jvm will be use custom hashCode.

    Method hashCode can be override if you create own data structures (and it’s necessary, because supertype return wrong value for your instance). For Object.hashCode you can change behavior over -XX:hashCode=n

    Yes, you can’t override hashCode for internal java classes from jdk, but it’s really so important?

  4. Alexey Diomin says:

    Clustering for memcache apply on client side, also on client side we find necessary shard by hash. Each Memcache server it’s very simple key-value storage. Any changes for distribution apply in client code and “yes” we can change algo for hashing, but it’s make some value unreachable, because we will ask incorrect server. Can you get me link for random hashing for Memcache?

  5. Alexey Diomin says:

    Small example:

    N – count of servers in cluster,
    we try get data from server

    serverId = hash(object) % N

    but if we change hash function between runs or on different client then we can’t get data back =\

  6. @Alexey

    Well, for distributed systems, you have other constraints. For example, you should use consistent hashing… so whatever Java or Python provides is not good enough in the first place.

  7. Alexey Diomin says:

    Summary:
    1) internal hash never used for distributed systems
    2) algorithmic attacks on hash table not critical because we can auto-switch on tree for resolve collisions. (HashMap and ConcurrentHashMap implement this, 3th part implementation maybe yes, maybe not)
    3) for custom object we always have custom implementation for hashCode, because vm don’t know which field must be used for hash function
    4) Object.hashCode we always can change on start vm

    One place where random hash can be apply it’s internal classes from jvm, but problem really so important as some people thinking?

  8. @Alexey

    The important point is that, as a programmer, you need to know about random hashing as it is now widespread.

  9. Alexey Diomin says:

    I agreed with you, about every good programmer must know about random hashing (and I know about it), but I not agreed with necessary implement it in jvm.

  10. @Alexey

    Sure, we can debate whether the core java classes should implement random hashing. I personally think that they should. There is no real downside for competent programmers and increased security as an upside.