22nd October 2015, 7 min read

Faster hashing without effort

11 thoughts on “Faster hashing without effort”

Leonid Boytsov says:

October 23, 2015 at 12:16 am

Looks like an excellent candidate for vectorization.
Jason Schulz says:

October 23, 2015 at 8:02 am

The complexity of a hash function isn’t always O(1), and sometimes the hash function has a larger complexity than the actual hash operations.

It’s weird that the security of a hash almost directly corresponds to the number of data dependencies.
Daniel Lemire says:

October 23, 2015 at 3:08 pm

@Leonid

Indeed, hashing can be vectorized, see our recent paper: http://arxiv.org/abs/1503.03465 For long strings, you can nearly hash 8 bytes per CPU cycle!
1. J. Andrew Rogers says:
  
  October 26, 2015 at 10:34 pm
  
  Daniel, you might want to check out the MetroHash algorithm family. The fastest variants can do >8 bytes/cycle on long strings without vectorization (slow variants are no worse than 5 bytes/cycle) and are among the fastest functions for short keys as well. More importantly, the randomness is extremely robust, similar to SHA-1.
  
  While vectorized variants of MetroHash are on my TODO list, a subtle design element of the MetroHash functions is that they are engineered to saturate multiple ALU ports on modern super-scalar architectures, which gives vector-like performance without vectorization. Main repository is here:
  
  https://github.com/jandrewrogers/MetroHash
  
  The algorithm is surprisingly simple, a minimal sequence of multiplications and rotates, but the anomalously strong statistical properties are the product of extensive computer analysis.
  1. Daniel Lemire says:
    
    October 27, 2015 at 12:08 am
    
    Do you have some documentation beside the source code? E.g., regarding the statistical properties?
    1. J. Andrew Rogers says:
      
      October 27, 2015 at 6:55 am
      
      I do have quite a bit of data on these functions, I will see if I can dig some up. They have been subjected to much more than the standard SMHasher diagnostics and analysis.
      
      The backstory is that I designed software that in theory could learn how to design and optimize hash functions within some simple constraints. An enormous amount of compute time later, the software was churning out hundreds of algorithms that were far better than the existing art. More interesting to me, a small subset of the functions appear at least as robustly random across a very wide range of tests as cryptographic functions, which I can’t quite explain yet.
Leonid Boytsov says:

October 23, 2015 at 9:08 pm

@Daniel, very nice. You could have mentioned this in the post. 🙂
1. Daniel Lemire says:
  
  October 23, 2015 at 10:08 pm
  
  You are right. I might write a more involved blog post later on fast “advanced” techniques to compute hash functions.
Viktor Szathmary says:

October 26, 2015 at 7:07 am

Can you get a speedup for byte[] or ByteBuffer hashing with a similar technique? (My first attempt was not fruitful)
1. Viktor Szathmary says:
  
  October 26, 2015 at 7:15 am
  
  Correction: for a byte[] the speedup is there, but for ByteBuffer it’s not (unless you get the backing byte[] of course), probably due some overhead of calling buf.get(i).
  1. Daniel Lemire says:
    
    October 26, 2015 at 1:34 pm
    
    Unfortunately, a ByteBuffer is an expensive abstraction in Java. Arrays have been optimized in Java to the point where they have basically the same speed as in C++, most of the time. Not so with ByteBuffer.