13th March 2018, 9 min read

Iterating over hash sets quickly in Java

11 thoughts on “Iterating over hash sets quickly in Java”

Roman Leventov says:

March 13, 2018 at 10:10 pm

HashSet jumps over the memory randomly. Locality could be a big contribution factor. See https://shipilev.net/jvm-anatomy-park/11-moving-gc-locality/

It would be interesting to see how do benchmark results change if System.gc() is called right before the iteration.
1. Daniel Lemire says:
  
  March 13, 2018 at 10:11 pm
  
  HashSet jumps over the memory randomly.
  
  Can you elaborate?
  1. Roman Leventov says:
    
    March 13, 2018 at 10:20 pm
    
    When elements are inserted one after another in a HashSet, newly allocated HashMap.Entry are allocated sequentially (in a TLAB), but end up being stored at random positions in the HashMap’s array. HashSet’s (= HashMap’s) iteration goes over the array sequentially, but it means jumping over HashMap.Entries, allocated at random positions in TLAB.
    
    System.gc() may actually fix that, read the Alexey Shipilev’s post for details.
    
    LinkedHashMap links the entries in the allocation order, and iterates in the same order, it means that it scans TLAB memory sequentially.
    1. Daniel Lemire says:
      
      March 14, 2018 at 12:46 am
      I modified my code so that I call gc after creating the sets. It seems to only have a small effect. Calling the gc can even have a negative effect.
      
      I tried recreating a new hash set from the previous hash set in the hope that the data would be allocated in the right order…
      
      No luck:
      
      Benchmark (gc) Mode Samples Score Error Units m.l.m.h.OrderHash.scanHashSet false thrpt 5 51.079 Â± 0.431 ops/s m.l.m.h.OrderHash.scanHashSet true thrpt 5 68.155 Â± 0.484 ops/s m.l.m.h.OrderHash.scanHashSet2 false thrpt 5 80.694 Â± 0.295 ops/s m.l.m.h.OrderHash.scanHashSet2 true thrpt 5 71.542 Â± 0.377 ops/s m.l.m.h.OrderHash.scanLinkedHashSet false thrpt 5 165.367 Â± 2.858 ops/s m.l.m.h.OrderHash.scanLinkedHashSet true thrpt 5 173.329 Â± 2.596 ops/s
      
      See https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2018/03/13/src/main/java/me/lemire/microbenchmarks/hash/OrderHash.java
      1. Roman Leventov says:
        
        March 14, 2018 at 7:17 am
        
        Degradation is observed when a “recreated” set is iterated, as far as I can see. Possible reason might be that GC scans arrays in decreasing order (I don’t know and don’t assert that!), So GC might put objects in not very favourable order again.
        
        Anyway, it’s all just assumtions. Both of your benchmark results might be completely misleading, affected by some bugs or contribution factors that we didn’t consider yet. Only study of perfasm with cycle as well as cache miss counts could help to answer those questions for sure.
      2. Illiash says:
        
        March 14, 2018 at 1:44 pm
        
        That’s work for me by the way 😀
Me says:

March 14, 2018 at 7:18 am

Fastutil has additional implementations.
In particular, a hash set without chaining, but open addressing.

Can you include these in your benchmark? It would be interesting to see the differences even without the primitive specializations that make fastutil attractive.
1. Daniel Lemire says:
  
  March 19, 2018 at 6:31 pm
  
  Pull Requests invited!
Mario Marinero says:

March 24, 2018 at 3:56 pm

JavaScript only preserves insertion order for non-numeric properties. The actual iteration order is:

Numeric properties in ascending order.
String properties in insertion order.
Symbols.

ES 3 however described objects as “unordered”. A few years ago Firefox used insertion order for all properties and V8 browsers the current behavior. It seems ES6 finally established the iteration rules as it was causing quite a few bugs. http://2ality.com/2015/10/property-traversal-order-es6.html
1. Daniel Lemire says:
  
  March 24, 2018 at 5:06 pm
  JavaScript only preserves insertion order for non-numeric properties. (…) Numeric properties in ascending order.
  
  Can you please type this in your favorite JavaScript console:
```
$ var x = new Set()
$ x.insert(3)
$ x.insert(2)
$ x.insert(1)
$ for (let i of x) console.log(i);
```
  (Update: I had pasted Swift code initially.)
zakhar says:

September 5, 2019 at 2:59 pm

I executed test on my pc and it show the following results

Benchmark (gc) Mode Cnt Score Error Units
collectionUtils.OrderHash.scanHashSet false thrpt 5 39,380 ± 1,357 ops/s
collectionUtils.OrderHash.scanHashSet true thrpt 5 38,749 ± 0,897 ops/s
collectionUtils.OrderHash.scanHashSet2 false thrpt 5 55,521 ± 15,549 ops/s
collectionUtils.OrderHash.scanHashSet2 true thrpt 5 64,839 ± 3,105 ops/s
collectionUtils.OrderHash.scanLinkedHashSet false thrpt 5 141,776 ± 23,246 ops/s
collectionUtils.OrderHash.scanLinkedHashSet true thrpt 5 143,821 ± 1,604 ops/s

why is that scanHashSet2 is almost twice as fast as scanHashSet?
it depends on the way it was filled?