I wonder how lookup tables and block compression would perform in comparison to AVX. (compression as all states are probably not equally likely)

foobarsays:

I wonder how complicated the logic of computing new cell liveness value would be using binary logic – essentially reading cell and its neighborhood in nine (interleaved) AVX registers compromising of 2304 cell values and computing 256 cell liveness updates per step using essentially circuit logic of ANDs, ORs, XORs and such.

foobarsays:

Unsurprisingly somebody has had a look at this already, and apparently 35 logic gates are enough for the task:

Considering consecutive cells have four shared neighbors, this could be reduced to maybe 30 ANDs/ORs per round of 256 cells. (Further reductions are possible, but not really practical on circuit level.) Modern CPUs can do three such operations per clock cycle, and if register pressure doesn’t turn out to be a problem and overhead related to edges of vectors could be handled, this ineffective-sounding approach might be almost tenfold times faster than vectorized variant in the blog post!

But this is a back-of-an-envelope estimate without any code written for the task.

I would guess that representing each row as bits in a SIMD register would yield good results, especially because we can (with a little adjustment) reuse the 3×1 sum 3 times (for the row above, below and the row itself).

Peter McNeeleysays:I wonder how lookup tables and block compression would perform in comparison to AVX. (compression as all states are probably not equally likely)

foobarsays:I wonder how complicated the logic of computing new cell liveness value would be using binary logic – essentially reading cell and its neighborhood in nine (interleaved) AVX registers compromising of 2304 cell values and computing 256 cell liveness updates per step using essentially circuit logic of ANDs, ORs, XORs and such.

foobarsays:Unsurprisingly somebody has had a look at this already, and apparently 35 logic gates are enough for the task:

https://www.moria.us/old/3/programs/life/

Considering consecutive cells have four shared neighbors, this could be reduced to maybe 30 ANDs/ORs per round of 256 cells. (Further reductions are possible, but not really practical on circuit level.) Modern CPUs can do three such operations per clock cycle, and if register pressure doesn’t turn out to be a problem and overhead related to edges of vectors could be handled, this ineffective-sounding approach might be almost tenfold times faster than vectorized variant in the blog post!

But this is a back-of-an-envelope estimate without any code written for the task.

Peter Turneysays:Golly uses a hashlife algorithm. I’ve been using it lately and finding it very useful. It can be astoundingly fast for some patterns.

https://en.wikipedia.org/wiki/Golly_(program)

http://golly.sourceforge.net/

http://conwaylife.com/w/index.php?title=Golly

Geoff Langdalesays:I would guess that representing each row as bits in a SIMD register would yield good results, especially because we can (with a little adjustment) reuse the 3×1 sum 3 times (for the row above, below and the row itself).

Charly Csays:I did just that in this gist: https://gist.github.com/CharCoding/52fb584fab2d3632fe2225880890463e

It’s a 256×256 board that wraps around, and the board is printed as braille pixels.