Note that I don't expect tromp's Cuckoo hash to be GPU resistant, because it can be parallelized, even if only with sublinear speedup. I don't know if anyone has tested parallelizing it on more than the two dozen cores he tested.
I tested it for him on 32 cores. I don't know the results; you would have to ask him (I just ran some code he gave me on an idle system).
I summarized the results in my private response on May 1, reproduced below.
Note that the version you tested is the older one that didn't include edge trimming.
From all my benchmarking, I notice that AMD Opteron servers, as well as Intel Xeon
servers with quad and octa cpus, appear to saturate memory at under 32 threads.
Dual-cpu Xeons appear to have a superior (at least for Cuckoo Cycle) memory subsystem
that allows them to keep scaling well beyond 32 threads.
----------------------------------------------------------------------------
Thanks, smooth!
As I feared, Xeons perform much worse in octa-cpu configuration
than in dual-cpu. The speedup flattens out at about 16 threads.
If you run the speedup test on one of your dual-Xeon systems
with 16+ (hyper)threads, you'll see performance scaling much
better and presumably not flattening out,
like the green speedup30 curve in my paper at page 6 bottom left.
My favourite test target would be a system with dual
Xeon E7-28(70/80/90) that can run 60 hyperthreads...
-John
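
For anyone wanting to check where their own speedup curve flattens out, here is a minimal sketch in Python. It assumes you have wall-clock times per thread count from your own runs; the function names and the 10% gain threshold are my own choices for illustration, and the timings at the bottom are made-up placeholders, not results from any of the systems discussed above.

```python
def speedups(times):
    """Map thread count -> speedup relative to the single-thread run.

    `times` maps thread count -> wall-clock seconds and must include 1.
    """
    base = times[1]
    return {n: base / t for n, t in sorted(times.items())}


def flattening_point(times, min_gain=1.1):
    """Return the thread count after which the curve flattens.

    The curve is treated as flat once moving to the next measured
    thread count improves speedup by less than `min_gain` (a factor,
    1.1 = 10%; an arbitrary threshold). Returns None if it never does.
    """
    s = speedups(times)
    counts = sorted(s)
    for a, b in zip(counts, counts[1:]):
        if s[b] / s[a] < min_gain:
            return a
    return None


# Placeholder timings in seconds (NOT measurements from this thread).
example = {1: 64.0, 2: 32.0, 4: 16.0, 8: 8.0, 16: 4.0, 32: 3.9}
print(flattening_point(example))  # prints 16
```

With real numbers, a memory-saturated box would report a small thread count here, while a system that keeps scaling would return None.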