RX480 with amdgpu-pro 16.30
Total 55.8 sol/s [dev0 54.0] 18 shares
Total 55.3 sol/s [dev0 52.4] 18 shares
Total 55.6 sol/s [dev0 54.7] 18 shares
Total 55.9 sol/s [dev0 55.7] 18 shares
Total 55.0 sol/s [dev0 55.7] 18 shares
Total 55.5 sol/s [dev0 56.2] 18 shares
Total 55.2 sol/s [dev0 56.1] 19 shares
Total 54.6 sol/s [dev0 54.8] 19 shares
Total 54.9 sol/s [dev0 55.3] 19 shares
Total 55.1 sol/s [dev0 53.1] 19 shares
Total 54.4 sol/s [dev0 52.6] 19 shares
Kernel:
http://coinsforall.io/distr/input.cl.coll1
NVIDIA cards also see a speedup.
I reduced the number of collisions to find from 5 to 1. It seems 5 is too many, but I need mrb's comments on this.
I've just finished testing this, and simply changing the collisions array from 5 to 1 gives a few percent increase in performance.
This also shows that there is potential for additional performance increases by improving the pruning of duplicate indices. For a couple of days I've been trying to think of a way to do this quickly (i.e. without expanding the indices), but I don't have a solution yet.
Also tested reducing the collision search space for later rounds using the formula:
min(cnt, (uint)(NR_SLOTS -(uint)(round-4)))
No reduction in the number of solutions generated, but no material improvement in speed either.
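For anyone who wants to see what that clamp actually does per round, here is a minimal host-side C sketch of the same expression. The function name clamp_cnt and the NR_SLOTS value of 512 are made up for illustration and are not taken from the kernel; the real NR_SLOTS depends on the kernel's build parameters. It assumes 32-bit unsigned arithmetic, as with OpenCL's uint.

```c
#include <stdint.h>

/* Hypothetical slot count, for illustration only; not the kernel's real value. */
#define NR_SLOTS 512u

/* Same expression as in the post: min(cnt, NR_SLOTS - (round - 4)).
 * All arithmetic is unsigned 32-bit, so for round < 4 the subtraction
 * (round - 4) wraps around and the cap ends up ABOVE NR_SLOTS. */
static uint32_t clamp_cnt(uint32_t cnt, uint32_t round)
{
    uint32_t cap = NR_SLOTS - (round - 4u);
    return cnt < cap ? cnt : cap;
}
```

With these assumed numbers, round 4 leaves the cap at NR_SLOTS, and each later round lowers it by one slot (round 8 gives NR_SLOTS - 4). For rounds below 4 the unsigned wraparound puts the cap above NR_SLOTS, so a cnt that is at most NR_SLOTS is not clamped at all. The search space shrinks by only a handful of slots in total, which would be consistent with seeing no material change in speed.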