SA5 has no more easy optimizations left; the current version maxes out the memory bandwidth of the card.
Does it also max out the PCIe bus bandwidth? I'm trying to understand why my Nitro+ RX 470 (at 75-80 Sol/s) is finding as many shares as my R9 295X2 (2 x 90 Sol/s)... Any idea?
With the on-GPU solution pruning, it barely uses any PCIe bus bandwidth. My comment about maxing out the memory bandwidth is a simplification of a complex problem. The memory path includes the L1 and L2 caches as well as the external GDDR5. I believe part of the current performance bottleneck is due to L1/L2 cache thrashing. There are at least a couple of ways of solving the problem, but none (that I can think of) are easy.
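
To put a rough number on "barely uses any PCIe bus bandwidth", here is a back-of-the-envelope sketch. It assumes only finished solutions cross the bus, that each Equihash (n=200, k=9) solution is copied as 512 indices of 32 bits each, and that the bus manages a deliberately pessimistic 1 GB/s (think of a x1 riser). The function names and the 1 GB/s figure are my own illustration, not taken from the miner's code, and real transfers add some overhead on top of this.

```python
# Rough estimate of PCIe traffic when only finished solutions leave the GPU.
# All constants below are assumptions for illustration, not measured values.

INDICES_PER_SOLUTION = 512   # Equihash (n=200, k=9): 2**9 indices per solution
BYTES_PER_INDEX = 4          # assumption: each index is copied as a 32-bit integer
ASSUMED_BUS_BYTES_PER_SEC = 1e9  # assumption: pessimistic 1 GB/s link (e.g. x1 riser)


def solution_traffic_bytes_per_sec(sol_per_sec: float) -> float:
    """Bytes per second needed to copy solutions back to the host."""
    return sol_per_sec * INDICES_PER_SOLUTION * BYTES_PER_INDEX


def bus_fraction(sol_per_sec: float,
                 bus_bytes_per_sec: float = ASSUMED_BUS_BYTES_PER_SEC) -> float:
    """Fraction of the assumed bus bandwidth consumed by solution traffic."""
    return solution_traffic_bytes_per_sec(sol_per_sec) / bus_bytes_per_sec


if __name__ == "__main__":
    rate = 90.0  # Sol/s, the per-GPU figure quoted for the R9 295X2 above
    print(f"{solution_traffic_bytes_per_sec(rate):.0f} bytes/s")
    print(f"{bus_fraction(rate):.6%} of the assumed bus bandwidth")
```

Under these assumptions the solution traffic is a few hundred kilobytes per second at most, which is why the bus is unlikely to explain the difference in shares found.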