Post
Topic
Board Mining (Altcoins)
Re: SILENTARMY v5: Zcash miner, 115 sol/s on R9 Nano, 70 sol/s on GTX 1070
by nerdralph on 13/11/2016, 13:59:46 UTC
SA5 has no more easy optimizations left; the current version maxes out the memory bandwidth of the card.

Does it also max out the PCI-E bus bandwidth? I'm trying to understand why my Nitro+ RX470 (at 75~80 Sol/s) is finding as many shares as my R9 295x2 (2 x 90 Sol/s)...  Any idea?

With the on-GPU solution pruning it barely uses any PCI-e bus bandwidth. My comment about maxing out the memory bandwidth is a simplification of a complex problem. The memory path includes the L1 and L2 caches as well as the external GDDR5. I believe parts of the current performance bottleneck are due to L1/L2 cache thrashing. There are at least a couple of ways of solving the problem, but none (that I can think of) are easy.


I was profiling it yesterday; the limiting factor right now is LDS. To optimize it, the total memory accessed by the round kernel wave would have to be reduced. Currently it comes to about 16K. CodeXL says that utilization during rounds is 12.5% due to this. I can send you more data if you want, or you can profile it yourself with CodeXL 2.2.

I don't use profilers. Unless you really understand the code and underlying architecture, and how the profiler works, you can easily get misled by a red herring. Here's what I use:
AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps
That, plus the AMD GCN architecture docs and some tweaks, is all I need to analyze performance.

The current bottleneck in ht_store could be avoided without using LDS, and I already discussed this with eXtremal. It would involve a 16-way store operation in which each of 16 threads stores 1 uint, together covering 2 slots (even if only one slot is actually modified). The resulting 64-byte write (16 x 4 bytes) would match the cache line and avoid L1/L2 cache thrashing. See the GCN architecture doc:
"Cache lines are written back to the L2 when all 64 stores in a wavefront instruction have finished. Lines with all dirty data are also kept in the L1D, while any partially clean line is evicted from the L1D."
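
To make the 16-way store idea concrete, here is a rough host-side sketch in plain C, not actual SILENTARMY kernel code. It assumes 32-byte slots (so 2 slots fill one 64-byte line), and the names lane_store and store_slot_full_line are made up for illustration. Each of the 16 "lanes" writes exactly one uint: lanes covering the slot being modified write the new data, and the other lanes write back the value already there, so every word of the line ends up dirty.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes: 32-byte slots (8 uints), 64-byte cache line (16 uints). */
enum { SLOT_WORDS = 8, LINE_WORDS = 16, SLOTS_PER_LINE = 2 };

/* What a single lane (0..15) would do: write one uint of the line.
 * Lanes in the target slot write new data; the rest rewrite the
 * existing value. Returns a one-bit mask for the word written. */
static uint32_t lane_store(uint32_t line[LINE_WORDS], unsigned lane,
                           unsigned target_slot,
                           const uint32_t new_slot[SLOT_WORDS])
{
    assert(lane < LINE_WORDS && target_slot < SLOTS_PER_LINE);
    unsigned slot = lane / SLOT_WORDS;   /* which of the 2 slots */
    unsigned word = lane % SLOT_WORDS;   /* uint index inside that slot */
    line[lane] = (slot == target_slot) ? new_slot[word] : line[lane];
    return 1u << lane;
}

/* Runs all 16 lanes in a loop (on the GPU they would run in parallel).
 * Returns the mask of words written; 0xFFFF means the whole 64-byte
 * line was covered by the store. */
static uint32_t store_slot_full_line(uint32_t line[LINE_WORDS],
                                     unsigned target_slot,
                                     const uint32_t new_slot[SLOT_WORDS])
{
    uint32_t dirty = 0;
    for (unsigned lane = 0; lane < LINE_WORDS; lane++)
        dirty |= lane_store(line, lane, target_slot, new_slot);
    return dirty;
}
```

This only shows the access pattern. In a real kernel, rewriting the unmodified slot could race with another work-item updating that same slot, so that would need to be handled somehow; the sketch ignores it.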