Post
Topic
Board Mining (Altcoins)
Re: SILENTARMY - v2 now supports GCN 1 GPUs (currently Linux only)
by
nerdralph
on 04/11/2016, 13:39:44 UTC
Marc, have you considered fixing NR_ROWS_LOG at 20 and clearing cnt for each round? Each row would then hold only collisions in each round, so there would be no need to search for collisions, and NR_SLOTS/OVERHEAD could be reduced significantly.
To avoid the time penalty of clearing the cnt values, instead of calling kernel_init_ht at each round, you could zero the count after checking the row. When the DRAM page is already open to check the cnt value, the cost of writing back to the same page (in fact, the same 64-byte cache line) is minimal.
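The read-then-zero idea can be sketched in plain C. This is only an illustration on the host side, not the SILENTARMY kernel: the function names and the flat row_cnt array are hypothetical, and the real code runs as OpenCL work-items and does not loop over rows like this.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a row's collision counter and zero it in the
 * same access, so no separate clearing pass is needed before the next round. */
static uint32_t take_row_count(uint32_t *row_cnt, size_t row)
{
    uint32_t cnt = row_cnt[row]; /* the read touches the counter's cache line */
    row_cnt[row] = 0;            /* write back to the same line while it is hot */
    return cnt;
}

/* Walk all rows once, counting the candidate pairs each row would yield
 * (n entries in a row give n*(n-1)/2 pairs) and leaving every counter at
 * zero for the next round. */
static uint64_t count_pairs_and_reset(uint32_t *row_cnt, size_t nr_rows)
{
    uint64_t pairs = 0;
    for (size_t r = 0; r < nr_rows; r++) {
        uint64_t n = take_row_count(row_cnt, r);
        if (n >= 2)
            pairs += n * (n - 1) / 2;
    }
    return pairs;
}
```

The point of the sketch is that the reset rides along with a read that has to happen anyway, which is why it should cost far less than a separate pass over the table.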

Yeah NR_ROWS_LOG is pretty much always compiled at 20. But I have to offer the other options because people want to mine with GPUs having very little memory. And I don't know if you checked the latest commits, but OVERHEAD has been lowered to 9 so 1 Equihash instance needs only 1.2 GB, and I recently made the exact change you suggested (clearing the counter after we are done using/reading it).
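As a rough sanity check on the 1.2 GB figure, here is a back-of-the-envelope sketch in C. It rests on assumptions that are not stated in the post: about 2^21 entries per round (Equihash with n=200, k=9 starts from 2^21 strings), 32-byte slots, two hash tables, and OVERHEAD applied as a multiplier on the average number of entries per row. The function names are made up for the example.

```c
#include <stdint.h>

#define APX_ELEMS_LOG 21u /* assumption: ~2^21 entries per round */
#define SLOT_BYTES    32u /* assumption: 32-byte slots */
#define NR_TABLES     2u  /* assumption: two tables, used alternately */

/* Slots reserved per row: average entries per row times the overhead factor. */
static uint64_t slots_per_row(uint32_t nr_rows_log, uint32_t overhead)
{
    return ((uint64_t)1 << (APX_ELEMS_LOG - nr_rows_log)) * overhead;
}

/* Total bytes needed for all tables of one Equihash instance. */
static uint64_t instance_bytes(uint32_t nr_rows_log, uint32_t overhead)
{
    uint64_t rows = (uint64_t)1 << nr_rows_log;
    return rows * slots_per_row(nr_rows_log, overhead) * SLOT_BYTES * NR_TABLES;
}
```

Under these assumptions, instance_bytes(20, 9) comes to 1,207,959,552 bytes, or about 1.2 GB, and instance_bytes(20, 13) to 1,744,830,464 bytes.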

I think the vast majority (>95%) of miners have cards with at least 2GB RAM. The kernel I designed (and for which I only implemented a bit of prototype code) is quite similar to yours, but it supports only 2^20 sorting bins (or rows). My design has 2 tables like yours (I consider it to be like double-buffering), but I hadn't thought of dividing the saved indexes between the two tables so that they fit in 28 bytes, leaving 4 bytes for the collision counter.
I did see the change of OVERHEAD from 13 to 9, but I hadn't noticed the counter reset at the end of the round.
https://github.com/mbevand/silentarmy/blob/master/input.cl#L568

Since you say you have to support options other than 2^20 rows, I'll probably fork your code and optimize it for 2^20. I think that should give another 10% performance boost. I also think even more performance can be gained by optimizing for the 256-byte stride size of the Polaris, Tonga, and Pitcairn GPUs. Despite your comment in the code about odd values of OVERHEAD being best for avoiding channel conflicts, I found 12 was faster than 13 in my testing on Tonga and Pitcairn.
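One possible way to look at the 256-byte stride point is to count how many 256-byte blocks a single row spans. The sketch below is a host-side illustration in C, not a measurement and not the explanation given above for the 12 vs 13 result. It assumes the table base is 256-byte aligned, rows are packed back to back, slots are 32 bytes, and each row has 2 * OVERHEAD slots; all names are hypothetical.

```c
#include <stdint.h>

#define STRIDE_BYTES 256u /* stride size mentioned in the post */
#define SLOT_BYTES   32u  /* assumption: 32-byte slots */

/* Assumption: with 2^20 rows, each row reserves 2 * OVERHEAD slots. */
static uint32_t row_bytes_for_overhead(uint32_t overhead)
{
    return 2u * overhead * SLOT_BYTES;
}

/* Largest number of 256-byte blocks touched by any one of the first
 * nr_rows rows, with rows packed contiguously from an aligned base. */
static uint32_t max_strides_per_row(uint32_t row_bytes, uint32_t nr_rows)
{
    uint32_t worst = 0;
    for (uint32_t r = 0; r < nr_rows; r++) {
        uint64_t first = (uint64_t)r * row_bytes / STRIDE_BYTES;
        uint64_t last = ((uint64_t)(r + 1) * row_bytes - 1) / STRIDE_BYTES;
        uint32_t n = (uint32_t)(last - first + 1);
        if (n > worst)
            worst = n;
    }
    return worst;
}
```

With these assumptions, OVERHEAD 12 gives 768-byte rows that each cover exactly 3 blocks, while OVERHEAD 13 gives 832-byte rows that straddle block boundaries and touch 4. Whether that accounts for the measured difference on Tonga and Pitcairn would need profiling to confirm.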