Re: SILENTARMY - v2 now supports GCN 1 GPUs (currently Linux only)

Quote from: mrb on November 04, 2016, 07:10:39 AM

Quote from: nerdralph on November 03, 2016, 03:37:42 PM

Marc, have you considered making NR_ROWS_LOG fixed at 20, and clearing cnt for each round? Then every entry in a row would already be a collision for that round, avoiding the need to search the row for collisions. NR_SLOTS/OVERHEAD could then be significantly reduced.
To avoid the time penalty of clearing the cnt values, instead of calling kernel_init_ht at each round, you could zero the count after checking the row. When the DRAM page is already open to check the cnt value, the time cost of doing a write back to the same page (in fact, the same 64-byte cache line) is minimal.
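
The read-then-clear idea can be sketched in plain C on the CPU. This is only an illustration, not the SILENTARMY kernel: the 64-byte row size, the counter sitting in the first 4 bytes of a row, and the names take_row_count and drain_table are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical row size: one row fits in a single 64-byte cache line here. */
#define ROW_BYTES 64u

/* Read the counter assumed to live in the first 4 bytes of a row, then zero
   it in place, so no separate clearing pass is needed before the next use. */
static uint32_t take_row_count(uint8_t *row)
{
    uint32_t cnt;
    assert(row != NULL);
    memcpy(&cnt, row, sizeof cnt);  /* the read touches the line */
    memset(row, 0, sizeof cnt);     /* the write goes to the same line */
    return cnt;
}

/* Visit every row once, consuming its counter; returns the sum of counts. */
static uint64_t drain_table(uint8_t *table, uint32_t nr_rows)
{
    uint64_t total = 0;
    for (uint32_t r = 0; r < nr_rows; r++)
        total += take_row_count(table + (size_t)r * ROW_BYTES);
    return total;
}
```

Calling drain_table a second time on the same table returns 0, which is the point: the table is already clean when the round ends.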

Yeah, NR_ROWS_LOG is pretty much always compiled at 20. But I have to offer the other options because people want to mine with GPUs having very little memory. I don't know if you checked the latest commits, but OVERHEAD has been lowered to 9, so 1 Equihash instance needs only 1.2 GB. I also recently made the exact change you suggested (clearing the counter after we are done using/reading it).
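
A rough check of the 1.2 GB figure, as a sketch under stated assumptions: about 2^21 entries per round, 32-byte slots, and two hash tables. The constant and function names below are hypothetical apart from the NR_ROWS_LOG and OVERHEAD parameters mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for illustration only, not quoted from the miner's source. */
#define APPROX_ELEMS_LOG 21u  /* roughly 2^21 entries per round */
#define SLOT_BYTES       32u  /* bytes per slot */
#define NR_TABLES         2u  /* two hash tables */

/* Slots per row: the average number of entries per row, times OVERHEAD. */
static uint64_t slots_per_row(uint32_t nr_rows_log, uint32_t overhead)
{
    assert(nr_rows_log <= APPROX_ELEMS_LOG);
    return ((uint64_t)1 << (APPROX_ELEMS_LOG - nr_rows_log)) * overhead;
}

/* Total bytes for all tables of one Equihash instance. */
static uint64_t instance_bytes(uint32_t nr_rows_log, uint32_t overhead)
{
    uint64_t rows = (uint64_t)1 << nr_rows_log;
    return NR_TABLES * rows * slots_per_row(nr_rows_log, overhead) * SLOT_BYTES;
}
```

Under these assumptions instance_bytes(20, 9) is 1,207,959,552 bytes, about 1.2 GB, while instance_bytes(20, 13) is 1,744,830,464 bytes.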

I think the vast majority (>95%) of miners have cards with at least 2 GB of RAM. The kernel I designed (and for which I only implemented a bit of prototype code) is quite similar to yours, but supports only 2^20 sorting bins (or rows). My design has 2 tables like yours (I consider it to be like double-buffering), but I hadn't thought of dividing the saved indexes between the two tables so they fit in 28 bytes, leaving 4 bytes for the collision counter.
I did see the change of OVERHEAD from 13 to 9, but I hadn't noticed the counter reset at the end of the round.
https://github.com/mbevand/silentarmy/blob/master/input.cl#L568
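
The two-table (double-buffered) design with 2^20 bins can be sketched as follows. This is a minimal illustration, not code from either kernel: which table a round starts on, and the use of the top 20 bits of a 32-bit hash word as the bin index, are assumptions, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NR_ROWS_LOG 20u  /* 2^20 sorting bins (rows) */
#define NR_TABLES    2u  /* two tables used in alternation */

/* Table written during a given round; tables alternate every round. */
static unsigned dst_table(unsigned round)
{
    return round % NR_TABLES;
}

/* Table read during a given round: the one written in the previous round. */
static unsigned src_table(unsigned round)
{
    assert(round > 0);  /* round 0 has no previous table to read */
    return (round - 1) % NR_TABLES;
}

/* Bin (row) index: assumed here to be the top NR_ROWS_LOG bits of the word. */
static uint32_t bin_index(uint32_t hash_word)
{
    return hash_word >> (32u - NR_ROWS_LOG);
}
```

With a fixed 2^20 rows the shift is a compile-time constant, which is one of the simplifications a fork optimized for that single size could rely on.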

Since you say you have to support options other than 2^20 rows, I'll probably fork your code and optimize it for 2^20. I think it should give another 10% performance boost. I also think even more performance can be attained by optimizing for the 256-byte stride size of the Polaris, Tonga, and Pitcairn GPUs. Despite your comment in the code about odd values of OVERHEAD being best for avoiding channel conflicts, I found 12 was faster than 13 in my testing on Tonga and Pitcairn.
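
To see how OVERHEAD interacts with a 256-byte stride, here is a small sketch that computes the row size in bytes. It assumes 32-byte slots and two slots per unit of OVERHEAD at 2^20 rows; both are assumptions for illustration, and the names are hypothetical. It does not by itself establish why one value benchmarks faster than another.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_BYTES    32u  /* assumed slot size */
#define STRIDE_BYTES 256u  /* stride size mentioned above */

/* Row size in bytes, assuming 2 * OVERHEAD slots per row at 2^20 rows. */
static uint32_t row_bytes(uint32_t overhead)
{
    assert(overhead > 0);
    return 2u * overhead * SLOT_BYTES;
}

/* Nonzero when every row starts on a stride boundary (table base aligned). */
static int row_is_stride_aligned(uint32_t overhead)
{
    return row_bytes(overhead) % STRIDE_BYTES == 0;
}
```

Under these assumptions OVERHEAD 12 gives 768-byte rows, an exact multiple of 256, while 13 gives 832 bytes and 9 gives 576 bytes, neither of which is a multiple of the stride.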