Re: limits of ZEC mining

Quote from: xeridea on November 19, 2016, 01:17:17 AM

Quote from: nerdralph on November 18, 2016, 02:59:31 PM

Yesterday I realized I forgot to account for the 1:1 command:burst ratio. Because of that, doing 64-byte random writes will be no faster than 128-byte random writes. It should still be faster than the 3 read + 2 write algorithm, as there will be no write-after-read delays, and the high page hit rate of reading the tables will free up some command slots for the random writes. About 4.2 cache line transfers per row is what I figure, or 269MB per round and 2.42GB per iteration. That's 173 sols/s for the Rx 470 before refresh overhead, or 160 sols/s using the 93% ratio. Claymore V6 does about 140 sols/s on the Rx 470, so there's only about 10-15% more room for improvement.

With Claymore 7.0 I get around 155 Sol/s on a 470 4GB and 164 Sol/s on a 480 4GB, both at default clocks. So either V7 is at or above the theoretical limit, there are additional tricks to gain speed, or the theoretical limit is actually higher.

I'm getting 150 sols/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated. If the 164 sols/s you are seeing on the Rx 480 is with a 1750MHz memory clock, then that is ~95% of my 173 sols/s instead of 93%. It's possible that refresh only impacts the burst transfers and not the command rate, which could mean no impact on the 173 sols/s figure, since less than 75% of the burst transfer slots are being used.
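For anyone who wants to check the arithmetic, here is a rough sketch of how the limit falls out of the figures in this thread. The 2.42GB per iteration and the 93% refresh ratio are from the post above; the 256-bit bus, the 7Gbps per-pin rate (what a 1750MHz GDDR5 clock works out to) and the average of about 1.88 solutions per Equihash 200,9 iteration are assumptions added here, and the function names are only illustrative.

```python
# Figures quoted in this thread.
GB_PER_ITERATION = 2.42   # 269MB per round x 9 rounds
REFRESH_RATIO = 0.93      # fraction of bandwidth left after refresh

# Assumptions (not stated in the thread).
BUS_WIDTH_BITS = 256      # Rx 470 / Rx 480 memory bus width
DATA_RATE_GBPS = 7.0      # per pin; 1750MHz GDDR5 x 4 transfers per clock
SOLS_PER_ITERATION = 1.88 # assumed average for Equihash 200,9


def bandwidth_gb_per_s(data_rate_gbps, bus_width_bits):
    """Peak memory bandwidth in GB/s from per-pin rate and bus width."""
    return data_rate_gbps * bus_width_bits / 8


def sols_limit(bandwidth, gb_per_iteration, sols_per_iteration):
    """Upper bound on sols/s if memory traffic is the only bottleneck."""
    return bandwidth / gb_per_iteration * sols_per_iteration


bandwidth = bandwidth_gb_per_s(DATA_RATE_GBPS, BUS_WIDTH_BITS)
raw_limit = sols_limit(bandwidth, GB_PER_ITERATION, SOLS_PER_ITERATION)
refresh_limit = raw_limit * REFRESH_RATIO
reported_ratio = 164 / 173  # the Rx 480 result against the raw limit

print(f"bandwidth:          {bandwidth:.0f} GB/s")
print(f"limit, no refresh:  {raw_limit:.0f} sols/s")
print(f"limit with refresh: {refresh_limit:.0f} sols/s")
print(f"164 vs 173:         {reported_ratio:.1%}")
```

With these inputs the two limits come out within rounding of the 173 and 160 sols/s quoted above. The small difference comes from the assumed 1.88 solutions per iteration.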

If someone writes a miner that gets substantially more than 173 sols/s (e.g. ~200 sols/s) on an Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way around the limit as I calculated it. I've discussed my ideas with other miner developers and have considered various ways of reducing the external GDDR bandwidth requirements, so I doubt anyone will find a serious mistake now. Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga. That is still too small, IMO, to help by more than a few percent.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes. With some work it should be possible to implement an Equihash algorithm where the average amount of data manipulated per round is 16 bytes per record. A GPU with a 32MB cache would then be limited to its cache bandwidth instead of the external GDDR5 bandwidth.
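A small sketch of where the 32MB figure comes from, and of what n reads plus n writes looks like in a single bucket-distribution pass. The 2**21 record count is an assumption based on the Equihash 200,9 parameters, not something stated above. bucket_pass is a hypothetical helper written for illustration, not code from any miner.

```python
import random

# Assumption: Equihash 200,9 starts with 2**21 records per iteration.
RECORDS = 2 ** 21
BYTES_PER_RECORD = 16      # average per round, from the post
L2_BYTES = 2 * 2 ** 20     # 2MB Polaris L2, from the post


def bucket_pass(keys, key_bits, bucket_bits):
    """Distribute keys into buckets by their top bits.

    Every key is read once and written once, so the pass costs
    n reads plus n writes for n keys.
    """
    reads = writes = 0
    buckets = [[] for _ in range(1 << bucket_bits)]
    shift = key_bits - bucket_bits
    for key in keys:
        reads += 1                         # one read of the record
        buckets[key >> shift].append(key)  # one write into its bucket
        writes += 1
    return buckets, reads, writes


working_set = RECORDS * BYTES_PER_RECORD
print(working_set // 2 ** 20, "MiB working set")
print(working_set // L2_BYTES, "times the 2MB L2")

keys = [random.getrandbits(20) for _ in range(1000)]
buckets, reads, writes = bucket_pass(keys, key_bits=20, bucket_bits=8)
print(reads, "reads,", writes, "writes for", len(keys), "keys")
```

The working set is 16 times the size of the 2MB L2, which is consistent with the cache being of little help today. With a cache large enough to hold all of it, the reads and writes counted here would stay on-chip.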