Re: SILENTARMY v3: now a full miner! multi-GPU, Stratum support (Linux only)

I do not see any performance degradation caused by my code supporting other NR_ROWS_LOG values (I have tried implementing only 20), because all the non-20 cases are #ifdef'd out of the code. Plus, the OpenCL compiler is very good at removing loops such as the for-loop in equihash_round() that becomes useless with 20.

This surprises me. I'll do some testing to confirm, but I don't think the compiler will optimize away all the code that is useless with 2^20 bins.

So far it looks like you are right. I #ifdef'd out the first_words code, and the generated ISA was the same size. That means the compiler is smart enough to optimize away the assignment on line 533 when mask = 0:
https://github.com/mbevand/silentarmy/blob/master/input.cl#L533
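To illustrate the kind of dead-code elimination being discussed, here is a minimal plain-C sketch. It is not the code from input.cl: EXTRA_BITS_MASK, count_pairs() and the NR_ROWS_LOG == 19 mask value are hypothetical stand-ins. The idea is only that when the mask is the compile-time constant 0, the comparison is always true, so an optimizing compiler is free to drop the loads and the masking entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build-time parameter; the real kernel is configured differently. */
#ifndef NR_ROWS_LOG
#define NR_ROWS_LOG 20
#endif

/* Assumed mask values, chosen for illustration only. */
#if NR_ROWS_LOG == 20
#define EXTRA_BITS_MASK 0x00u /* no extra bits left to compare */
#elif NR_ROWS_LOG == 19
#define EXTRA_BITS_MASK 0x01u
#else
#error "this sketch only covers NR_ROWS_LOG 19 and 20"
#endif

/* Count the pairs in one bin whose masked first bytes are equal. */
static unsigned count_pairs(const uint8_t *first_bytes, size_t n)
{
    unsigned pairs = 0;

    assert(first_bytes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            /* With a constant mask of 0 this condition is always true,
             * so the loads and the XOR are dead code. */
            if (((first_bytes[i] ^ first_bytes[j]) & EXTRA_BITS_MASK) == 0)
                pairs++;
        }
    }
    return pairs;
}
```

With the default NR_ROWS_LOG of 20, every pair matches, so a bin of n entries yields n*(n-1)/2 pairs regardless of the data. Whether a given OpenCL compiler removes every such leftover is exactly what has to be checked by comparing the generated ISA or by benchmarking.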

I decided to do some more optimization before posting this, and it turns out my original suspicion was right. I removed all the code I could determine to be unneeded with 2^20 bins, and now I'm getting a ~10% speed improvement: about 160 sol/s with 3x R9 380 4GB and 2x R7 370 2GB:

Code:

Total 160.7 sol/s [dev0 34.8, dev1 29.0, dev2 28.8, dev3 32.8, dev4 36.5] 5 shares
Total 160.9 sol/s [dev0 34.9, dev1 29.5, dev2 28.9, dev3 32.1, dev4 36.3] 5 shares

edit: After more testing I'd say the speed improvement is more like 15%.

Any chance this kernel will run on my 5970s? They have 1GB each...
I could reduce the kernel size, couldn't I?