I do not see any performance degradation caused by my code supporting other NR_ROWS_LOG values (I tried an implementation that supports only 20), because all the non-20 cases are #ifdef'd out of the code. Plus, the OpenCL compiler is very good at removing loops, such as the for-loop in equihash_round(), that become useless with 20.
This surprises me. I'll do some testing to confirm, but I don't think the compiler will optimize away all the code that is useless with 2^20 bins.
So far it looks like you are right. I #ifdef'd out the first_words code, and the generated ISA was the same size. That means the compiler is smart enough to optimize away the assignment on line 533 when mask=0:
https://github.com/mbevand/silentarmy/blob/master/input.cl#L533

I decided to do some more optimization before posting this, and it turns out I'm right. I removed all the code I could determine to be unneeded with 2^20 bins, and now I'm getting a ~10% speed improvement. 160 sol/s with 3x R9 380 4GB and 2x R7 370 2GB:
Total 160.7 sol/s [dev0 34.8, dev1 29.0, dev2 28.8, dev3 32.8, dev4 36.5] 5 shares
Total 160.9 sol/s [dev0 34.9, dev1 29.5, dev2 28.9, dev3 32.1, dev4 36.3] 5 shares
edit: After more testing I'd say the speed improvement is more like 15%.
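For anyone following along, here is a minimal sketch in plain C of the mask=0 case discussed above. It is not the code from input.cl: the names EXTRA_BITS, MASK, pack_word() and pack_word_ifdef() are hypothetical and only illustrate why an assignment guarded by a compile-time mask of 0 costs nothing, and why #ifdef'ing it out gives the same result.

```c
#include <assert.h>
#include <stdint.h>

#define NR_ROWS_LOG 20  /* compile-time setting, as in the thread */

/* Hypothetical helpers, not taken from input.cl. */
#define EXTRA_BITS (NR_ROWS_LOG - 20)
#define MASK ((1u << EXTRA_BITS) - 1u)  /* evaluates to 0 when NR_ROWS_LOG == 20 */

/* Generic version: with MASK == 0 the expression reduces to `word | 0`,
 * which an optimizing compiler can fold away entirely. */
static uint32_t pack_word(uint32_t word, uint32_t first_words)
{
    return word | (first_words & MASK);
}

/* Hand-pruned version: the same code with the non-20 path #if'd out. */
static uint32_t pack_word_ifdef(uint32_t word, uint32_t first_words)
{
#if NR_ROWS_LOG != 20
    return word | (first_words & MASK);
#else
    (void)first_words;  /* unused when NR_ROWS_LOG == 20 */
    return word;
#endif
}

/* Both versions must agree for any input when NR_ROWS_LOG == 20. */
static void check_equivalence(void)
{
    assert(pack_word(0x1234u, 0xffffffffu) == pack_word_ifdef(0x1234u, 0xffffffffu));
}
```

This only covers the trivial constant-mask case. The speedup reported above came from removing more code than that by hand, so the sketch says nothing about which parts of the real kernel the compiler fails to prune.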
Any chance this kernel will run on my 5970s? They have 1GB each....