Genoil - I KNOW this is a nitpick - but I noticed as I was going through and rewriting a lot of the Ethash OpenCL... this one just bugs the shit outta me:
bool update_share = thread_id == ((a >> 2) & (THREADS_PER_HASH - 1));
Were it performance critical, that would be ouch. As is, on AMD, it's just a little eww, when you could write:
bool update_share = thread_id == amd_bfe(a, 2U, 3U);
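For anyone who wants to convince themselves the two forms agree, here is a rough host-side C sketch. It assumes THREADS_PER_HASH is 8 (so the mask is 3 bits wide, matching the 3U width above), and bfe_u32 is a hypothetical stand-in that mimics how I understand the unsigned amd_bfe from cl_amd_media_ops2 to behave; it is not the real built-in, so treat it as an illustration only.

```c
#include <stdint.h>

/* Assumption: 8 threads cooperate on one hash, so the mask covers 3 bits. */
#define THREADS_PER_HASH 8u

/* Hypothetical stand-in for the unsigned amd_bfe built-in:
 * extract `width` bits of `src` starting at bit `offset`. */
static uint32_t bfe_u32(uint32_t src, uint32_t offset, uint32_t width)
{
    offset &= 31u;
    width &= 31u;
    if (width == 0u)
        return 0u;
    if (offset + width < 32u)
        return (src << (32u - offset - width)) >> (32u - width);
    return src >> offset;
}

/* The shift-and-mask form from the original kernel line. */
static uint32_t shift_and_mask(uint32_t a)
{
    return (a >> 2) & (THREADS_PER_HASH - 1u);
}

/* Count sampled inputs on which the two forms disagree. */
static uint32_t count_mismatches(void)
{
    uint32_t mismatches = 0u;
    uint32_t a = 0u;
    for (uint32_t i = 0u; i < 1000000u; ++i) {
        if (bfe_u32(a, 2u, 3u) != shift_and_mask(a))
            ++mismatches;
        a = a * 1664525u + 1013904223u; /* LCG step to spread the samples */
    }
    return mismatches;
}
```

Calling count_mismatches() from a small main should report zero disagreements. Note that the equivalence only holds while THREADS_PER_HASH stays at 8; a different power of two would need a different width argument.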
Thanks, I'll have a look at it. Up until now it's been highly demotivating to try to optimize the OpenCL kernel, because nothing really made a difference. CUDA has been much more willing to give me a few % improvement over the baseline.