Very interesting. I get about 2% gain on 7950 and need to use (mod % 2) with the case statements adjusted accordingly.
My 280X gains almost 6% as is, but the gain difference between (mod % 2) and (mod % 4) is pretty small, like 1-2 KHs
My SMix call is a bit different, I simply put the sub-calls inline so it doesn't bother with ScratchpadStore and ScratchpadMix.
Perhaps this fits nicer into the core and needs less swapping.
I have tried, unsuccessfully, to further streamline the SMix, but any other way I do it, its either all HW errors or vastly slower. Any guidance here would be appreciated.
How do you adjust the case 2 accordingly?
(number % power_of_two) should be a shooting offense when coding for a brain dead compiler.