The latest version (2011-08-04) has a major problem, as far as I can see.
The assumption that there won't be more than one valid nonce per kernel execution is wrong. At aggression 14, for example, each kernel execution tests 2^30 nonces, so the chance of more than one valid nonce in any given kernel execution is about 2.5% (if I did the math right). Any extra valid nonces are lost, which effectively causes a net loss in performance compared to the previous version at high aggression. At lower aggression values (10 and below) this is less of a problem, since the performance loss in these cases will be much less than 1%.
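For anyone who wants to check the math, here is a rough sketch of the estimate. It rests on assumptions that are not stated in the post: each nonce is treated as a valid difficulty-1 share with probability 2^-32, the number of valid nonces per kernel execution is modelled as Poisson, and the nonce count per execution is extrapolated as 2^(aggression + 16) from the single 2^30 figure quoted for aggression 14. The function names are made up for illustration.

```python
import math

# Assumption: a nonce is a valid difficulty-1 share with probability 2^-32.
P_VALID = 2.0 ** -32


def nonces_per_execution(aggression):
    # Assumption: extrapolated from "aggression 14 tests 2^30 nonces".
    return 2 ** (aggression + 16)


def multi_nonce_stats(aggression):
    """Return (P(more than one valid nonce), fraction of valid nonces lost).

    The lost fraction assumes the kernel reports at most one valid nonce
    per execution and silently drops the rest.
    """
    lam = nonces_per_execution(aggression) * P_VALID  # expected valid nonces
    p_at_least_one = -math.expm1(-lam)                # 1 - e^-lam
    p_exactly_one = lam * math.exp(-lam)
    p_multi = p_at_least_one - p_exactly_one
    # Expected reported nonces is P(N >= 1); expected found nonces is lam.
    lost_fraction = 1.0 - p_at_least_one / lam
    return p_multi, lost_fraction


if __name__ == "__main__":
    for aggression in (10, 12, 14):
        p_multi, lost = multi_nonce_stats(aggression)
        print(f"aggression {aggression}: P(>1 valid) = {p_multi:.4%}, "
              f"valid nonces lost = {lost:.4%}")
```

Under these assumptions the aggression 14 case comes out at roughly 2.6% for more than one valid nonce per execution, in line with the estimate above. The second value is only as good as the "one nonce reported per execution" assumption.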
You have to weigh the loss of valid nonces against the higher efficiency gained by removing control flow from the kernel (all current GPUs dislike if/else and similar branching). I thought this tradeoff would be well worth it, but you could prove me wrong. I was thinking about a better way of writing the positive nonces into output, but that didn't work.
Any good ideas for that part of the kernel will be a big plus!
Dia