In the original version the program uses 683 blocks and 768 threads per block.
With your modification it uses 32x15 = 480 blocks and 768 threads per block.
However, the number of threads is 524288, which in my opinion is the reason why I get
"does not validate on cpu", and also why 683 was chosen: it is just threads / threads_per_block, rounded up (524288 / 768 is about 682.7).
This gives me around 36 MHash/s.
I changed 768 to 512 and then I get 39~40 MHash/s,
with no rejects but a high rate of "does not validate".
That means a large fraction of the shares are just thrown away.
I have the feeling it is faster because it throws away a lot of things...
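
To put numbers on this, here is a small Python sketch of the arithmetic. It only reproduces the figures quoted in this post. The idea that the kernel needs one launched thread for each of the 524288 work items is my assumption and I have not checked it against the kernel code. `blocks_needed` is a hypothetical helper, not a function from the miner.

```python
def blocks_needed(total_threads: int, threads_per_block: int) -> int:
    """Ceiling division: smallest block count that covers total_threads."""
    return (total_threads + threads_per_block - 1) // threads_per_block


TOTAL_THREADS = 524288  # thread count quoted in this post

# Original version: block count derived from the thread count.
original_blocks = blocks_needed(TOTAL_THREADS, 768)  # 683

# Modified version: fixed 32 x 15 grid, two block sizes tried.
modified_blocks = 32 * 15                    # 480
covered_768 = modified_blocks * 768          # 368640 threads launched
covered_512 = modified_blocks * 512          # 245760 threads launched

# Fraction of the 524288 threads actually launched in each case
# (assumption: anything not launched produces no valid result).
fraction_768 = covered_768 / TOTAL_THREADS   # 0.703125
fraction_512 = covered_512 / TOTAL_THREADS   # 0.46875

print(original_blocks, covered_768, covered_512)
print(fraction_768, fraction_512)
```

If that assumption holds, the 480-block grid covers only about 70% of the expected threads at 768 threads per block and about 47% at 512, which would fit with a higher hash rate and more results that fail validation.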
It would be interesting to have Christian's opinion on that.
Is there a way to decrease the number of threads (assuming that works)?