In the original version the program uses 683 blocks and 768 threads per block.
With your modification it uses 32x15=480 blocks and 768 threads per block.
However, the number of threads is 524288, which in my opinion is the reason why I get
the "does not validate on cpu" error, and why 683 was chosen, since it is just threads/threads_per_block, rounded up.
This gives me around 36 MHash/s.
Yes, your numbers are correct, though it is not as simple as dividing the total number of threads by the desired threads per block. Not all of the 524288 threads can execute simultaneously: the maximum number of resident threads on compute capability 3.x-5.x devices is 2048 per SM (10240 on a 750 Ti, for example). However, the remaining threads can still be scheduled, and they are processed once resources become available as earlier work completes.
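For reference, the arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope calculation, not code from the miner: the function names are made up for illustration, the SM count of 5 is inferred from the 10240 figure for the 750 Ti, and the resident-block estimate looks only at the 2048 threads/SM limit, ignoring register and shared-memory limits that can lower it further.

```python
# Rough launch arithmetic for the numbers discussed above.
# Function names are hypothetical; nothing here is taken from the miner source.

def blocks_needed(total_threads, threads_per_block):
    """Ceiling division: enough blocks to cover every requested thread."""
    return (total_threads + threads_per_block - 1) // threads_per_block


def resident_blocks(threads_per_block, sm_count, max_threads_per_sm=2048):
    """Upper bound on blocks resident at once, from the thread limit only.

    Assumption: register and shared-memory usage are not limiting factors.
    """
    return (max_threads_per_sm // threads_per_block) * sm_count


TOTAL_THREADS = 524288
THREADS_PER_BLOCK = 768
SM_COUNT = 5  # assumed for the 750 Ti: 10240 / 2048

grid = blocks_needed(TOTAL_THREADS, THREADS_PER_BLOCK)
print("blocks in original launch:", grid)
print("threads launched by 683 blocks:", grid * THREADS_PER_BLOCK)
print("threads launched by 480 blocks:", 480 * THREADS_PER_BLOCK)

resident = resident_blocks(THREADS_PER_BLOCK, SM_COUNT)
print("blocks resident at once (thread limit only):", resident)
print("threads resident at once:", resident * THREADS_PER_BLOCK)
```

Under these assumptions only 2 blocks of 768 threads fit on each SM at a time, so the rest of the grid simply waits its turn. Note also that 480 blocks of 768 threads cover fewer than the 524288 threads requested.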
I have the feeling it is faster because it throws away a lot of things...
This is indeed what happens when you get the "does not validate" error. The CPU tries to recreate the hash one last time before submitting it as proof, and it gets dropped if it fails validation. Work in this case is simply trashed. I have not finished instrumenting the code fully to provide exact details. What I do have is verification from pools through higher reported hashrate (calculated from rate of valid shares) and in particular a correlated increase in valid share counts.
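To make the validation step concrete, here is a minimal sketch of the idea. It is not the miner's actual code: the double SHA-256 is only a stand-in for whatever hash the algorithm really uses, and the names `stand_in_hash`, `validate_on_cpu` and `filter_candidates` are invented for this example.

```python
import hashlib
import struct

# Illustration only: double SHA-256 stands in for the real mining hash,
# and all function names here are hypothetical.

def stand_in_hash(header: bytes, nonce: int) -> int:
    """Hash header plus a 32-bit nonce and return the digest as an integer."""
    data = header + struct.pack("<I", nonce)
    digest = hashlib.sha256(hashlib.sha256(data).digest()).digest()
    return int.from_bytes(digest, "little")


def validate_on_cpu(header: bytes, nonce: int, target: int) -> bool:
    """Recompute the hash on the CPU and check it against the target."""
    return stand_in_hash(header, nonce) <= target


def filter_candidates(header: bytes, candidates, target: int):
    """Split GPU-reported nonces into those submitted and those dropped."""
    accepted, dropped = [], []
    for nonce in candidates:
        if validate_on_cpu(header, nonce, target):
            accepted.append(nonce)
        else:
            dropped.append(nonce)  # the "does not validate" case: work is trashed
    return accepted, dropped
```

A nonce reported by a kernel that computed the wrong hash ends up in `dropped` and never reaches the pool, which is why the valid share count is the number to watch rather than the raw hashrate.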
It would be interesting to have Christian's opinion on that.
Is there a way to decrease the number of threads (assuming it works)?
Agreed, I will 100% defer to Christian on this subject.
