Hi Christian, me again. Dunno if I'm barking up the wrong tree here, but if I reduce the number of CUDA threads, i.e.:
'case 16: fermi_scrypt_core_kernelA<16><<< grid, threads, 0, stream >>>(d_idata); break;'
Say I set threads to 256 (<512). There's a massive increase in speed... but quite a few errors.
Why the errors?
You're basically asking: if I break the program, there are errors. Why are there errors?
Short answer: because you broke it.
Long answer: because with 256 threads you only compute half the requested results. The other half is never computed, which is also where the apparent speedup comes from.
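To make the "half the results" point concrete, here is a rough CPU-side sketch, not the actual kernel. It assumes the simplest possible layout: the grid size stays fixed, and each thread fills exactly one output slot at index blockIdx.x * blockDim.x + threadIdx.x. The function name simulate_launch and the numbers used with it are made up for illustration; the real fermi_scrypt_core_kernelA may map threads to work differently.

```cpp
#include <cstddef>
#include <vector>

// CPU stand-in for a kernel launch (hypothetical helper, not part of the miner).
// Assumption: every simulated thread writes exactly one output slot.
// Returns how many of the `total` requested slots were actually written.
std::size_t simulate_launch(std::size_t blocks,
                            std::size_t threads_per_block,
                            std::size_t total)
{
    std::vector<bool> done(total, false);

    for (std::size_t b = 0; b < blocks; ++b) {
        for (std::size_t t = 0; t < threads_per_block; ++t) {
            // Same index a CUDA thread would compute as
            // blockIdx.x * blockDim.x + threadIdx.x.
            std::size_t idx = b * threads_per_block + t;
            if (idx < total) {
                done[idx] = true;
            }
        }
    }

    // Count the slots that some thread actually filled in.
    std::size_t written = 0;
    for (bool d : done) {
        if (d) {
            ++written;
        }
    }
    return written;
}
```

Under those assumptions, simulate_launch(8, 512, 4096) covers all 4096 slots, while simulate_launch(8, 256, 4096) covers only 2048. The host code still expects 4096 results, so whatever sits in the untouched half gets checked as if it were real output, and that shows up as errors.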