...
I want to understand, because I don't have this problem when I run the same kernels on my test platform with pyCuda.
Mate, if I knew precisely why these cards were jacking out, I wouldn't have taken so many months to roll out my own patch. The truth is, the best I know about this problem is that the cicc compiler from CUDA shuffles function PTX definitions around as part of its "optimization", and somewhere along the line it places some piece of a function on a 32-bit boundary (apparently the RTX cards no longer support 32-bit boundaries).
That's why completely disabling cicc optimization fixes the problem, albeit at the cost of killing Bitcrack's performance. That is how my aforementioned patch works, but clearly richie's is superior to mine in terms of speed.