I discovered that the culprit is actually the Link Time Optimization (the -dlto switch) that nvcc applies to the object code: it's what's messing with the CUDA API functions, and it's also the reason linking takes so long. This feature is turned off when the debugging switch -G is used. Unfortunately there is no switch that turns it off on its own, but it shouldn't be too bad, because I can successfully reproduce the bug with -b 1 -t 32 -p 1 (the thread count absolutely has to be a multiple of 32 or else the program will complain).
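For reference, those flags boil down to launching a single warp. Here's a toy stand-in (my own dummy kernel, not the actual Bitcrack one), assuming -b is blocks, -t is threads per block and -p is points per thread:

```cuda
// Toy stand-in for the launch geometry that "-b 1 -t 32 -p 1" implies:
// one block of 32 threads, i.e. exactly one warp.
#include <cuda_runtime.h>

__global__ void dummyKernel(unsigned int *out, unsigned int pointsPerThread)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = idx * pointsPerThread;   // placeholder work
}

int main()
{
    const unsigned int blocks  = 1;     // -b 1
    const unsigned int threads = 32;    // -t 32: one warp, which is why the
                                        // count has to be a multiple of 32
    unsigned int *devOut = nullptr;
    cudaMalloc(&devOut, blocks * threads * sizeof(unsigned int));

    dummyKernel<<<blocks, threads>>>(devOut, 1 /* -p 1 */);
    cudaDeviceSynchronize();

    cudaFree(devOut);
    return 0;
}
```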
Isn't this the same thing as Legacy Mode execution? That would at least explain why legacy was running (just horribly slow).
Legacy mode is really just no-frills CUDA compilation, without fancy optimizations happening behind your back.
The optimizations are what's killing the program, and that's why the bug can only be reproduced in the PTX and not in the source code. Think about it: even a T4 running Bitcrack compiled for compute 5.2 is crashing. That means something has changed in the way newer GPUs get their code compiled, and it's producing bad PTX.
The annoying part is that the offending region of memory is 32-bit aligned, and there are unsigned int arrays everywhere (rough sketch of what I mean below). Also, today's debugging efforts have come to nothing, so I think I'll try arming BSGS/Kangaroo/VanitySearch with Bitcrack-like searches instead.
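To be clear about the alignment point: every element of an unsigned int array already sits on a 4-byte boundary, so even a wrong index produces a perfectly aligned address, and the faulting address alone doesn't point at the buggy line. A hypothetical sketch (made-up kernel and names, not the actual Bitcrack code):

```cuda
// Hypothetical sketch, not the real Bitcrack kernel: the data is all
// unsigned int arrays, so every element lands on a 4-byte boundary.
__global__ void exampleKernel(unsigned int *words, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n) {
        // If the bounds check above were ever wrong, the resulting
        // out-of-range access would still be 32-bit aligned, so it would
        // never show up as a misaligned-address error; the load would
        // just read (or the store clobber) the wrong word.
        unsigned int w = words[idx];
        words[idx] = w ^ 0x9E3779B9u;   // arbitrary placeholder work
    }
}
```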