I discovered that the real culprit is the Link Time Optimization (the -dlto switch) that nvcc applies to the object code. It interferes with the CUDA API functions, and it is also the reason linking takes so long. This feature is turned off when building with the debugging switch -G. Unfortunately, there is no switch that turns it off, but that shouldn't be too bad, because I can successfully reproduce the bug with -b 1 -t 32 -p 1 (the thread count absolutely has to be a multiple of 32, or else the program will complain).
Isn't this the same thing as with Legacy Mode execution? That would at least explain why legacy was running (just horribly slow).