Hang on, I'm zeroing in on the error. It appears to be returned from a CUDA API function, so that's some good news.

This issue is also reproducible on RTX 20 cards.
I'm currently busy wrapping all the API calls with printfs that fire when an error check is triggered, to see which call it is.
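
For reference, the printf wrapping looks roughly like the sketch below. The macro name CUDA_TRACE and dummy_kernel are made up for illustration and are not the project's actual code; only the CUDA runtime calls (cudaMalloc, cudaGetLastError, cudaDeviceSynchronize, cudaFree, cudaGetErrorString) are real.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the project): evaluates a CUDA runtime
// call once and, if it did not return cudaSuccess, prints which call failed.
#define CUDA_TRACE(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "%s:%d: %s -> %s\n", __FILE__, __LINE__,       \
                    #call, cudaGetErrorString(err_));                      \
        }                                                                  \
    } while (0)

// Placeholder kernel standing in for the real one.
__global__ void dummy_kernel(int *out) {
    out[threadIdx.x] = threadIdx.x;
}

int main() {
    int *d_out = nullptr;
    CUDA_TRACE(cudaMalloc(&d_out, 32 * sizeof(int)));

    dummy_kernel<<<1, 32>>>(d_out);
    CUDA_TRACE(cudaGetLastError());       // launch or configuration errors
    CUDA_TRACE(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_TRACE(cudaFree(d_out));
    return 0;
}
```

Printing to stderr means the message should still show up if the program dies right afterwards. Kernel launches return nothing themselves, which is why the sketch checks cudaGetLastError and cudaDeviceSynchronize after the launch.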
Cool! Hopefully your printfs don't slow it down so much that it starts working. That was our biggest pain in the .........
I've been checking our Discord, but there is really nothing there that you didn't already know or run into.
I discovered that it's actually the link-time optimization (the -dlto switch) that nvcc applies to the object code, which is also the reason why linking takes so long. This feature is turned off when the debugging switch -G is used. Unfortunately there is no switch that turns it off, but it shouldn't be too bad, because I can successfully reproduce the bug with -b 1 -t 32 -p 1 (the thread count absolutely has to be a multiple of 32 or else the program will complain).
I did not find a way to turn off