Edit:
Attempts to log, gather more info or whatever, will slow down the program just enough that it *works*.

Prob. the reason it works on older hardware. Its really the speed killing it.
Maybe the SMs in the GPU have so many threads per block running that they don't have enough time to align all the pointers for Bitcrack. The GPU does a lot of its own things at runtime. I never saw bitcrack do any alignment in the cracking loop, but neither did I check the parts responsible for initialization.
Could the initialization stage be launching kernels with invalid block sizes in a grid? Maybe there's a way to make it launch a 1x1 sized grid so we can see what happens.