The annoying part is that the offending region of memory is only 32-bit (4-byte) aligned.
Vector instructions need 16-byte (128-bit) alignment.
In bitcrack sp-mod #5, ~66% of the time is spent multiplying numbers. Pretty stupid algorithm. With tensor cores enabled, it might push the hashrate up a bit.
So dynamically allocating all the blasted local variables should solve it then, since dynamic allocations land on 256-bit boundaries.
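
For anyone who wants to sanity-check the alignment theory before rewriting everything, here is a rough sketch in plain host-side C, not the actual kernel code. The helper names `is_aligned` and `alloc_words_256bit` are made up for illustration, and it assumes a C11 toolchain that provides `aligned_alloc`. It only shows how to test a pointer against a boundary and how to request a 32-byte (256-bit) boundary explicitly instead of relying on whatever the allocator happens to hand back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if ptr sits on an `alignment`-byte boundary, 0 otherwise.
   `alignment` must be a nonzero power of two. */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Hypothetical helper: allocate room for `count` 32-bit words on a
   32-byte (256-bit) boundary. aligned_alloc (C11) wants the size to be
   a multiple of the alignment, so the byte count is rounded up.
   The caller releases the buffer with free(). */
static uint32_t *alloc_words_256bit(size_t count)
{
    const size_t alignment = 32;
    size_t bytes = count * sizeof(uint32_t);
    size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    if (rounded == 0) {
        rounded = alignment; /* avoid a zero-sized request */
    }
    return (uint32_t *)aligned_alloc(alignment, rounded);
}
```

A buffer from `alloc_words_256bit` passes `is_aligned(p, 16)` as well, since any 32-byte boundary is also a 16-byte boundary. A pointer 4 bytes past it is only 32-bit aligned, which is the situation the vector instructions choke on.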